Python Tutorial - Bootstrap confidence intervals

The Importance of Exploratory Data Analysis (EDA) in Quantifying Active Bout Lengths of Zebrafish

Exploratory data analysis (EDA) is an essential step in understanding and visualizing datasets. As John Tukey said, "Exploratory data analysis can never be the whole story, but nothing else can serve as a foundation stone as the first step." Throughout data science work in general, it is worth heeding Tukey's advice and starting with EDA before moving on to more advanced analyses.

In the zebrafish experiments, graphical EDA was used to investigate active bouts. With that foundation in place, we can start progressing toward the whole story: estimating the parameter that describes active bout lengths and quantifying how confident we are in that estimate.

Active Bouts are Roughly Exponentially Distributed

One of the key findings from the previous exercises is that active bout lengths are roughly exponentially distributed. The exponential distribution has a single parameter, which describes the characteristic time between arrivals of a Poisson process. The value of the parameter that best describes the data is computed from the mean of all the active bout lengths, so the mean computed from the data is the optimal parameter value.
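The link between the mean and the optimal parameter can be sketched in a few lines. For an exponential distribution with density f(t) = (1/tau) exp(-t/tau), the maximum likelihood estimate of tau is the sample mean. The snippet below is a minimal illustration under stated assumptions: the data are synthetic draws from an exponential distribution, not the zebrafish measurements, and the name estimate_tau is hypothetical.

```python
import numpy as np


def estimate_tau(bout_lengths):
    """Return the sample mean, which is the maximum likelihood
    estimate of the exponential distribution's mean parameter."""
    return np.mean(bout_lengths)


# Synthetic stand-in data (assumption: NOT the real bout lengths).
rng = np.random.default_rng(seed=0)
true_tau = 2.0
synthetic_bouts = rng.exponential(scale=true_tau, size=10_000)

tau_hat = estimate_tau(synthetic_bouts)
print(f"true tau: {true_tau}, estimated tau: {tau_hat:.3f}")
```

With this many synthetic samples, the estimate should land close to the value used to generate the data.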

Using the Nuclear Incident Data

To illustrate how this works in practice, let's examine the nuclear incident data. We can use NumPy's np.mean function to compute the mean of all inter-incident times, which is 87 days, indicated by the vertical gray line on the plot. But how confident are we in this value? If we could somehow measure a collection of inter-incident times again, what would we get for the mean?

Simulating with Bootstrap Sampling

To simulate this, we can draw a bootstrap sample: we resample the data with replacement using NumPy's np.random.choice function. We can then plot the ECDF of the resampled data along with the mean inter-incident time computed from the resampled data set. The value we get is slightly different from the one computed from the original data.
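A single bootstrap sample can be sketched as follows. This is an illustration only: the inter-incident times are synthetic values drawn from an exponential distribution (the real nuclear incident data are not reproduced here), the ecdf helper is a hypothetical name, and the newer rng.choice generator method is used in place of the legacy np.random.choice call, which samples with replacement in the same way.

```python
import numpy as np


def ecdf(data):
    """Return x and y values for plotting an empirical CDF."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Synthetic stand-in for the inter-incident times (assumption).
rng = np.random.default_rng(seed=42)
inter_times = rng.exponential(scale=87.0, size=100)

# One bootstrap sample: same size as the data, drawn WITH replacement
# (replace=True is the default for choice).
bs_sample = rng.choice(inter_times, size=len(inter_times))

# ECDF of the resampled data, ready to plot with e.g. matplotlib.
x_bs, y_bs = ecdf(bs_sample)

print("mean of original data:   ", np.mean(inter_times))
print("mean of bootstrap sample:", np.mean(bs_sample))
```

Every value in the bootstrap sample comes from the original data, but some values appear more than once and others not at all, which is why the two means differ slightly.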

Each Value of the Mean Inter-incident Time is a Bootstrap Replicate

We can repeat this procedure again and again. Each value of the mean inter-incident time obtained this way is a bootstrap replicate. A bootstrap replicate is a statistic computed from a resampled data set; in this case, that statistic is the mean. The dc_stat_think module has a function to compute bootstrap replicates from a data set; for example, it can be used to draw 10,000 replicates of the mean.
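The repeated resampling can be written as a small helper. The function below is a hypothetical stand-in written for illustration; it is not presented as the dc_stat_think function or its signature. It resamples with replacement, applies a statistic, and repeats.

```python
import numpy as np


def draw_bootstrap_replicates(data, func, size=1, rng=None):
    """Draw `size` bootstrap replicates of the statistic `func`.

    Hypothetical helper for illustration; not the dc_stat_think API.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    replicates = np.empty(size)
    for i in range(size):
        # Resample with replacement, then compute the statistic.
        bs_sample = rng.choice(data, size=len(data))
        replicates[i] = func(bs_sample)
    return replicates


# Usage on synthetic inter-incident times (assumption: not real data).
rng = np.random.default_rng(seed=0)
inter_times = rng.exponential(scale=87.0, size=100)
bs_reps = draw_bootstrap_replicates(inter_times, np.mean, size=10_000, rng=rng)
print(bs_reps[:5])
```

Passing the statistic as a function keeps the helper general: swapping np.mean for np.median or np.std gives replicates of a different statistic with no other changes.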

Looking at the Plot of Replicates

When we look at the plot of the replicates, shown by the vertical gray lines, we see that they lie somewhere between about 70 and 100 days. This is roughly the bootstrap confidence interval of the mean inter-incident time. Generally, a p percent confidence interval can be defined as follows: if we repeated the measurements over and over again, p percent of the observed values would lie within the p percent confidence interval.

Computing the Confidence Interval

Because the bootstrap replicates simulate repeating the measurements over and over again, we can compute the confidence interval simply by taking percentiles of the bootstrap replicates. For the 95% confidence interval, we compute the 2.5th and 97.5th percentiles using NumPy's np.percentile function. The first argument is an array containing the bootstrap replicates, and the second is a list or tuple with the desired percentiles.
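The percentile step can be sketched end to end. The example assumes synthetic inter-incident times, so the printed interval will not match the values reported for the real data. The replicates are drawn in one vectorized step by resampling indices, which is one of several equivalent ways to bootstrap.

```python
import numpy as np

# Synthetic stand-in for the inter-incident times (assumption).
rng = np.random.default_rng(seed=0)
data = rng.exponential(scale=87.0, size=100)

# Draw 10,000 bootstrap samples at once by resampling indices with
# replacement; each row of `idx` defines one bootstrap sample.
n_reps = 10_000
idx = rng.integers(0, len(data), size=(n_reps, len(data)))
bs_reps = data[idx].mean(axis=1)

# 95% confidence interval: 2.5th and 97.5th percentiles of the replicates.
ci_low, ci_high = np.percentile(bs_reps, [2.5, 97.5])
print(f"mean: {np.mean(data):.1f} days")
print(f"95% confidence interval: [{ci_low:.1f}, {ci_high:.1f}] days")
```

Other confidence levels follow the same pattern; a 99% interval, for instance, would use the percentiles [0.5, 99.5].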

The Resulting Confidence Interval

The 95% confidence interval spans from 73 to 102 days. Now that we are familiar with computing optimal parameters and obtaining bootstrap confidence intervals, we can quantify active bout lengths of wild-type and mutant fish.

Conclusion

In conclusion, EDA is the foundation for understanding a dataset, but it is not the whole story. For exponentially distributed data, such as the zebrafish active bout lengths or the inter-incident times in the nuclear incident data, the optimal parameter is the mean of the data. Bootstrap resampling simulates repeating the measurements, and the percentiles of the resulting replicates give a confidence interval that quantifies how confident we are in that mean.

You now have used graphical exploratory data analysis, or EDA, to investigate the active bouts of the zebrafish. I remind you of one of my favorite quotes from John Tukey: "Exploratory data analysis can never be the whole story, but nothing else can serve as a foundation stone as the first step." In this course, and throughout your data science endeavors in general, it is important to heed Tukey's advice and start with EDA. Now that you have done some EDA, let's start progressing toward the whole story.

You saw in the previous exercises that the active bout lengths are roughly exponentially distributed. The exponential distribution has a single parameter that describes the characteristic time between arrivals of a Poisson process. The value of the parameter that best describes the data is computed from the mean of all the active bout lengths. Thus, the mean computed from the data is the optimal parameter value.

Let's look at how this is done with the nuclear incident data. We can use the np.mean function to compute the mean of all inter-incident times, which is 87 days, indicated by the vertical gray line on the plot. But how confident are we in this value? What if we could somehow measure a collection of inter-incident times again? What would we get for the mean? We can simulate this by drawing a bootstrap sample. Specifically, we resample the data with replacement using the np.random.choice function. We can plot the ECDF of the resampled data along with the mean inter-incident time computed from this resampled data set. We get a slightly different value than we got from the original data. We can do this procedure again and again and again.

Each value of the mean inter-incident time is a bootstrap replicate. A bootstrap replicate is a statistic computed from a resampled data set; in this case, that statistic is the mean. The dc_stat_think module has a function to compute bootstrap replicates from a data set. For example, you can use it to draw 10,000 replicates of the mean from a data set. In looking at the plot of the replicates, shown by the vertical gray lines, we see that the replicates lie somewhere between about 70 and 100 days. This is roughly the bootstrap confidence interval of the mean inter-incident time.

Generally, a p percent confidence interval can be defined as follows: if we repeated measurements over and over again, p percent of the observed values would lie within the p percent confidence interval. Because the bootstrap replicates are simulating measurements over and over again, we can simply take the percentiles of the bootstrap replicates to compute the confidence interval. For the 95% confidence interval, we compute the 2.5th and 97.5th percentiles. We can do that using NumPy's np.percentile function. The first argument is an array containing the bootstrap replicates, and the second is a list or tuple with the desired percentiles. We get that the 95% confidence interval spans from 73 to 102 days.

Now that you are familiar with computing optimal parameters and obtaining bootstrap confidence intervals, you can quantify active bout lengths of wild-type and mutant fish.