Spreadsheets Tutorial - How far from average

# Understanding Data Distribution: Variance, Standard Deviation, Percentiles, and Quartiles

## Introduction

In this guide, we will explore how to measure a data point's distance from the average, focusing on key statistical concepts such as variance, standard deviation, percentiles, and quartiles. These metrics are essential for understanding how data is spread and distributed.

---

## Understanding Variance

Variance measures how dispersed a dataset is from its mean. A smaller variance indicates that the data points are closely clustered around the mean, while a larger variance suggests greater dispersion among the data points.

### Example: Calculating Variance

Consider three columns of data:

- **Column A**: [10, 10, 10, 10] (variance = 0)

- **Column B**: [10, 14, 10, 10] (variance = 3)

- **Column C**: [10, 14, 10, 100] (variance is the highest among the three)

To calculate variance:

1. Calculate the mean:

\[

\text{Mean} = \frac{10 + 14 + 10 + 10}{4} = 11

\]

2. Subtract the mean from each value:

- For the first, third, and fourth values: \(10 - 11 = -1\)

- For the second value: \(14 - 11 = 3\)

3. Square all these differences:

- \((-1)^2 = 1\)

- \(3^2 = 9\)

4. Take another average of the squared differences:

\[

\text{Variance} = \frac{1 + 9 + 1 + 1}{4} = \frac{12}{4} = 3

\]

---

## Using Formulas for Variance and Standard Deviation

Thankfully, there are formulas to simplify these calculations:

- Use `VAR.P` to calculate variance. For example:

\[

\text{Variance} = \text{VAR.P}(10, 14, 10, 10) = 3

\]

- To make the results more intuitive (since variance is in squared units), take the square root of the variance to get the standard deviation. Use `STDEV.P` for this purpose:

\[

\text{Standard Deviation} = \sqrt{\text{Variance}} \quad \text{or} \quad \text{STDEV.P}(10, 14, 10, 10) = 1.73

\]

---

## Standard Scores and Their Interpretation

A standard score (z-score) shows how many standard deviations a data point is away from the mean.

- **Mean**: 11

- **Standard Deviation**: 1.73

For example, consider a new data point: 12.73. To find its standard score:

\[

\text{Standard Score} = \frac{(12.73 - 11)}{1.73} = 1

\]

This means the new data point is exactly one standard deviation away from the mean.

---

## Exploring Percentiles and Quartiles

Percentiles help you understand where a particular value stands in a dataset by showing the percentage of values below that point.

### Example: A Histogram with 1,000,000 Values

- **Blue Line (Average = 0)**: This is the 50th percentile because half the points are less than zero and half are greater.

- **Red Line (-0.67)**: This represents the first quartile (25th percentile), meaning 25% of the data is less than or equal to -0.67.

- **Green Line (0.67)**: This represents the third quartile (75th percentile), meaning 75% of the data is less than or equal to 0.67.

### Calculating Quartiles in Sheets

Use the `QUARTILE` function:

\[

\text{First Quartile} = \text{QUARTILE}(array, 1) \\

\text{Second Quartile (Median)} = \text{QUARTILE}(array, 2) \\

\text{Third Quartile} = \text{QUARTILE}(array, 3) \\

\text{Fourth Quartile} = \text{QUARTILE}(array, 4)

\]

For example:

- **First Quartile**: 234

- **Second Quartile (Median)**: 456

- **Third Quartile**: 567

- **Fourth Quartile**: 789

---

## Conclusion

Understanding variance, standard deviation, percentiles, and quartiles is crucial for analyzing data distributions. These metrics help you make sense of how data points are spread and positioned relative to each other.

By practicing these concepts with exercises based on U.S. train ridership data, you can gain hands-on experience in exploring how ridership varies over time.

"WEBVTTKind: captionsLanguage: enlet's now learn how to measure a data points distance from the average the exercises following this video will explore us train ridership to understand how it varies over time so jump aboard the stats train variance measures how dispersed a dataset is from its mean the smaller the variance the less spread the data is conversely large differences between data points increase the variance column a repeats with no variation its variance is 0 in column B 1 value 14 is different yet close to the others its variance is 3 column C has an outlier 100 as a result its variance is the highest among the three to calculate variance first calculate the mean 10 14 10 and 10 divided by 4 equals 11 next subtract the mean from each value for the first third and fourth values 10 minus 11 is negative 1 for the second value 14 minus 11 leaves 3 easy huh in the third step square all these differences from the average negative one squares to one and three squared equals nine finally take another average of the squared differences one plus nine plus one plus one equals twelve divided by four equals three that was easy but a bit cumbersome thankfully there is a formula to calculate variance simply call the ARP with an array as shown in this example in which I calculate the variance for all three columns next stop standard deviation keep in mind variance is the average of squared values thus the variance is different from the original sample values making it less intuitive most often you will need to make sense of the variation by putting it in the scale of the original data this is done by taking the square root of the variance called standard deviation after taking the variance with the VAR p use SQ RT square root to calculate the standard deviation more easily you can pass an array into stdev P to get the same answer here 1.73 standard scores show you how a data point relates to the distribution our previous population mean was 11 and standard deviation was one point seven three now we have a new data point twelve point seven three subtracting the standard deviation twelve point seven three minus one point seven three you get back to the mean of eleven thus this new data point is exactly one standard deviation away from the mean another statistic for understanding a distribution is a percentile ordering a distribution and calculating the percentage of values below a specific point will tell you its percentile this histogram visualizes 1 million values the blue line averaged at zero is the 50th percentile because it splits the data evenly half the points are less than zero and half are greater quartiles are percentiles that segment the data into four chunks the red line at negative 0.67 demonstrates 25% of the data is less than or to the left of negative 0.67 another 25% of the data is greater than negative 0.67 but less than the blue average zero line the next 25% chunk of the data is greater than zero to the right of the blue line but less than the Green Line at 0.67 finally the remaining 25% of the data points are greater than 0.67 to the right of the green line to get the popular percentiles in sheets use the quartile function accepting an array then a number one through four to specify the quartile as you can see here the first quartile is 234 the second is 456 the third is 567 and the fourth is 789 excellent progress now let'slet's now learn how to measure a data points distance from the average the exercises following this video will explore us train ridership to understand how it varies over time so jump aboard the stats train variance measures how dispersed a dataset is from its mean the smaller the variance the less spread the data is conversely large differences between data points increase the variance column a repeats with no variation its variance is 0 in column B 1 value 14 is different yet close to the others its variance is 3 column C has an outlier 100 as a result its variance is the highest among the three to calculate variance first calculate the mean 10 14 10 and 10 divided by 4 equals 11 next subtract the mean from each value for the first third and fourth values 10 minus 11 is negative 1 for the second value 14 minus 11 leaves 3 easy huh in the third step square all these differences from the average negative one squares to one and three squared equals nine finally take another average of the squared differences one plus nine plus one plus one equals twelve divided by four equals three that was easy but a bit cumbersome thankfully there is a formula to calculate variance simply call the ARP with an array as shown in this example in which I calculate the variance for all three columns next stop standard deviation keep in mind variance is the average of squared values thus the variance is different from the original sample values making it less intuitive most often you will need to make sense of the variation by putting it in the scale of the original data this is done by taking the square root of the variance called standard deviation after taking the variance with the VAR p use SQ RT square root to calculate the standard deviation more easily you can pass an array into stdev P to get the same answer here 1.73 standard scores show you how a data point relates to the distribution our previous population mean was 11 and standard deviation was one point seven three now we have a new data point twelve point seven three subtracting the standard deviation twelve point seven three minus one point seven three you get back to the mean of eleven thus this new data point is exactly one standard deviation away from the mean another statistic for understanding a distribution is a percentile ordering a distribution and calculating the percentage of values below a specific point will tell you its percentile this histogram visualizes 1 million values the blue line averaged at zero is the 50th percentile because it splits the data evenly half the points are less than zero and half are greater quartiles are percentiles that segment the data into four chunks the red line at negative 0.67 demonstrates 25% of the data is less than or to the left of negative 0.67 another 25% of the data is greater than negative 0.67 but less than the blue average zero line the next 25% chunk of the data is greater than zero to the right of the blue line but less than the Green Line at 0.67 finally the remaining 25% of the data points are greater than 0.67 to the right of the green line to get the popular percentiles in sheets use the quartile function accepting an array then a number one through four to specify the quartile as you can see here the first quartile is 234 the second is 456 the third is 567 and the fourth is 789 excellent progress now let's\n"