Intro to statistics with R - Three Measures of Central Tendency

**Understanding Measures of Central Tendency: A Guide to Mean, Median, and Mode**

In statistics, a measure of central tendency is a summary statistic that describes the middle point of a distribution with a single number, known as a point estimate. No single number can capture a whole distribution, but a good measure should be representative of it. The three most common measures of central tendency are the mean, the median, and the mode. This article looks at each one in turn: how it is defined, when it works well, and where it falls short.

**The Mean: A Good Measure for Normal Distributions**

The mean is the average value of a dataset, calculated by summing all the values and dividing by the number of observations. It is by far the most commonly used measure of central tendency, and it works best when the data are approximately normally distributed. In that case the mean, median, and mode are close to one another, because the data are symmetric around the center. If the data are skewed or contain extreme values, however, those few values can pull the mean away from the bulk of the data.
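As a minimal sketch of that calculation in R, the snippet below computes the mean by hand and with the built-in `mean()`. The `ratings` values are made up for illustration and are not the course's wine data.

```r
# Hypothetical ratings on a 0-100 scale (illustrative values only)
ratings <- c(68, 70, 71, 72, 72, 74, 77)

# Mean by hand: sum of the scores divided by the number of scores
manual_mean <- sum(ratings) / length(ratings)

# Built-in equivalent
builtin_mean <- mean(ratings)

print(manual_mean)   # 72
print(builtin_mean)  # 72
```

Both approaches give the same answer; `mean()` is simply the idiomatic way to write it.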

For example, the red wine ratings used in this course, or a grade point average, are typically summarized with the mean. Household income in the United States, by contrast, has an extreme positive skew, so the mean is inflated by a small number of very high incomes.

**The Median: A Better Option for Skewed Distributions**

When a distribution is skewed, particularly by extreme values at one end, the median is often a better choice. The median is the middle value of the dataset when it is arranged in order (with an even number of values, it is the average of the two middle values). Because it depends only on the ordering of the values, it is far less affected by outliers than the mean.
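The robustness of the median can be seen in a small R sketch. The income figures below (in thousands of dollars) are invented for illustration; the point is only to compare how much each statistic moves when one extreme value is added.

```r
# Hypothetical household incomes in thousands of dollars (illustrative only)
incomes <- c(18, 25, 32, 41, 49, 58, 75)

# Add a single extremely high income
with_outlier <- c(incomes, 5000)

# The median barely moves ...
print(median(incomes))       # 41
print(median(with_outlier))  # 45

# ... while the mean is pulled far toward the extreme value
print(mean(incomes))         # about 42.57
print(mean(with_outlier))    # 662.25
```

One added value shifts the median by 4 and the mean by more than 600, which is the sense in which extreme scores bias the mean.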

In the case of household income in the United States, for instance, the median is well below the mean, because a small number of extremely high incomes inflate the mean but barely move the median. This is why news reports usually quote median rather than mean household income.

**The Mode: A Measure of Central Tendency for Nominal Variables**

The mode is the score that occurs most frequently in the dataset; on a histogram, it is the peak. It is the only one of the three measures that can be applied to nominal variables, that is, categorical or qualitative data, because category labels cannot be summed or put in order.

Baby names are a good example: the mode identifies the most common name in each country. According to the lecturer's source (whose accuracy he was unsure of), the most common female baby name in the past year was Sophia in the United States and Emma in France. No mean or median could be computed for data like these.
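Base R has no built-in function for the statistical mode (its `mode()` function reports an object's storage type instead), so one common approach is to tabulate the values and pick the most frequent. The sketch below does this with a hypothetical helper, `stat_mode`, and a made-up sample of names; neither comes from the course materials.

```r
# Hypothetical helper: returns the most frequent value(s) as character strings.
# If several values tie for the highest count, all of them are returned.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

# Made-up sample of a nominal variable (illustrative only)
baby_names <- c("Sophia", "Emma", "Sophia", "Olivia", "Emma", "Sophia")

print(table(baby_names))      # frequency of each name
print(stat_mode(baby_names))  # "Sophia"
```

Because `table()` works on labels, the same helper applies to numeric scores as well, though it then returns the modal value as text.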

**A Comparison of Measures of Central Tendency**

To illustrate the differences between mean, median, and mode, consider a histogram of household income in the United States. The distribution is skewed to the right, with extreme values on the positive end, so the mean is inflated by a few very high incomes.

The median, about $49,444 in the lecture's graphic, is much less affected by these outliers, and the mode is lower still, roughly between $15,000 and $20,000. When the three measures diverge this much, no single number represents the whole group well, which is why the choice of measure has to fit the data. For skewed distributions, the median is usually a better option than the mean.
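The same ordering of the three measures can be reproduced with simulated data. The sketch below draws right-skewed values from a log-normal distribution, which is an assumption chosen for illustration and not the actual income data, then estimates the mode crudely as the midpoint of the tallest histogram bin.

```r
# Simulated right-skewed "incomes" (log-normal is an illustrative assumption,
# not a model fitted to real household income data)
set.seed(42)
income <- rlnorm(10000, meanlog = log(50000), sdlog = 0.8)

mean_income   <- mean(income)
median_income <- median(income)

# Crude mode estimate for continuous data: midpoint of the tallest histogram bin
h <- hist(income, breaks = 100, plot = FALSE)
modal_bin_mid <- h$mids[which.max(h$counts)]

print(c(mode = modal_bin_mid, median = median_income, mean = mean_income))
```

With a positive skew like this one, the printed values should come out in the order mode, median, mean, from lowest to highest. Calling `hist(income, breaks = 100)` without `plot = FALSE` draws the histogram itself.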

In conclusion, understanding measures of central tendency is essential in statistics. While the mean provides a good representation of normal distributions, the median is often a better choice when dealing with skewed distributions. The mode, on the other hand, is useful for nominal variables and can provide valuable insights into patterns and trends. By choosing the right measure of central tendency, we can gain a deeper understanding of our data and make more informed decisions.

So, to be clear, what do I mean by a measure of central tendency? It's just a measure, or a summary statistic, that describes the middle point, or is representative of the middle point, of a distribution. So it should be representative of the distribution as a whole. Of course, we can't describe the whole distribution with one number; this is just what statisticians call a point estimate. But it should do a pretty good job; it should be representative.

The most common examples of measures of central tendency in statistics are the mean, the median, and the mode. The mean, we saw, is just: sum up all the scores and divide by the number of scores. It's the mean, or the average. The median is just the middle score in a distribution. So if I lined up all of the ratings on the wines from the highest ranking to the lowest ranking and plucked out the middle ranking, say the 50th ranking out of the 100, that would be my median ranking. And the mode is easy: that's just the score that occurs most often in the distribution. Now that you've seen histograms and know how to plot histograms, it's real easy to see: just look for the peak of the histogram. That's the mode.

The mean, or the average, is by far the most common measure of central tendency used in statistics. It's the one we're going to rely on the most in this course; in any intro stats course, that's the one you're going to rely on the most. So, for example, the red wine ratings; another classic example is just your grade point average. The mean is the best when you have a normal distribution. If you don't have a normal distribution, in particular if you have a distribution with really extreme scores, so an extreme positive skew or an extreme negative skew, then you might rather go with the median than the mean. That's because those few extreme scores can really bias the mean, whereas if you just line the scores up and take the middle one, those extreme scores are not going to bias the median as much as they're going to bias the mean.

We can look at the white wine ratings as an example of that, but those weren't that skewed, so you won't see a big difference between the mean and the median; you'll see a slight difference. Another example, a classic example right now in the US and lots of modern countries, is household income. Household income, particularly in the US right now, particularly in Brazil right now, is really not normally distributed. There's a real positive skew in the income distributions in the US and in other countries as well, and we'll take a look at that as well. You'll often see reported in the news, or if you're reading stuff on the internet, median salary or median household income rather than mean salary, average salary, or average household income, and that's because those distributions are skewed.

Here's a more detailed graphic; again, this one I just took off the internet. This is a histogram of household income in the United States. It shows this really extremely positively skewed distribution. Here you're going to see a large difference between measures of central tendency like the mean, the median, and the mode. If I just took the average of all of these incomes, we're going to have to average in these extremely high incomes way up here. And this isn't even that high; the graph actually extends way beyond that to include people like Bill Gates and Mark Zuckerberg and so on. If I include their salaries in the mean, the mean is really going to be inflated. In contrast, if I just take the median, the middle score, that's going to be right here. No matter whether I include Mark Zuckerberg, Bill Gates, and people like that in my distribution or not, the median is still going to fall right here, at about $49,444. And if you look at this, the mode is actually right about here, way down between, say, 15 and 20,000.

So when you have an extremely skewed distribution like this, like US household income right now, over the last few years in particular, you're going to get a real difference between the mean, the median, and the mode, because it's hard to find a measure of central tendency that's representative of the entire group. That's part of the problem. I'm not going to get into a political discussion about this; it's just a good example to illustrate a non-normal distribution and how measures of central tendency can really differ when you have non-normal distributions. If it was perfectly normal, then the mean, the median, and the mode would be just about the same.

So finally, the mode is just the peak of a histogram; it's just the score that occurs most often. The Argentine white wine distribution was a real easy one to look at for the peak, remember, because it was leptokurtic. It had that really strong peak; if you look back at the graph, it was right around 70 to 72. It was also about the average, because it didn't have really extreme rankings on the negative end or on the positive end.

A thing about the mode is that it can also be applied to nominal variables. That's not true of the mean or the median. So what I did for fun is I took these countries that we've been playing with, the US, France, Argentina, and Australia, and asked: in the last year, what are the most common, or the modal, baby names? These really surprised me, and I don't know if my source is accurate. But in the US, the most common female baby name in the last year is Sophia, and the most common male name is James. In France it's Emma and Nathan, in Argentina it's Sofia and Juan, and in Australia it's Charlotte and Oliver. Those are the modal baby names; they're the ones that occur most often. It's only the mode that we can apply to nominal
variables; we can't apply the average, for example, to nominal variables, or the median to nominal variables.