Python Tutorial - Introduction to Exploratory Data Analysis

The Power of Exploratory Data Analysis: Understanding Yogi Berra's Wisdom

Yogi Berra, the legendary baseball player and manager, once said, "You can observe a lot by watching." The same is true of data: if you display your data appropriately, you can already start to draw conclusions from it. Exploring your data is therefore a crucial step in any analysis. This idea is known as exploratory data analysis (EDA), and it was developed by John Tukey, one of the greatest statisticians of all time. In his 1977 book "Exploratory Data Analysis", Tukey laid out its principles.

EDA involves organizing, plotting, and computing numerical summaries about your data. By doing so, you can gain insights into the patterns and structures within your data, which can inform your analysis and lead to new discoveries. As Tukey aptly put it, "Exploratory Data Analysis can never be the whole story, but nothing else can serve as the foundation stone." This means that EDA is a critical component of the analytical process, but it should not be seen as the final destination.
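As a rough illustration of "organizing" and "computing numerical summaries", here is a minimal sketch using pandas. The numbers are made up purely to show the calls, and the variable names are hypothetical; plotting is left out of this sketch.

```python
import pandas as pd

# Made-up measurements, used only to illustrate the calls.
values = pd.Series([10.0, 4.0, 2.0, 5.0, 4.0])

# Organize: put the observations in order.
ordered = values.sort_values().reset_index(drop=True)

# Summarize: count, mean, standard deviation, min, quartiles, and max.
summary = values.describe()

print(ordered.tolist())
print(summary)
```

Even on five values, sorting and summarizing make the one unusually large observation easier to spot than it is in the raw list.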

To illustrate the importance of EDA, let's consider an example. We have a dataset, acquired from data.gov, containing the 2008 election results at the county level in three major swing states: Pennsylvania, Ohio, and Florida. These are the states that largely decide recent elections in the US. Opened in a text editor, the file is hard to read; it looks a little prettier in a pandas DataFrame. Here we look only at the columns of immediate interest: the state, the county, and the share of votes that went to Democrat Barack Obama.
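A minimal sketch of loading such a file with pandas and keeping only the three columns of interest might look like the following. The column names (`state`, `county`, `total_votes`, `dem_share`) are assumptions, and the inline CSV text is a placeholder standing in for the real file; its rows are not real election results.

```python
import io

import pandas as pd

# Placeholder CSV text standing in for the real file.
# Column names are assumed and the values are made up.
csv_text = """state,county,total_votes,dem_share
PA,County A,10000,45.0
OH,County B,20000,52.5
FL,County C,15000,38.2
"""

# With a real file on disk, a path would be passed to read_csv instead.
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns of immediate interest.
subset = df[["state", "county", "dem_share"]]
print(subset.head())
```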

We could stare at these numbers, but it is pretty hopeless to gain any real understanding that way. Alternatively, we could charge in headlong and start defining parameters, computing their confidence intervals, and performing hypothesis tests. However, that approach is like a field commander charging into battle without first getting a feel for the terrain and sizing up the opposing army.

A good field commander surveys the terrain before engaging in battle. Likewise, in statistical analysis we should explore our data first. In this chapter, we'll discuss graphical exploratory data analysis, which involves taking data from its tabular form and representing it graphically. The information presented is the same, but the form is more human-interpretable.

For instance, let's plot the Democratic share of the vote in the counties of all three swing states as a histogram. The height of each bar is the number of counties that had a given level of support for Barack Obama; the tallest bar, for example, is the number of counties in which between 40% and 50% of the votes were cast for Obama. Because there is more area in the histogram to the left of 50%, we can see right away that more counties voted for Obama's opponent, John McCain, than for Obama. Reaching this conclusion by counting rows by hand would have been extraordinarily tedious, but a single plot makes it immediate.
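The histogram described above could be sketched as follows with Matplotlib. The data here are random numbers generated as a stand-in, not the real 2008 results, and the variable name `dem_share`, the bin width of 10 percentage points, and the output file name are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the per-county Democratic vote share (in percent).
# These are random numbers, NOT the real 2008 results.
rng = np.random.default_rng(seed=0)
dem_share = np.clip(rng.normal(loc=45, scale=10, size=200), 0, 100)

# Bin edges every 10 percentage points: 0, 10, ..., 100.
bin_edges = np.arange(0, 101, 10)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(dem_share, bins=bin_edges)
ax.axvline(50, linestyle="--")  # mark the 50% line
ax.set_xlabel("percent of vote for Obama")
ax.set_ylabel("number of counties")
fig.savefig("dem_share_hist.png")

# The comparison the plot suggests, written as an explicit count.
n_below = int((dem_share < 50).sum())
n_at_or_above = int((dem_share >= 50).sum())
print(f"counties below 50%: {n_below}, at or above 50%: {n_at_or_above}")
```

The explicit count at the end is only a cross-check: the point of the plot is that the imbalance around the 50% line is visible at a glance, before any counting is done.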

By exploring our data using graphical exploratory data analysis techniques, we can already start drawing conclusions from it. In this chapter, we'll delve deeper into the basics of EDA and review some exercises to help you develop your skills in this critical area of statistical thinking.

"WEBVTTKind: captionsLanguage: enYogi Berra said you can observe a lot by watching the same is true with data if you can appropriately display your data you can already start to draw conclusions from it I'll go even further exploring your data is a crucial step in your analysis when I say exploring your data I mean organizing and plotting your data and maybe computing a few numerical summaries about them this idea is known as exploratory data analysis or EDA and was developed by one of the greatest statisticians of all time John Tukey he wrote a book entitled exploratory data analysis in 1977 where he laid out the principles in that book he said exploratory data analysis can never be the whole story but nothing else can serve as the foundation stone I wholeheartedly agree with this so we will begin our study of statistical thinking with EDA let's consider an example here I have a data set I acquired from data gov containing the election results of 2008 at the county level in each of the three major swing states of Pennsylvania Ohio and Florida these are the ones that largely decide recent elections in the US this is how they look when I open the file in my text editor they are a little prettier if we look at them in a panda's data frame in this case we are only looking at the columns of immediate interest the state county and share of votes that went to Democrat Barack Obama now we could stare at these numbers but I think you'll agree that it is pretty hopeless to gain any sort of understanding from doing this alternatively we could charge in headlong and start defining and computing parameters and their confidence intervals and do hypothesis tests now you will learn how to do all of these things in this course and its sequel but a good field commander does not just charge into battle without first getting a feel for the terrain and sizing up the opposing army so like the field commander we should explore the data first in this chapter we will discuss graphical 
exploratory data analysis this involves taking data from tabular form like we have here in the data frame and representing it graphically you are presenting the same information but it is in a more human interpretable form for example we take the Democratic share of the vote in the counties of all three swing states and plot them as a histogram the height of each bar is a number of counties that had a given level of support for Barack Obama for example the tallest bar is the number of counties that had between 40% and 50% of its votes cast for Obama right away because there is more area in the histogram to the left of 50% we can see that more counties voted for Obama's opponent John McCain than voted for Obama look at that just by making one plot we could already draw a conclusion from the data now this would have been extraordinarily tedious if we did it by hand counting in the data frame now let's review some of the basic ideas behind EDA with a couple of exercisesYogi Berra said you can observe a lot by watching the same is true with data if you can appropriately display your data you can already start to draw conclusions from it I'll go even further exploring your data is a crucial step in your analysis when I say exploring your data I mean organizing and plotting your data and maybe computing a few numerical summaries about them this idea is known as exploratory data analysis or EDA and was developed by one of the greatest statisticians of all time John Tukey he wrote a book entitled exploratory data analysis in 1977 where he laid out the principles in that book he said exploratory data analysis can never be the whole story but nothing else can serve as the foundation stone I wholeheartedly agree with this so we will begin our study of statistical thinking with EDA let's consider an example here I have a data set I acquired from data gov containing the election results of 2008 at the county level in each of the three major swing states of Pennsylvania Ohio and 
Florida these are the ones that largely decide recent elections in the US this is how they look when I open the file in my text editor they are a little prettier if we look at them in a panda's data frame in this case we are only looking at the columns of immediate interest the state county and share of votes that went to Democrat Barack Obama now we could stare at these numbers but I think you'll agree that it is pretty hopeless to gain any sort of understanding from doing this alternatively we could charge in headlong and start defining and computing parameters and their confidence intervals and do hypothesis tests now you will learn how to do all of these things in this course and its sequel but a good field commander does not just charge into battle without first getting a feel for the terrain and sizing up the opposing army so like the field commander we should explore the data first in this chapter we will discuss graphical exploratory data analysis this involves taking data from tabular form like we have here in the data frame and representing it graphically you are presenting the same information but it is in a more human interpretable form for example we take the Democratic share of the vote in the counties of all three swing states and plot them as a histogram the height of each bar is a number of counties that had a given level of support for Barack Obama for example the tallest bar is the number of counties that had between 40% and 50% of its votes cast for Obama right away because there is more area in the histogram to the left of 50% we can see that more counties voted for Obama's opponent John McCain than voted for Obama look at that just by making one plot we could already draw a conclusion from the data now this would have been extraordinarily tedious if we did it by hand counting in the data frame now let's review some of the basic ideas behind EDA with a couple of exercises\n"