Exploratory Data Analysis: A Powerful Tool for Identifying Data that Needs Further Investigation
Exploratory data analysis is a crucial step in understanding and working with datasets. It involves performing basic analyses to identify patterns, outliers, and missing values, which can help us determine what needs further investigation. In this article, we will delve into the world of exploratory data analysis and explore how it can be used to identify data that requires attention.
Counting Unique Values in Data
One of the most basic analyses we can perform is counting the unique values in our data. The `info` method gives us the data type of each column, which provides valuable information about the structure of our dataset. For example, suppose we want frequency counts for the non-numeric columns "continent", "country", and "fertility". We first select a column, using either dot notation (which works as long as the column name contains no spaces or special characters and does not clash with an existing pandas name) or bracket notation. Once we have selected the column, we use the `value_counts` method to get the count of each unique value.
For instance, to perform a frequency count on the "continent" column, we can select it with dot notation, `df.continent.value_counts()`, or with bracket notation, `df['continent'].value_counts()`. Both print the counts for each unique value of the "continent" column in descending order, and both produce exactly the same output.
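Here is a minimal sketch of the two notations side by side, assuming a DataFrame named `df` with a "continent" column (the tiny example data below is hypothetical and stands in for the full dataset):

```python
import pandas as pd

# Hypothetical example data; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "continent": ["Africa", "Asia", "Europe", "Asia", "Africa", "Africa"],
})

# Dot notation: works when the column name has no spaces or special characters.
print(df.continent.value_counts())

# Bracket notation: equivalent output, and works for any column name.
print(df["continent"].value_counts())
```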
In addition to counting unique values of a categorical column, we can count the number of observations for each country with the same `value_counts` method. We select the "country" column and chain `value_counts` with `head` so that only the top five counts are returned, since there are too many countries to show at once; a sketch follows below. In our example data this reveals that Sweden has two observations where we expect only one, prompting us to investigate that data point further.
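As a sketch of that method chain, again with hypothetical data in which Sweden is accidentally duplicated:

```python
import pandas as pd

# Hypothetical data where Sweden appears twice by mistake.
df = pd.DataFrame({
    "country": ["Sweden", "Sweden", "Norway", "Denmark", "Finland", "Iceland"],
})

# Chain value_counts with head to show only the top five counts.
print(df["country"].value_counts().head())
# Sweden shows a count of 2, flagging that observation for further investigation.
```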
Missing Values in Data
Another important aspect of exploratory data analysis is identifying missing values in our dataset. By default, `value_counts` excludes missing values from its counts. To count them as well, we pass the `dropna=False` parameter to the `value_counts` method.
For instance, to perform a frequency count on the "continent" column while also counting missing values, we would call `df['continent'].value_counts(dropna=False)`. This prints the counts for each unique value in the column, with any missing values counted under `NaN`.
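The following sketch, using a small hypothetical Series with one missing value, shows the difference the parameter makes:

```python
import pandas as pd
import numpy as np

# Hypothetical column containing one missing value.
continent = pd.Series(["Africa", "Asia", np.nan, "Asia"], name="continent")

# By default, NaN is excluded from the counts.
print(continent.value_counts())

# With dropna=False, an extra row reports how many values are missing.
print(continent.value_counts(dropna=False))
```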
Furthermore, if missing values in a column are recorded as strings, the whole column may end up with the wrong data type, stored as strings instead of numbers. In that case, we need to recode the missing string values before we can analyze the column. For example, our "fertility" column contains the string "missing"; this is why the fertility column has the wrong data type, and it alerts us that we need to recode that string.
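One possible way to recode such a column, sketched with hypothetical data (the "missing" marker and the values are illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical fertility column stored as strings because of a "missing" marker.
df = pd.DataFrame({"fertility": ["1.8", "2.4", "missing", "3.1"]})
print(df["fertility"].dtype)  # object, i.e. not a numeric dtype

# Recode the "missing" string to NaN, then convert the column to numeric.
df["fertility"] = df["fertility"].replace("missing", np.nan).astype(float)
print(df["fertility"].dtype)  # float64
```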
Outliers in Data
Another important aspect of exploratory data analysis is identifying outliers in our dataset. There are many working definitions of an outlier; one is a value that is considerably higher or lower than the rest of the data, and a dedicated statistics course will give more detailed definitions. Outliers are observations of interest that we want to investigate further during data cleaning.
To spot outliers, we can calculate summary statistics on numeric columns using the `describe` method; only columns with a numeric type are summarized. For each column, `describe` returns the count of non-missing values, the mean, the standard deviation, the minimum, the 25th percentile, the median (50th percentile), the 75th percentile, and finally the maximum value.
For example, to calculate summary statistics on the "population" column, we select the column and call `describe`: `df['population'].describe()`. This prints a set of statistics that helps us understand the distribution of values in the column.
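A minimal sketch of `describe`, using hypothetical population values (in millions) purely to illustrate the output:

```python
import pandas as pd

# Hypothetical population values, in millions, to illustrate describe.
df = pd.DataFrame({"population": [5.4, 9.6, 1350.7, 126.0, 67.5]})

# describe reports count, mean, std, min, 25%, 50% (median), 75%, and max.
print(df["population"].describe())

# Calling describe on the whole DataFrame summarizes every numeric column at once.
print(df.describe())
```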
A quick scan down the population results reveals that the maximum population value is 2.3 billion. That is suspicious: our data comes from 2012, and no country had a population of that magnitude at the time, so this value needs further investigation. Now it's your turn to calculate descriptive statistics for exploratory data analysis and see what needs cleaning in your data.
"WEBVTTKind: captionsLanguage: enin this video I will show you how we can use exploratory data analysis to help identify data that need further investigation the most basic analysis we can do is count the unique values in our data we can use the info method to get the data type of each column here I will show you the frequency counts for the non numeric continent country and fertility columns and the numeric population column the performer frequency count we first select the column we want to perform a frequency count on if the column name does not contain any special characters spaces and is not a name of a Python function we can select the column directly by its name using dot notation it works the same way as subsetting using bracket notation once we have the column selected we can use the value counts method on the selected column I like to use the drop na equals false parameter since it will also count the number of missing values if there are any the continent column does not have a missing value so num will be reported value counts will print out that counts for each unique value of a column in descending order note that even though we counted a column of the object D type the results of value counts will be of the type int another way we can select columns is using the bracket notation here is the same code and output as before this time using bracket notation to select a column now we will count the number of observations for each country in our data since there are too many countries to show at once I'm using the head method to only return the top five counts in this example I'm chaining together methods I'm slicing and getting the value counts just like before we expect each country to have only one observation but Sweden has two this will require us to investigate this data point further the fertility column is the column we expected to be numeric but stored as a string this is because we have a string named missing in the column this is why the fertility column has the wrong D type it also alerts us that we need to recode the missing string if your column has missing values they will also be counted provided you pass the drop na equals false parameter here you see we have 42 missing values in the column another type of EDA we can do is calculate summary statistics on numeric columns this can help spot outliers in our data there are many working definitions for outliers one definition is a value that is considerably higher or lower than the rest of the data you can consult the data can statistics course for more detailed definitions of outliers outliers are observations of interest we want to investigate further for data cleaning we can quickly calculate summary statistics on our data by using the describe method only the columns that have a numerical type will be returned describe returns a number of non missing values the mean standard deviation minimum 25th 50th and 75th percentiles of our data where the 50th percentile is the median and finally the maximum value of our data a quick scan down the population results show that the maximum population value is 2.3 billion people our data comes from 2012 no country had this population then now it's your turn to calculate descriptive statistics for exploratory data analysis to see what needs cleaning in your datain this video I will show you how we can use exploratory data analysis to help identify data that need further investigation the most basic analysis we can do is count the unique values in our data we can use the info 
method to get the data type of each column here I will show you the frequency counts for the non numeric continent country and fertility columns and the numeric population column the performer frequency count we first select the column we want to perform a frequency count on if the column name does not contain any special characters spaces and is not a name of a Python function we can select the column directly by its name using dot notation it works the same way as subsetting using bracket notation once we have the column selected we can use the value counts method on the selected column I like to use the drop na equals false parameter since it will also count the number of missing values if there are any the continent column does not have a missing value so num will be reported value counts will print out that counts for each unique value of a column in descending order note that even though we counted a column of the object D type the results of value counts will be of the type int another way we can select columns is using the bracket notation here is the same code and output as before this time using bracket notation to select a column now we will count the number of observations for each country in our data since there are too many countries to show at once I'm using the head method to only return the top five counts in this example I'm chaining together methods I'm slicing and getting the value counts just like before we expect each country to have only one observation but Sweden has two this will require us to investigate this data point further the fertility column is the column we expected to be numeric but stored as a string this is because we have a string named missing in the column this is why the fertility column has the wrong D type it also alerts us that we need to recode the missing string if your column has missing values they will also be counted provided you pass the drop na equals false parameter here you see we have 42 missing values in the column another type of EDA we can do is calculate summary statistics on numeric columns this can help spot outliers in our data there are many working definitions for outliers one definition is a value that is considerably higher or lower than the rest of the data you can consult the data can statistics course for more detailed definitions of outliers outliers are observations of interest we want to investigate further for data cleaning we can quickly calculate summary statistics on our data by using the describe method only the columns that have a numerical type will be returned describe returns a number of non missing values the mean standard deviation minimum 25th 50th and 75th percentiles of our data where the 50th percentile is the median and finally the maximum value of our data a quick scan down the population results show that the maximum population value is 2.3 billion people our data comes from 2012 no country had this population then now it's your turn to calculate descriptive statistics for exploratory data analysis to see what needs cleaning in your data\n"