Using Dataprep to Generate EDAs: A Step-by-Step Guide
Dataprep is a Python library that helps data scientists and analysts explore and visualize their datasets. It has three main components: a connector for collecting data from web APIs, an EDA module for exploratory data analysis, and a clean module for standardizing values such as column headers, country names, dates, and URLs. One of its most popular features is the ability to generate Exploratory Data Analysis (EDA) reports. This article shows how to use Dataprep's EDA functions to create interactive and informative reports.
Firstly, let's take a look at the Titanic dataset, one of the example datasets available in the Dataprep library. It has 891 rows and 12 columns describing the Titanic's passengers, including their age, cabin number, class, and more, and it contains quite a few missing values. We can use this dataset to demonstrate the EDA tools in Dataprep.
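As a minimal sketch of this step, assuming Dataprep is installed (for example with `pip install dataprep`) and that the bundled example data is loaded through `load_dataset` from `dataprep.datasets`, the Titanic data can be loaded and its shape checked against the 891 rows and 12 columns reported in the walkthrough. The helper names `load_titanic` and `matches_walkthrough` are illustrative and not part of Dataprep.

```python
EXPECTED_SHAPE = (891, 12)  # (rows, columns) reported for the Titanic example data


def load_titanic():
    """Load Dataprep's bundled Titanic dataset as a pandas DataFrame."""
    # Imported inside the function so the helper can be defined
    # even in an environment where dataprep is not installed yet.
    from dataprep.datasets import load_dataset

    return load_dataset("titanic")


def matches_walkthrough(shape, expected=EXPECTED_SHAPE):
    """Return True when a (rows, columns) shape equals the expected one."""
    return tuple(shape) == tuple(expected)


# Typical notebook usage (requires dataprep):
#   df = load_titanic()
#   matches_walkthrough(df.shape)
```

Checking the shape right after loading is a cheap way to confirm that you are working with the same data as the walkthrough before comparing any of the statistics that follow.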
To begin with, we will use the plot_diff function to compare two data frames. In this case, we split the Titanic dataset into two parts: the first 500 rows (df1) and the remaining rows (df2). Passing df1 and df2 to plot_diff as a list produces side-by-side plots, with df1 in blue and df2 in orange, so we can quickly glance at the value counts and histograms for all columns and see how they differ between the two data frames.
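The comparison described above can be sketched as follows, assuming `df` is the Titanic DataFrame and Dataprep is installed. `split_head_tail` and `compare_halves` are hypothetical helper names introduced for illustration; `plot_diff` is Dataprep's function and receives the two frames as a list.

```python
def split_head_tail(data, n_first=500):
    """Return (first n_first rows, remaining rows).

    Plain slicing works for pandas DataFrames (row slicing) and for lists.
    """
    return data[:n_first], data[n_first:]


def compare_halves(df, n_first=500):
    """Plot the first n_first rows of df against the rest with plot_diff."""
    # Lazy import so this module can be loaded without dataprep installed.
    from dataprep.eda import plot_diff

    df1, df2 = split_head_tail(df, n_first)
    return plot_diff([df1, df2])  # the two DataFrames are passed as one list


# Typical notebook usage (requires dataprep):
#   compare_halves(df)
```

With the 891-row Titanic data this gives 500 rows in df1 and 391 rows in df2, so the two groups are of similar but not equal size when reading the value counts.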
Another useful feature of Dataprep is the plot_missing function, which summarizes missing data with statistics, a bar chart, a spectrum plot, a heat map, and a dendrogram. For the Titanic dataset it reports 866 missing cells, about 8.1 percent of the data, spread over three columns. The cabin column is the most affected, with 687 missing values (over 77 percent), followed by age with 177 (roughly 19.87 percent). The heat map and dendrogram point to the same two columns.
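The missing-data figures quoted in the walkthrough (866 missing cells in total, 687 in cabin and 177 in age, out of 891 rows and 12 columns) can be cross-checked with a little arithmetic. The sketch below does that in plain Python and also wraps the `plot_missing` call; the constant and helper names are illustrative, and `show_missing` assumes Dataprep is installed.

```python
ROWS, COLS = 891, 12      # dataset shape reported in the walkthrough
MISSING_CELLS = 866       # total missing cells reported by plot_missing
MISSING_BY_COLUMN = {"cabin": 687, "age": 177}


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


def show_missing(df):
    """Render Dataprep's missing-value views for df."""
    # Lazy import so the arithmetic below runs without dataprep installed.
    from dataprep.eda import plot_missing

    return plot_missing(df)


cabin_pct = percent(MISSING_BY_COLUMN["cabin"], ROWS)   # 77.1
age_pct = percent(MISSING_BY_COLUMN["age"], ROWS)       # 19.87
overall_pct = percent(MISSING_CELLS, ROWS * COLS)       # 8.1
# Missing cells left over for the third affected column:
third_column = MISSING_CELLS - sum(MISSING_BY_COLUMN.values())  # 2
```

The numbers are consistent with each other: cabin and age account for all but 2 of the 866 missing cells, which matches the statement that missing data appears in three columns.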
Furthermore, Dataprep can generate a complete EDA report and export it as an HTML file, with all of the corresponding plots included, which makes it easy to share findings with colleagues. The create_report function builds the report from a dataset, and the result can then be shown inside a notebook environment such as Jupyter or Google Colab.
To demonstrate this, let's call the create_report function with the Titanic dataset as input and assign the result to a variable named report. Calling the report.show method then renders the report inside the notebook, where it may look a bit small because it sits inside an output cell; the show_browser method is an alternative. The report's "Overview" section gives a summary of the dataset, including statistics and insights.
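Unfortunately, the inline report does not always work as intended. When this walkthrough was run in Google Colab, the report was displayed but clicking through it failed, and the show_browser method did not work there either (both may behave better in a Jupyter notebook on a local computer). To avoid this issue, we can save the report to a file using the report.save method. This generates an HTML file containing exactly the same EDA report, which we can then download and open on our local computer.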
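The create, show, and save steps can be sketched as below, assuming Dataprep is installed and `df` is the Titanic DataFrame. `make_report` and `export_report` are hypothetical wrappers, and the file name `titanic_report` is only an example; the exact output location, and whether the `.html` extension is added automatically, may depend on the installed Dataprep version.

```python
def make_report(df):
    """Build a Dataprep EDA report for df."""
    # Lazy import so the helpers can be defined without dataprep installed.
    from dataprep.eda import create_report

    return create_report(df)


def export_report(report, name="titanic_report"):
    """Save the report to an HTML file and return the name passed to save()."""
    report.save(name)  # one positional argument: the output file name
    return name


# Typical notebook usage (requires dataprep):
#   report = make_report(df)
#   report.show()            # render inline in the notebook
#   export_report(report)    # write an HTML file to open locally
```

Saving is the more portable option of the two: the HTML file can be opened in any browser and shared as is, regardless of how well the notebook environment renders the inline view.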
The generated report provides a wealth of information about our dataset, including dataset statistics, per-variable plots with descriptive summary statistics, and automatic insights, for example that the fare variable is skewed and that age and cabin have a high share of missing values. Clicking the "Show Details" button for a variable reveals more, such as the minimum, median, maximum, and interquartile range, along with a KDE plot, a normality plot, and a box plot.
One of the most useful features of Dataprep reports is their interactivity. Hovering the mouse over a plot displays the corresponding values, and the menu bar jumps to the matching section of the report (Overview, Interactions, Correlations, or Missing Values). This makes it easy to explore the data and gain insights into its structure.
Finally, let's take a look at some of the other sections of the EDA report. There are scatter plots depicting the interactions between pairs of variables, correlation plots (Pearson, Spearman, and Kendall tau), and the missing-value views: the spectrum, the heat map, and the dendrogram. These visualizations help us identify patterns and relationships that might not be apparent from looking at the raw data.
In conclusion, Dataprep is an excellent tool for generating EDAs. Its ability to create interactive and informative reports makes it easy to explore and analyze complex datasets. By following the steps outlined in this article, you can use Dataprep to generate your own EDAs and gain a deeper understanding of your dataset.
So this is the website of the dataprep Python library, so let's dive into the tutorial of this video. Let's start out by installing the dataprep library. All right, and now we're going to have a look at some of the data sets that are available in the dataprep library. We can see here that there are quite a few: wine quality, titanic, adult, patient info, countries, house price train and also the test, covid19, waste hauler, and also the iris data set. Here we're going to load in the example data, which is the titanic, and assign it to the df data frame. This is the data set here; notice that there are 891 rows and 12 columns, and you can see that there are quite a few missing data here.

Let's have a look at the functionalities of dataprep. If you go back to the website and have a look at the documentation, you can see that there are about three essential key functionalities. The first one is the connector, which allows you to work with web APIs for collecting data. The second one is the EDA, which allows you to perform exploratory data analysis, and there are quite a few examples here using the various functions that are available. There is also another, which is the clean function, which allows you to perform various cleaning tasks such as renaming column headers and dealing with country names. Most likely you could have different variations of a country name: for example, you could have an acronym, an abbreviation, misspellings of the country, or even the characters used to spell the country name might be a bit different. So here they provide some functionality to make the country names uniform and clean, as well as the other parameters that you can see here, like date and time, and also to deal with duplicate values as well as URLs, phone numbers, IP addresses, etc.

So let's have a look at the code now. Here we're going to import the plot function from the dataprep.eda submodule, and then we're going to make a plot of df, which is the titanic dataset. This should take a short moment. All right, here you go: you have the plots, and you can see the comparison between survived and not survived, or the various bars. So this is the plot that is generated by the dataprep library, and it gives you the ability to quickly glance at the key features of the underlying data.

All right, let's have a look at the plot_correlation function. Let's run it. This is the basic statistics of the correlation: here you have the columns Pearson, Spearman, and Kendall tau, and you have the highest positive correlation and the highest negative correlation, which are summarized for your convenience, as well as the lowest correlation and the mean correlation. Let's have a look at the Pearson now: you can see the intercorrelation matrix, and also the Spearman and the Kendall tau.

All right, let's move on and have a look at the missing values by using the plot_missing function. Here we're going to import the plot_missing function and then use it, and the input argument is df. Similar to the above function, you have tabs that let you select the various views. Here we have the missing statistics: there are 866 missing cells, which account for about 8.1 percent of the data, missing data are in three columns, and there are over 708 rows with missing values. Let's have a look at the bar chart. Here you can quickly see that the age parameter and also the cabin have a lot of missing values: the cabin parameter has over 77 percent missing values, or 687 to be exact, while for age 177 are missing, which is roughly 19.87 percent. So this is a great way to have a look at the missing data very quickly. This is another one, where you can see the age and the cabin with the missing data indicator. Let's have a look at the heat map, also indicating that age and cabin have the missing data, and the dendrogram here as well: cabin and age.

All right, let's look further. Here you're going to use the plot_diff function, which allows you to compare two data frames. We're going to split the titanic data frame into two by selecting the first 500 rows and assigning them to df1, and then assigning the remaining rows to df2. Let's do that. All right, now we have two data frames, df1 and df2, so let's compare them by using the plot_diff function, where the input argument is, in brackets, df1 and df2. You can see the plots comparing the two: df1 is in blue and df2 is in orange, so you can quickly glance at the value counts for all of the parameters, and here is the histogram for both in blue and orange. So this provides a quick and handy way to visualize the comparison between two data frames.

I think this is probably one of the most popular functionalities of the dataprep library, which is to generate the EDA reports. The great thing about this is that you can also export the report as an HTML file and share it, and all of the corresponding images of the plots will go along with it. So let's have a look here; let me delete the example from the prior run. All right, so you use report.show, where report is assigned by the create_report function and the input argument is df. Let's run that first. Now the report will be generated and assigned to the report variable, and then we're going to show it inside the Jupyter notebook, here in Google Colab. As noted here, you could also try the show_browser function. All right, so this is the report. It might be a bit small for the screen because it is inside an output cell. Let's have a look here briefly: you can click on Overview. Okay, so it's not working as intended. Let's run it again. I think it's better to save it out.

Okay, so it's quite interactive, and you could click on... okay, so it's not working as intended, so let me just save it out as a report right here using report.save. Let's do that. Also, I tried the show_browser function in a prior run, and it also did not work on Google Colab; it might work in a Jupyter notebook on a local computer. I've confirmed from the prior run that the save function works, so it'll generate the HTML file right here. Let's download it. The exact same report that you see here will be saved into the HTML file. Okay, I think it's downloaded already. All right, let's allow it to download to the computer. Let me try again: download. All right, let's open it up.

So this is the generated EDA report. You can see at a glance the data set statistics, and there are quite a few missing data. Let's have a look at the insights on the various columns here. It provides a pretty good summary: it even tells us that the fare variable is skewed, and it says here that age and cabin have pretty high missing values. Let's have a look at page two, the remainder of the variable insights. Here, let's have a look at the variables: each of the variables is shown in the plots here, along with the descriptive summary statistics. Let's click on Show Details. All right, it provides more, such as the minimum, median, maximum, and interquartile range, and also other descriptive statistics, plus the plots for the KDE, the normality plot, and the box plot. So this comes in handy, and they're pretty interactive as well: if you hover your mouse over a plot, you will see the parameters shown here.

Let's look further. This is the scatter plot for depicting the interaction between the variables, so you could try out the different descriptors and their interactions. This is the correlation plot: Pearson, Spearman, Kendall tau. And here we have the missing values, as we have seen prior to this: the spectrum, the heat map, and also the dendrogram. You could also try out the menu bar here, and it'll hop over to the corresponding position: correlation is here, interactions is here, missing value is this one, overview is here. All right, so this is the generated report, and you could share this report with your colleagues and co-workers. Please drop a comment on which feature you like about dataprep.