How to Quickly Perform Exploratory Data Analysis (EDA) in Python using Sweetviz

Exploratory Data Analysis using the Sweetviz Library

This is a quick exploratory data analysis, and you can see that the entire visualization you're seeing here requires only a single line of code, which is essentially sv.analyze. So this is the only one-liner from the Sweetviz library that generates the analysis; the rest is just visualizing the HTML file using the show_html function.
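As a rough sketch of the one-liner described above, assuming Sweetviz is installed (pip install sweetviz) and the data is a pandas DataFrame. The helper name write_analyze_report and the tiny stand-in frame are illustrative and not taken from the video's notebook:

```python
import pandas as pd


def write_analyze_report(df: pd.DataFrame, filepath: str = "analyze.html") -> str:
    """Generate a Sweetviz report for df and save it as an HTML file."""
    # Imported lazily so the rest of this sketch loads even without Sweetviz.
    import sweetviz as sv

    report = sv.analyze(df)  # the one-liner that builds the analysis
    # open_browser=False writes the file without opening a new browser tab.
    report.show_html(filepath=filepath, open_browser=False)
    return filepath


# Hypothetical stand-in for the penguins DataFrame (made-up values).
penguins_demo = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
        "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
        "body_mass_g": [3750, 5000, 3700, 3800],
    }
)

# Uncomment to write analyze.html to the working directory:
# write_analyze_report(penguins_demo)
```

In a notebook, the same report object also offers show_notebook() for inline display, which may be more convenient than opening the saved file.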

All right, let's proceed further. Recall that a moment ago I split the data into X and y. Normally, when we create a data set for model building with the scikit-learn package, we split the data into X and y: X holds the features used to train the model, and y is the class label. We then split X and y further into a training set and a testing set. Here we're going to use the 80/20 split ratio, and we'll use the train_test_split function from the scikit-learn package.
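A minimal sketch of the 80/20 split described above. It uses a synthetic stand-in with the same shape as the penguins data (333 rows, 7 columns); the column names and random values are assumptions for illustration, not the real data set:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same shape as the penguins data, made-up values.
rng = np.random.default_rng(0)
n_rows = 333
data = pd.DataFrame(
    {
        "species": rng.choice(["Adelie", "Gentoo", "Chinstrap"], size=n_rows),
        "island": rng.choice(["Biscoe", "Dream", "Torgersen"], size=n_rows),
        "bill_length_mm": rng.normal(44, 5, size=n_rows),
        "bill_depth_mm": rng.normal(17, 2, size=n_rows),
        "flipper_length_mm": rng.normal(200, 14, size=n_rows),
        "body_mass_g": rng.normal(4200, 800, size=n_rows),
        "sex": rng.choice(["male", "female"], size=n_rows),
    }
)

# y is the class label; X is every other column.
y = data["species"]
X = data.drop(columns="species")

# test_size=0.2 gives the 80/20 split; no seed is set, so rows vary per run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape, X_test.shape)  # (266, 6) (67, 6)
```

The row counts match the ones quoted in the video because scikit-learn rounds the test size up: 20% of 333 is 66.6, which becomes 67 test rows and leaves 266 for training.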

Let's go ahead and run that. Let's have a look at X_train: there are 266 rows and six columns. X_test accounts for 67 rows and also six columns. So X_train is the 80% split and X_test is the 20% subset. Now we're going to use the Sweetviz library to compare the training set and the test set. As always, it's a one-liner: sv.compare, and as input arguments we use X_train and X_test, labeling them Train and Test. Finally, we use the show_html function to write the generated compare.html file. We're not going to let Google Colab open the generated HTML, so we set open_browser equal to False here.
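The comparison call described above might look like the following sketch, assuming Sweetviz is installed. The helper name write_compare_report and the two small stand-in frames are hypothetical; in the video the inputs are the X_train and X_test frames from the split:

```python
import pandas as pd


def write_compare_report(train_df, test_df, filepath="compare.html"):
    """Compare two DataFrames with Sweetviz and save the report as HTML."""
    # Imported lazily so this sketch loads even where Sweetviz is missing.
    import sweetviz as sv

    # Each [DataFrame, "label"] pair names one side of the comparison.
    report = sv.compare([train_df, "Train"], [test_df, "Test"])
    # open_browser=False only writes the file; it does not open a new tab.
    report.show_html(filepath=filepath, open_browser=False)
    return filepath


# Hypothetical stand-ins for X_train and X_test (made-up values).
X_train_demo = pd.DataFrame(
    {"island": ["Biscoe", "Dream", "Biscoe"], "body_mass_g": [5000, 3700, 4800]}
)
X_test_demo = pd.DataFrame({"island": ["Torgersen"], "body_mass_g": [3750]})

# Uncomment to write compare.html to the working directory:
# write_compare_report(X_train_demo, X_test_demo)
```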

Let's run it, and in just a moment you'll get the HTML file. All right, now that we've generated the file, let's take a look. You're going to see two colors, blue and orange, which represent the train set and the test set. This part lets you see a quick breakdown comparing the two: 266 rows for the train set and 67 rows for the test set. We also see the breakdown for each of the values. For all of them, Biscoe, Dream, or Torgersen, the relative frequency is roughly similar for train and test, because the rows are randomly split between the two sets. And because the split is randomized, the histograms we see here may differ from run to run.
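To check the "roughly similar frequency" observation numerically rather than by eye, one option is to tabulate category proportions for both subsets with pandas. This is a side sketch that is not part of the video; the island values below are randomly generated stand-ins, not the real penguins data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up island labels with uneven proportions, 333 rows like the data set.
rng = np.random.default_rng(1)
islands = pd.DataFrame(
    {
        "island": rng.choice(
            ["Biscoe", "Dream", "Torgersen"], size=333, p=[0.5, 0.35, 0.15]
        )
    }
)

train, test = train_test_split(islands, test_size=0.2, random_state=0)

# One column of relative frequencies per subset, aligned on the island name.
freq = pd.concat(
    {
        "train": train["island"].value_counts(normalize=True),
        "test": test["island"].value_counts(normalize=True),
    },
    axis=1,
).fillna(0.0)

print(freq.round(2))
```

With a random split, the two columns should be close but not identical, which is the same picture the blue and orange bars give in the report.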

So what you're seeing and what I'm seeing might be different: if we don't set the seed number, the randomization lets the seed vary at each run, so we won't get the same data splits. You might see some variation between what you're getting and what I'm showing here. The way to solve that is to set the seed number, so let's have that as your homework: try setting the seed number, and you'll see that the distribution from the data splitting is the same every time you perform the analysis.
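One way to approach the homework is the random_state parameter of train_test_split, which fixes the seed. The sketch below uses placeholder data, and the value 42 is an arbitrary choice; any fixed integer has the same effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with 333 rows; the values themselves do not matter here.
X = pd.DataFrame({"feature": range(333)})
y = pd.Series([i % 3 for i in range(333)], name="label")


def split_once(seed):
    """Run an 80/20 split with a fixed seed."""
    return train_test_split(X, y, test_size=0.2, random_state=seed)


first = split_once(42)
second = split_once(42)

# Index 1 of each result is X_test; identical seeds pick identical rows.
same_rows = first[1].index.equals(second[1].index)
print(same_rows)  # True
```

With the seed fixed, the Sweetviz comparison report built from these subsets should show the same distributions on every run.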

So in a nutshell, you're seeing the comparison between the training set and the test set for each of the columns, or each of the variables. Let me show you: you can also download the analyze.html and compare.html files to your own computer. Let's open analyze.html on our computer. So this is the HTML file, and you can even email it to your colleague. Oh, it's blocking; by default the pop-up was blocked, so let me try again. All right, now it works. Here you go: the comparison between the training set and the test set.

So feel free to share this with your colleague. As you can see, each of these analyses takes only a couple of seconds, because it requires just a one-liner, plus another one-liner to display the HTML right inside the Jupyter notebook. We can also export it, save it, and email it to a colleague. If you're finding value in this video, please give it a like, subscribe if you haven't yet done so, and hit the notification bell to be notified of the next video. As always, the best way to learn data science is to do data science, and please enjoy the journey. Thank you for watching; please like, subscribe, and share, and I'll see you in the next one. In the meantime, please check out these videos.

Installing Sweetviz and Analyzing the Penguins Data Set

In this video I'm going to show you how to perform a quick exploratory data analysis using the Sweetviz library in Python, so without further ado, we're starting right now. Fire up the notebook that I'm providing in the description of this video. Let's start by installing the Sweetviz library, which is a simple pip install. All right, it's now successfully installed.

Now we're going to read in the penguins data set: import pandas as pd, and use the read_csv function. As the input argument, it takes the URL, from the Data Professor GitHub, of the cleaned version of the penguins data set, and we assign the result to the penguins variable. Next we split the penguins data set into the X and y variables: y is the species column and X is everything else, which is why we drop species. Let's have a look at the X variable. You can see that it comprises both qualitative and quantitative data types, with a total of six columns and 333 rows.

Now we're going to analyze the penguins data set using the Sweetviz library. For this one, we take a look at the entire data set, before splitting it into X and y; that's why we have the column called species here, which will be y in the subsequent analysis that I'll show you in just a moment. So the entire penguins data set has a total of 7 columns and 333 rows.

Now comes the fun part. We create a variable called analyze_report, and we use the analyze function from the Sweetviz library with the penguins data set as the input argument. Then we use the show_html function to create an HTML file called analyze.html. Okay, it hasn't run yet, because we first have to import sweetviz as sv. All right, now the analyze.html file has been generated. To the left here in Google Colab you can look at the files in the working directory: click on the files icon and you'll see analyze.html. You can either download it to your computer or display it here, so why don't I start by displaying it right inside Google Colab. We run this cell.

So this is the quick exploratory data analysis. You can quickly see the general distribution of the entire data set. For the species variable, there are three distinct types: Adelie, Gentoo, and Chinstrap. Adelie accounts for more than 40% of the entire data set, Gentoo comes in at number two, followed by Chinstrap. For the island column, there are three distinct types: Biscoe, Dream, and Torgersen. The most prevalent is Biscoe, followed by Dream, followed by Torgersen. You can see the relative percentage that each of these accounts for.

The next ones are histograms because they are quantitative, whereas the prior two are qualitative variables. Each bar represents a range, like somewhere between 32 to 35, 35 to 37, etc. We see that there are 163 distinct values because the data are quantitative, versus 3 for the qualitative data types. It's the same for bill length, bill depth, flipper length, and body mass: each gets a histogram, and for each of them you see the maximum and minimum value, the median, the average, Q3 and Q1, the 95th and 5th percentile values, and other statistical parameters as well. Then the seventh column here, sex, has two distinct values, male and female, each accounting for roughly half of the data set.

Here to the right, if you hover your mouse, you see more details. If I hover on species, I see the same thing but slightly bigger, along with the specific numbers: for example, Adelie accounts for 44%, so n is equal to 146; Gentoo, with 119, accounts for 36%; and Chinstrap has 68. So out of 333, we see the n size for each of Adelie, Gentoo, and Chinstrap.
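The loading and X/y split steps from the opening of the video could look roughly like this. To keep the sketch runnable offline, it reads a few made-up rows from an in-memory buffer; in the notebook, the argument to read_csv is the URL of the cleaned penguins CSV instead. The column names are assumed:

```python
import io

import pandas as pd

# Made-up rows in the assumed column layout. In the notebook, pass the URL of
# the cleaned penguins CSV to pd.read_csv instead of this buffer.
sample_csv = io.StringIO(
    "species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex\n"
    "Adelie,Torgersen,39.1,18.7,181,3750,male\n"
    "Gentoo,Biscoe,46.1,13.2,211,4500,female\n"
    "Chinstrap,Dream,46.5,17.9,192,3500,female\n"
)
penguins = pd.read_csv(sample_csv)

# y is the species column; X is everything else, so species is dropped.
y = penguins["species"]
X = penguins.drop(columns="species")

print(penguins.shape)  # (3, 7)
print(X.shape)  # (3, 6)
```

On the full data set the same two lines would give 7 columns before the drop and 6 after, matching the counts mentioned in the video.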