Python Tutorial - Exploratory data analysis

Let's Now Jump into Our First Data Set: Iris Flowers

Our first data set contains data pertaining to iris flowers. The features consist of four measurements: petal length, petal width, sepal length, and sepal width. The target variable encodes the species of flower, and there are three possibilities: versicolor, virginica, and setosa.

As this is one of the data sets included in scikit-learn, we'll import it from there with from sklearn import datasets. In the exercises, you'll get practice at importing files from your local file system for supervised learning. We'll also import pandas, NumPy, and pyplot under their standard aliases. In addition, we'll set the plotting style to ggplot using plt.style.use, firstly because it looks great and secondly to help all you R aficionados feel at home.
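The setup described above might look like the following minimal sketch. The aliases pd, np, and plt are the conventional ones; the transcript only says "standard aliases", so treat them as an assumption.

```python
# Import the bundled data sets module from scikit-learn
from sklearn import datasets

# Conventional aliases for the scientific Python stack (assumed here)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Apply the ggplot-inspired style to all subsequent figures
plt.style.use('ggplot')
```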

We then load the data set with datasets.load_iris() and assign the data to a variable iris. Checking out the type of iris, we see that it's a Bunch, which is similar to a dictionary in that it contains key-value pairs. Printing the keys, we see that they include feature_names and DESCR, which provides a description of the data set.
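As a small sketch of this step, the snippet below loads the data and inspects the returned object. The exact set of keys printed can vary between scikit-learn versions, so no output is shown here.

```python
from sklearn import datasets

# Load the iris data set; the result is a dictionary-like Bunch object
iris = datasets.load_iris()

# Inspect the container type and the keys it exposes
print(type(iris))
print(iris.keys())
```

Because a Bunch behaves like a dictionary, values can be read either as iris['data'] or as the attribute iris.data.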

The other keys are target_names; data, which contains the values of the features; and target, which is the target data. Both the feature and target data are provided as NumPy arrays. The .shape attribute of the feature array tells us that there are 150 rows and four columns. Remember, samples are in rows and features are in columns. Thus we have 150 samples and the four features: petal length and width, and sepal length and width. Moreover, note that the target variable is encoded as zero for setosa, one for versicolor, and two for virginica.
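The checks on array type, shape, and target encoding can be sketched as follows. The use of np.unique to list the distinct labels is an addition for illustration and is not part of the original walkthrough.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Both the features and the target are NumPy arrays
print(type(iris.data), type(iris.target))

# Rows are samples, columns are features
print(iris.data.shape)  # (150, 4)

# The target holds one integer label per sample
print(np.unique(iris.target))  # [0 1 2]
```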

We can see this by printing iris.target_names, in which setosa corresponds to index zero, versicolor to index one, and virginica to index two. To perform some initial exploratory data analysis, or EDA for short, we'll assign the feature and target data to X and y respectively.
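A minimal sketch of this step is shown below. The last two lines, which look up the species name of the first sample, are an illustrative addition rather than something the transcript does.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The position of each name is the integer used in the target array
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']

# Features and target under the conventional names X and y
X = iris.data
y = iris.target

# Illustrative addition: translate the first label back to a species name
first_species = iris.target_names[y[0]]
print(first_species)  # setosa
```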

We'll then build a data frame of the feature data using pd.DataFrame, also passing the column names. Viewing the head of the data frame shows us the first five rows. Now, let's do a bit of visual EDA. We use the pandas function scatter_matrix to visualize our data set. We pass it our data frame, along with our target variable as the argument to the parameter c, which stands for color, ensuring that the data points in our figure will be colored by their species.
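Before any plotting, the data frame step can be sketched like this. The variable name df is an assumption; the column names are taken from iris.feature_names.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Wrap the feature array in a data frame with labeled columns
df = pd.DataFrame(X, columns=iris.feature_names)

# head() returns the first five rows by default
print(df.head())
```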

We also pass a list to figsize, which specifies the size of our figure, as well as a marker size and shape. The result is a matrix of figures. On the diagonal are histograms of the features corresponding to the row and column; the off-diagonal figures are scatter plots of the column feature versus the row feature, colored by the target variable.
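The plotting call might look like the sketch below. In current pandas the function lives at pd.plotting.scatter_matrix. The transcript does not state the figure size, marker size, or marker shape, so figsize=[8, 8], s=150, and marker='D' are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

plt.style.use('ggplot')

iris = datasets.load_iris()
X, y = iris.data, iris.target
df = pd.DataFrame(X, columns=iris.feature_names)

# c=y colors each point by its species label.
# figsize, s (marker size), and marker (shape) are assumed values.
axes = pd.plotting.scatter_matrix(
    df, c=y, figsize=[8, 8], s=150, marker='D'
)

# One subplot per pair of features: a 4 x 4 grid
print(axes.shape)  # (4, 4)
```

In a script, calling plt.show() afterwards displays the figure; in a notebook it is usually rendered automatically.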

There is a great deal of information in this scatter matrix. For example, we can see that petal width and length are highly correlated, as you may expect, and that flowers are clustered according to species.
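The two visual observations can also be checked numerically. This is an optional sketch that goes slightly beyond the transcript: it computes the Pearson correlation between the two petal measurements and the per-species feature means. The column labels come from iris.feature_names.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pearson correlation between petal length and petal width
corr = df['petal length (cm)'].corr(df['petal width (cm)'])
print(round(corr, 2))

# Mean of each feature per species; well-separated means are
# consistent with the clusters seen in the scatter matrix
species = iris.target_names[iris.target]
species_means = df.groupby(species).mean()
print(species_means)
```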

Now it's your turn to dive into a few exercises and do some EDA. Then we'll be back to do some machine learning. Enjoy!
