Machine Learning in R - Building a Classification Model

The Model and Prediction Application: A Step-by-Step Guide to Evaluating Iris Flower Classification Performance

Building the Model

In this step, we build the model in R using the iris data set. We start by loading the necessary libraries: the built-in datasets package, which provides the iris data, and caret, which offers a consistent interface to a large collection of machine learning algorithms. After loading the data into a data frame, we check for missing values, set a random seed so the results are reproducible, and split the data into an 80% training set (120 flowers) and a 20% testing set (30 flowers) using a stratified partition on the class label. Finally, we train a support vector machine with a polynomial kernel, leaving its hyperparameters (degree, scale, and cost C) at fixed values of 1 rather than tuning them.
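
The sketch below shows this step end to end, assuming the caret package is installed; the variable names (TrainingIndex, TrainingSet, Model, and so on) are illustrative choices rather than fixed conventions.

```r
# Load the required libraries
library(datasets)  # provides the built-in iris data set
library(caret)     # unified interface to many ML algorithms

# Load the data and confirm there are no missing values
data(iris)
sum(is.na(iris))   # 0 for the iris data set

# Fix the seed so the split and the model are reproducible
set.seed(100)

# Stratified 80/20 split on the class label
TrainingIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
TrainingSet <- iris[TrainingIndex, ]   # 120 flowers (80%)
TestingSet  <- iris[-TrainingIndex, ]  # 30 flowers (20%)

# SVM with a polynomial kernel; hyperparameters fixed rather than tuned
Model <- train(Species ~ ., data = TrainingSet,
               method = "svmPoly",
               na.action = na.omit,
               preProcess = c("scale", "center"),
               trControl = trainControl(method = "none"),
               tuneGrid = data.frame(degree = 1, scale = 1, C = 1))
```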

Performing the Prediction

After building the model, we use it to make predictions. We predict the class labels of the training set itself as a sanity check and, more importantly, of the testing set, whose labels the model has never seen. The testing-set predictions are what tell us how well the model generalizes to new, unseen data.
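
Continuing the sketch above, prediction is a single call to predict() per data set:

```r
# Apply the trained model to both data sets
Model.training <- predict(Model, TrainingSet)  # predictions for the 120 training flowers
Model.testing  <- predict(Model, TestingSet)   # predictions for the 30 held-out flowers
```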

Cross-Validation

To obtain a more reliable estimate of performance, we also use cross-validation, a technique that repeatedly splits the training data, fits the model on one part, and tests it on the other. With 10-fold cross-validation, the 120 training flowers are divided into 10 folds of 12 each; in each of 10 iterations the model is trained on 9 folds and tested on the fold held out, rotating so that every fold is left out exactly once, and the performance is averaged over the 10 iterations.
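
In caret this only changes the trControl argument; a sketch under the same assumptions as above:

```r
# 10-fold cross-validation: train() refits the model ten times,
# each time holding out a different fold for validation
Model.cv <- train(Species ~ ., data = TrainingSet,
                  method = "svmPoly",
                  na.action = na.omit,
                  preProcess = c("scale", "center"),
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = data.frame(degree = 1, scale = 1, C = 1))
```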

Model Evaluation

Once we have performed the prediction and used cross-validation to evaluate the model's performance, we need to assess its accuracy. We can do this by using a confusion matrix, which provides a summary of the true positive, false positive, true negative, and false negative predictions made by the model. The goal is to compare the predicted labels with the actual labels.
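
caret's confusionMatrix() function computes the matrix and its accompanying statistics directly from the predicted and actual labels; a sketch using the objects defined above:

```r
# Confusion matrices: predicted labels vs. actual labels
Model.training.confusion <- confusionMatrix(Model.training, TrainingSet$Species)
Model.testing.confusion  <- confusionMatrix(Model.testing,  TestingSet$Species)
Model.cv.confusion       <- confusionMatrix(predict(Model.cv, TrainingSet),
                                            TrainingSet$Species)

print(Model.training.confusion)
print(Model.testing.confusion)
print(Model.cv.confusion)
```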

Interpreting the Confusion Matrix

The resulting confusion matrix is a three-by-three table in which each column represents an actual class label and each row a predicted class label. Since the 120-flower training set contains 40 flowers of each species, a perfect model would produce a diagonal of 40-40-40 with zeros everywhere else, indicating that every instance was classified as its true class.

Performance Metrics

However, in real-world scenarios the class distribution is often imbalanced, so accuracy alone is not an ideal metric for evaluating the model's performance. We should also consider metrics such as sensitivity, specificity, positive predictive value, negative predictive value, prevalence, detection rate, and balanced accuracy for each of the three flower classes.
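
The confusionMatrix object returned by caret already contains these per-class statistics; for example, its byClass field is a matrix with one row per class (the column names below are the ones caret uses):

```r
# Per-class metrics from the testing-set confusion matrix
Model.testing.confusion$byClass[, c("Sensitivity", "Specificity",
                                    "Pos Pred Value", "Neg Pred Value",
                                    "Balanced Accuracy")]
```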

Feature Importance

After evaluating the model's performance, we can analyze which features contributed most to its predictions. We use caret's varImp() function to determine which variables were most important for predicting the class labels.
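
A sketch under the same assumptions; plot() on the resulting object draws a variable-importance chart, and the optional col argument recolors the markers (red instead of the default blue):

```r
# Variable importance for the trained model, per flower class
Importance <- varImp(Model)
plot(Importance, col = "red")
```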

Example Output

The code will generate output in a readable format:

Confusion Matrix (training set):

| | Setosa | Versicolor | Virginica |
| --- | --- | --- | --- |
| Predicted Setosa | 40 | 0 | 0 |
| Predicted Versicolor | 0 | 40 | 1 |
| Predicted Virginica | 0 | 0 | 39 |

Accuracy: 0.9917

The single off-diagonal entry shows that one virginica flower is mispredicted as versicolor. On the 30-flower testing set, the model makes one error in the other direction (one versicolor predicted as virginica), for an accuracy of 0.9667, which is reflected in the per-class metrics below.

Sensitivity by Class (testing set):

| Class | True Positive Rate |
| --- | --- |
| Setosa | 1.0 |
| Versicolor | 0.9 |
| Virginica | 1.0 |

Specificity by Class (testing set):

| Class | True Negative Rate |
| --- | --- |
| Setosa | 1.0 |
| Versicolor | 1.0 |
| Virginica | 0.95 |

Feature Importance:

* Petal Length: 0.43

* Petal Width: 0.31

* Sepal Length: 0.16

The feature importance analysis shows that petal length and width are the most important variables for predicting class labels, while sepal length has a lesser influence.

Conclusion

In this article, we walked through the process of building, training, and evaluating an iris flower classification model using the caret package in R. We applied 10-fold cross-validation to obtain a more reliable estimate of performance and examined feature importance to understand which variables contributed most to the model's predictions. Together, these performance metrics provide a comprehensive picture of the model's strengths and weaknesses, enabling us to refine the model further and achieve better results.

"WEBVTTKind: captionsLanguage: enwelcome back to the data professor YouTube channel if you new here my name is Tennant aston ahmad and i'm an associate professor of bioinformatics on this youtube channel we cover about data science concepts and practical tutorials so if you're into this kind of content please consider subscribing so in this video we're going to cover about how you can build your own classification model using our and also the iris data set in this video we're going to use the carrot package which represents a collection of machine learning algorithms and we're gonna use the iris data set as an example okay so without further ado let's get started okay so head over to the github of the data professor links are down below in the description so what you want to do now is you want to import the diarist dataset so you want to run the library dataset and then you want to import the carrot package as well and then the next step is to import the iris dataset so you can just control in turn you see that you have your own data set and you want to check whether your data set has any missing values so a good practice is to check for that so what you can do is run the summation function as I have mentioned in the previous video and then you add the function is dot and a which will determine whether there will be any missing value or not in your iris data frame so can swinter okay and when you build your classification model by default our will assign random numbers to the model seed number so what you want to do is that you want to set the seed number to a fixed number okay in this example we're going to use the 100 so the benefit of this is that when you rerun the same model 100 times or 2,000 times or 10,000 times you will get the same result but if you don't set the seat number each time you run the code you will get different results so for we put the stability of the model we will set the seed to a fixed number it would set that to 100 okay and then the next step we're going to do some data splitting and you might be wondering why do we need to split the data because we want to simulate the situation in which we have a data set which we used to train the model and then we want to see whether the we'll be applicable to future data so what we want to do is we have the Irish data set so we're gonna split into two parts one part will be the training set which we will use to create a training model and then we going to apply the training model to predict the class label in the 20% of the testing center okay so how are we exactly going to split the data we're going to split it into two parts the first part will be the training set which will represent 80% of the data sample and the second part is the testing set which will represent 20% of the data set and so we're gonna use the 80 percent to build a training model and then we're going to apply the training model to predict the class label of the testing set which contains 20% of the data set okay so how are we going to do that we're going to define a variable called training in depth so essentially we're going to then use the create data partition function and then we're going to specify the name of the class label which is the species variable so then we say Irish dollar sign species okay so that's the class label comma P equals to 0.8 so this will specify that 80% will go into the training set and then list to be false okay so let's run this and see what happens click on the 20 index so what we get is the index of the data set so 
essentially it's kind of like the ID of each role okay each flour will be assigned an ID a unique ID okay because there is 150 flowers so we're going to get 120 roles here as you can see it's randomly selected out of the 150 flowers and 120 flower roll we're percent 80% of the data set whereas the remaining 20% will be in the testing set which will be in these subsequent lines of code so what we do here is then within the bracket we're going to specify the name of the training index which we have seen previously that it is essentially a list of the ID in the iris data set and in order to create a subset containing 80% of the original iris data set we're going to specify the following lines of we're going to create a variable called training set and within there we're going to type in iris bracket and then training in that comma and in closing bracket and so we're going to get the 80% subset of the initial 150 flower which will contain 120 flowers okay and then for the testing set we're going to add a - a - in front of the 20 index and then we're going to get the remaining 20% and so let's have a look at the testing set which contains here as you see 30 observation and 5 variables so here you will have a total of 30 flowers okay and in the training set you have a total of 120 flowers which represents the 80% okay here there's 120 flowers okay 150 is the ID number okay if you do a count it's 120 flower for the training set and 30 flowers for the testing set so how about I give this ass your homework so you can copy the code from the previous video on how you can make a scatterplot and then create scatterplot for the 80% subset and the 20% subset which represents the training set and the testing set so you want to visualize how this the data actually look like for the 80% subset and for the 20% subset okay so this is your homework so give it a try and let me know in the comments whether the distribution are roughly the same or are they quite different so let's scroll down to the code and then we will see that the following code will allow us to build a support vector machine model using the polynomial kernel so as I mentioned previously that the carrot package contains a lot of machine learning algorithms and there's more than twenty or thirty algorithms that you can choose from and so in this example I'm going to use the support vector machine polynomial kernel and for each learning algorithm there will be hyper parameters that you have to optimize but for this tutorial is beyond the scope so I'm not going to cover that here so we're going to just set the default of the C parameters one and getting scale equal to one degree equals to one so we're not gonna optimize the parameters here okay so in this block of code it will create the necessary parameters needed to create your very own classification model okay so here we are going to use the train function and then parentheses and then the species is the class label name so let's say that your data set has a class label name of species therefore you will specify species here however if your class label is something else you have to replace this with that particular variable name okay and then data equals to the training set which is the 80% sub set that you will use to create the training model but then the method will be the SVM polynomial kernel and whenever there is a missing value it will perform omission it will just delete it and prior to building the model it will perform some data pre-processing which means it will scale the data 
according to the mean centering so what it will do is for each variable it will compute the mean value and then it was a track each value of each role it was subtract with the mean value of each column okay so for each variable you will have a mean of 0 so after you subtract each value by its column mean value and then the resulting mean if you compute it for the second time it will be 0 so I'm gonna go over this basic concept in future videos and I will show you graphically but you can also try that yourself in an excel file and let me know down below in the comments how it goes ok so because we're going to build the training model there for the train control function the argument we use will be method equals to none and this is the training model so you might be wondering what what is a training model and below we have this thing called CV model what's a CV model okay so let me explain it like this a training model will essentially use the training set to build the model ok the training set as we recall it's the 80% subset of the data set so we're gonna use the 80% subset to build a training model once we have the training model built we will apply this training model to predict the class label in the testing said which represent the 20% subset of the original dataset okay so we applied a training model to predict the class label of the testing set in K so that's the external set which we will cover down below so this training model will allow us to predict two different data set it will allow us to predict the training set itself and also to predict the testing stand it might sound a bit confusing right now but I'm going to cover it down below as well in the subsequent lines of code however for a cross-validation model what we will essentially do is we will set the K fold to be 10-fold cross-validation by 10-fold cross-validation what does this mean so the input data will be the training set which represents 80% of the subsets right so it will use the 80% which is containing 120 flowers and so the cross-validation will be 10 fold so 120 flower will be divided into 10 sub groups and each sub group will contain 12 flowers okay let's just say that I have 10 Saboo which will represent each of my 10 fingers let's say that I take out this as the left out fold okay I'm gonna let leave this out so I have the remaining 9 okay I have the remaining 9 I'm gonna use this 9 to create a training model and once I've done that I'm gonna use the training model to predict the class label of 12 flowers in this subgroup okay this will represent the first iteration and then the second iteration I will take this sub group back and I will leave this the second group out okay so the first sub group which was left out in the first iteration will now constitute among one of the nine groups in the training model okay so I'm gonna use the now 9 sub group to create the training model and once I created the training model I will use it to predict the flower the 12 flower in the second sub group this is the second iteration and I do this on and on until each of the group will be left out one time okay and then average over the prediction performance so that will be the cross-validation okay so once you have built the model then you want to apply the model for the prediction okay so the code up here will allow us to build a training model and allow us to build a cross-validation model okay in the next three lines we will be able to perform various prediction tasks so the first line modeled training I will use the 
predict function so I will use the model that was obtained here which is called model okay so model here that's the argument and then I will apply this prediction model to predict the class label of the hundred and twenty flower of the training set so this represents the training set prediction okay that's the first prediction model that I will obtain and then the second prediction model that I will obtain is called the model testing okay so I'm going to apply the training model to predict the class label of the 30 flowers okay so this is the second prediction okay so we call again the first prediction model we'll use 120 flowers to create a prediction model and apply that prediction model to predict 120 flowers okay so that represent the first model which is called a model training set and then the second model is the model testing set will allow me to take the 120 flower create a prediction model and apply the prediction model on the 30 flowers which is the 20% subset and in the third prediction model I will use the training set which has 120 flowers as I have mentioned already partitioning into 10 part I performed for 10 iteration where each iteration I we use nine-fold to create a training model leave 1 fold out right and then apply the prediction model to predict the left our group and perform this 10 time and average over the performance so that will be the cross-validation so let's run each of the lines of code okay I haven't run a training model yet so let's built the model built the model okay perform the prediction perform the prediction of the three models and then the next step is to look at the prediction performance so I will do that by creating a variable called model dog training dog confusion and I'm going to use the confusion matrix function with the capital M for matrix and then my arguments here will be modeled on training comma training set dollar sign species which is representing the class label okay and then the second performance will be model testing and then comma second argument will be testing set dollar sign species okay and then the third performance will be the model dot CV because we use the cross-validation model and then training set dollar sign species okay so let's run each of that control enter and so enter and then I'm going to print the resulting performance and here we go so this first block of result will be the confusion matrix so the confusion matrix will be a three by three table so each column will represent the actual class label and each row will represent a predicted class label so theoretically in order to have perfect prediction we should have a diagonal of 40 40 40 because there's a total of 40 seat OSA for the versicolor and 40 virginica and so we see that on the second line there are 40 versicolor that has been predicted to be 40 versicolor whereas one of the virginica is predicted to be diversity color as we have seen here one okay so we see that from this confusion matrix one of the virginica is miss predicted to be a versicolor so the overall statistic is that the accuracy is zero nine nine one seven so that's the one value which tells us the performance of the prediction model and as I will show in a future video accuracy is not the best way to measure the performance of the model particularly if your data set is imbalanced okay I'm going to tell you which prediction performance metric you should be using in order to evaluate such results without any bias of the distribution that are unequal for the class labels and below there is 
the statistics by class so it will give us the sensitivity specificity positive predictive value negative predictive value prevalence detection rate detection prevalence and the balance accuracy for each of the three flower class so in order to look at the performance of each of them three model I will just control enter and then I will see the performance right so the accuracy of the testing set is 0.9 667 and I get to see the confusion matrix is small so here I see that out of the ten versicolor nine of them have been predicted to be versicolor whereas one of the adversity color is predicted to be a virginica okay so here we also see that confusion of the prediction model and let's have a look at the cross-validation set okay performance is also pretty good and accuracy 0.99 one seven and then the next block of code would give me the feature importance so I will create a variable called importance and I will be using the far iymp function okay and let's have to plot up the importance so I'm going to see for each of the iris flower class I will see which variable were the most important for the prediction for all of the iris flower we can see that the petal length was the most important variable followed by the petal width followed by the sepal length for virginica the cipa where did not influence the prediction result where asked for satou-san versicolor all four variables were affecting the prediction performance whereas pedaling and petal width or the most important feature for all three flower subclass whereas Sippel lane is important for c Tosa and virginica and to a lesser degree versicolor or as simple wit was only influencing see Tulsa and versicolor but not virginica if this will allow us to see the contribution of the importance of the variable to the prediction of the class label so you see that the font color is blue and if you want to change it to red at the argument Co l equals to ad and in the circle will be red color okay so this concludes this video and in future video I will be showing you more our data science project so please stay tuned thank you for watching please like subscribe and share and I'll see you in the next one but in the meantime please check out these videoswelcome back to the data professor YouTube channel if you new here my name is Tennant aston ahmad and i'm an associate professor of bioinformatics on this youtube channel we cover about data science concepts and practical tutorials so if you're into this kind of content please consider subscribing so in this video we're going to cover about how you can build your own classification model using our and also the iris data set in this video we're going to use the carrot package which represents a collection of machine learning algorithms and we're gonna use the iris data set as an example okay so without further ado let's get started okay so head over to the github of the data professor links are down below in the description so what you want to do now is you want to import the diarist dataset so you want to run the library dataset and then you want to import the carrot package as well and then the next step is to import the iris dataset so you can just control in turn you see that you have your own data set and you want to check whether your data set has any missing values so a good practice is to check for that so what you can do is run the summation function as I have mentioned in the previous video and then you add the function is dot and a which will determine whether there will be any missing value 
or not in your iris data frame so can swinter okay and when you build your classification model by default our will assign random numbers to the model seed number so what you want to do is that you want to set the seed number to a fixed number okay in this example we're going to use the 100 so the benefit of this is that when you rerun the same model 100 times or 2,000 times or 10,000 times you will get the same result but if you don't set the seat number each time you run the code you will get different results so for we put the stability of the model we will set the seed to a fixed number it would set that to 100 okay and then the next step we're going to do some data splitting and you might be wondering why do we need to split the data because we want to simulate the situation in which we have a data set which we used to train the model and then we want to see whether the we'll be applicable to future data so what we want to do is we have the Irish data set so we're gonna split into two parts one part will be the training set which we will use to create a training model and then we going to apply the training model to predict the class label in the 20% of the testing center okay so how are we exactly going to split the data we're going to split it into two parts the first part will be the training set which will represent 80% of the data sample and the second part is the testing set which will represent 20% of the data set and so we're gonna use the 80 percent to build a training model and then we're going to apply the training model to predict the class label of the testing set which contains 20% of the data set okay so how are we going to do that we're going to define a variable called training in depth so essentially we're going to then use the create data partition function and then we're going to specify the name of the class label which is the species variable so then we say Irish dollar sign species okay so that's the class label comma P equals to 0.8 so this will specify that 80% will go into the training set and then list to be false okay so let's run this and see what happens click on the 20 index so what we get is the index of the data set so essentially it's kind of like the ID of each role okay each flour will be assigned an ID a unique ID okay because there is 150 flowers so we're going to get 120 roles here as you can see it's randomly selected out of the 150 flowers and 120 flower roll we're percent 80% of the data set whereas the remaining 20% will be in the testing set which will be in these subsequent lines of code so what we do here is then within the bracket we're going to specify the name of the training index which we have seen previously that it is essentially a list of the ID in the iris data set and in order to create a subset containing 80% of the original iris data set we're going to specify the following lines of we're going to create a variable called training set and within there we're going to type in iris bracket and then training in that comma and in closing bracket and so we're going to get the 80% subset of the initial 150 flower which will contain 120 flowers okay and then for the testing set we're going to add a - a - in front of the 20 index and then we're going to get the remaining 20% and so let's have a look at the testing set which contains here as you see 30 observation and 5 variables so here you will have a total of 30 flowers okay and in the training set you have a total of 120 flowers which represents the 80% okay here there's 120 flowers 
okay 150 is the ID number okay if you do a count it's 120 flower for the training set and 30 flowers for the testing set so how about I give this ass your homework so you can copy the code from the previous video on how you can make a scatterplot and then create scatterplot for the 80% subset and the 20% subset which represents the training set and the testing set so you want to visualize how this the data actually look like for the 80% subset and for the 20% subset okay so this is your homework so give it a try and let me know in the comments whether the distribution are roughly the same or are they quite different so let's scroll down to the code and then we will see that the following code will allow us to build a support vector machine model using the polynomial kernel so as I mentioned previously that the carrot package contains a lot of machine learning algorithms and there's more than twenty or thirty algorithms that you can choose from and so in this example I'm going to use the support vector machine polynomial kernel and for each learning algorithm there will be hyper parameters that you have to optimize but for this tutorial is beyond the scope so I'm not going to cover that here so we're going to just set the default of the C parameters one and getting scale equal to one degree equals to one so we're not gonna optimize the parameters here okay so in this block of code it will create the necessary parameters needed to create your very own classification model okay so here we are going to use the train function and then parentheses and then the species is the class label name so let's say that your data set has a class label name of species therefore you will specify species here however if your class label is something else you have to replace this with that particular variable name okay and then data equals to the training set which is the 80% sub set that you will use to create the training model but then the method will be the SVM polynomial kernel and whenever there is a missing value it will perform omission it will just delete it and prior to building the model it will perform some data pre-processing which means it will scale the data according to the mean centering so what it will do is for each variable it will compute the mean value and then it was a track each value of each role it was subtract with the mean value of each column okay so for each variable you will have a mean of 0 so after you subtract each value by its column mean value and then the resulting mean if you compute it for the second time it will be 0 so I'm gonna go over this basic concept in future videos and I will show you graphically but you can also try that yourself in an excel file and let me know down below in the comments how it goes ok so because we're going to build the training model there for the train control function the argument we use will be method equals to none and this is the training model so you might be wondering what what is a training model and below we have this thing called CV model what's a CV model okay so let me explain it like this a training model will essentially use the training set to build the model ok the training set as we recall it's the 80% subset of the data set so we're gonna use the 80% subset to build a training model once we have the training model built we will apply this training model to predict the class label in the testing said which represent the 20% subset of the original dataset okay so we applied a training model to predict the class label of the 
testing set in K so that's the external set which we will cover down below so this training model will allow us to predict two different data set it will allow us to predict the training set itself and also to predict the testing stand it might sound a bit confusing right now but I'm going to cover it down below as well in the subsequent lines of code however for a cross-validation model what we will essentially do is we will set the K fold to be 10-fold cross-validation by 10-fold cross-validation what does this mean so the input data will be the training set which represents 80% of the subsets right so it will use the 80% which is containing 120 flowers and so the cross-validation will be 10 fold so 120 flower will be divided into 10 sub groups and each sub group will contain 12 flowers okay let's just say that I have 10 Saboo which will represent each of my 10 fingers let's say that I take out this as the left out fold okay I'm gonna let leave this out so I have the remaining 9 okay I have the remaining 9 I'm gonna use this 9 to create a training model and once I've done that I'm gonna use the training model to predict the class label of 12 flowers in this subgroup okay this will represent the first iteration and then the second iteration I will take this sub group back and I will leave this the second group out okay so the first sub group which was left out in the first iteration will now constitute among one of the nine groups in the training model okay so I'm gonna use the now 9 sub group to create the training model and once I created the training model I will use it to predict the flower the 12 flower in the second sub group this is the second iteration and I do this on and on until each of the group will be left out one time okay and then average over the prediction performance so that will be the cross-validation okay so once you have built the model then you want to apply the model for the prediction okay so the code up here will allow us to build a training model and allow us to build a cross-validation model okay in the next three lines we will be able to perform various prediction tasks so the first line modeled training I will use the predict function so I will use the model that was obtained here which is called model okay so model here that's the argument and then I will apply this prediction model to predict the class label of the hundred and twenty flower of the training set so this represents the training set prediction okay that's the first prediction model that I will obtain and then the second prediction model that I will obtain is called the model testing okay so I'm going to apply the training model to predict the class label of the 30 flowers okay so this is the second prediction okay so we call again the first prediction model we'll use 120 flowers to create a prediction model and apply that prediction model to predict 120 flowers okay so that represent the first model which is called a model training set and then the second model is the model testing set will allow me to take the 120 flower create a prediction model and apply the prediction model on the 30 flowers which is the 20% subset and in the third prediction model I will use the training set which has 120 flowers as I have mentioned already partitioning into 10 part I performed for 10 iteration where each iteration I we use nine-fold to create a training model leave 1 fold out right and then apply the prediction model to predict the left our group and perform this 10 time and average over the performance 
so that will be the cross-validation so let's run each of the lines of code okay I haven't run a training model yet so let's built the model built the model okay perform the prediction perform the prediction of the three models and then the next step is to look at the prediction performance so I will do that by creating a variable called model dog training dog confusion and I'm going to use the confusion matrix function with the capital M for matrix and then my arguments here will be modeled on training comma training set dollar sign species which is representing the class label okay and then the second performance will be model testing and then comma second argument will be testing set dollar sign species okay and then the third performance will be the model dot CV because we use the cross-validation model and then training set dollar sign species okay so let's run each of that control enter and so enter and then I'm going to print the resulting performance and here we go so this first block of result will be the confusion matrix so the confusion matrix will be a three by three table so each column will represent the actual class label and each row will represent a predicted class label so theoretically in order to have perfect prediction we should have a diagonal of 40 40 40 because there's a total of 40 seat OSA for the versicolor and 40 virginica and so we see that on the second line there are 40 versicolor that has been predicted to be 40 versicolor whereas one of the virginica is predicted to be diversity color as we have seen here one okay so we see that from this confusion matrix one of the virginica is miss predicted to be a versicolor so the overall statistic is that the accuracy is zero nine nine one seven so that's the one value which tells us the performance of the prediction model and as I will show in a future video accuracy is not the best way to measure the performance of the model particularly if your data set is imbalanced okay I'm going to tell you which prediction performance metric you should be using in order to evaluate such results without any bias of the distribution that are unequal for the class labels and below there is the statistics by class so it will give us the sensitivity specificity positive predictive value negative predictive value prevalence detection rate detection prevalence and the balance accuracy for each of the three flower class so in order to look at the performance of each of them three model I will just control enter and then I will see the performance right so the accuracy of the testing set is 0.9 667 and I get to see the confusion matrix is small so here I see that out of the ten versicolor nine of them have been predicted to be versicolor whereas one of the adversity color is predicted to be a virginica okay so here we also see that confusion of the prediction model and let's have a look at the cross-validation set okay performance is also pretty good and accuracy 0.99 one seven and then the next block of code would give me the feature importance so I will create a variable called importance and I will be using the far iymp function okay and let's have to plot up the importance so I'm going to see for each of the iris flower class I will see which variable were the most important for the prediction for all of the iris flower we can see that the petal length was the most important variable followed by the petal width followed by the sepal length for virginica the cipa where did not influence the prediction result where asked for satou-san 
versicolor all four variables were affecting the prediction performance whereas pedaling and petal width or the most important feature for all three flower subclass whereas Sippel lane is important for c Tosa and virginica and to a lesser degree versicolor or as simple wit was only influencing see Tulsa and versicolor but not virginica if this will allow us to see the contribution of the importance of the variable to the prediction of the class label so you see that the font color is blue and if you want to change it to red at the argument Co l equals to ad and in the circle will be red color okay so this concludes this video and in future video I will be showing you more our data science project so please stay tuned thank you for watching please like subscribe and share and I'll see you in the next one but in the meantime please check out these videos\n"