How to Examine Chance Prediction of a Machine Learning Model (Y-Scrambling / Y-Permutation)
Visualizing Model Performance using Histograms and Scrambled Data Sets
In this article, we will explore how to visualize model performance using histograms and scrambled data sets. We will use Python and its popular libraries, seaborn and matplotlib, to create a histogram plot that compares the original data set with 1000 shuffled or scrambled versions of the data.
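The overall workflow can be sketched in Python. This is a minimal sketch, assuming the Delaney solubility data set has already been loaded into a feature DataFrame `X` and a target Series `y` (e.g. with `pd.read_csv` from the Data Professor GitHub, as in the companion video); the helper name `y_scramble` is ours, introduced for illustration.

```python
# Minimal sketch of the y-scrambling workflow. Assumes X (features) and
# y (target, e.g. LogS) are a pandas DataFrame and Series; the function
# name y_scramble is a hypothetical helper, not from the original code.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def y_scramble(X: pd.DataFrame, y: pd.Series, n_runs: int = 1000):
    """Return the original model's training r2 and a list of r2 scores
    from n_runs models trained on a shuffled (scrambled) y column."""
    # Original model: true X-y pairing
    X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    r2_original = r2_score(y_train, model.predict(X_train))

    # Scrambled models: shuffle the entire y column (frac=1) with a new
    # seed per run, breaking the X-y pairing before the split
    r2_train = []
    for i in range(n_runs):
        y_shuffled = y.sample(frac=1, replace=False, random_state=i).reset_index(drop=True)
        X_tr, _, y_tr, _ = train_test_split(X, y_shuffled, test_size=0.2, random_state=0)
        m = LinearRegression().fit(X_tr, y_tr)
        r2_train.append(r2_score(y_tr, m.predict(X_tr)))
    return r2_original, r2_train
```

On the Delaney data set this yields an original training r^2 of about 0.77 (0.7692 in the video), while the scrambled scores cluster near zero.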
We start by importing the necessary libraries: seaborn as sns and matplotlib.pyplot as plt. We then set the plot style to white and adjust the figure size, making it wide and short to give a better visual representation of the data. Next, we create the histogram plot using seaborn's histplot() function, passing in the r2_train list as the input argument.
The histogram plot shows the distribution of the 1000 scrambled models' r^2 scores, binned into bars. The color sky blue is used to make the plot visually appealing. We also add a vertical line at the original model's r^2 value to compare it with the scrambled models. However, we notice that the vertical line does not appear at the expected r^2 value, because the r2 variable has been overwritten by the scrambled-model loop.
To fix this issue, we rename the scrambled model's r^2 variable to r2b and use r2a for the original model. Re-running the original model's code with r2a gives an r^2 value of 0.76, and appending r2b inside the loop reproduces the same scrambled scores as before, since the random seed has been set.
We then zoom in on the histogram plot, setting the x-axis limits to focus on the range between 0 and 0.1. We also change the number of bins from 10 to 5, which results in fewer, wider bars. With these adjustments, we can better visualize the distribution of the scrambled models and compare it with the original model.
The histogram plot shows that the scrambled models perform very poorly, with r^2 values distributed near zero, while the original model has a high r^2 score of 0.76. This demonstrates that the original model's performance is not occurring by chance: only the true X-y pairing yields good predictions, whereas the shuffled data sets do not.
This technique can be applied to both regression and classification problems. Instead of the r^2 score, you can use the accuracy score or the Matthews correlation coefficient (MCC) to evaluate model performance. We encourage readers to try this experiment on another data set and share their results in the comments section.
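For classification, the same scheme works with a classification metric in place of r^2. A hedged sketch, using scikit-learn's built-in breast cancer data set and logistic regression as stand-ins (neither appears in the original article), scoring with accuracy and MCC:

```python
# Label-scrambling sketch for classification. Data set, model, and run
# count are stand-in assumptions, not from the original article.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

X, y = load_breast_cancer(return_X_y=True)

# Original model on the true labels
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
acc_original = accuracy_score(y_train, model.predict(X_train))
mcc_original = matthews_corrcoef(y_train, model.predict(X_train))

# Label-scrambled models: shuffle y with a new seed per run
# (20 runs here for speed; the article uses 1000)
acc_scrambled = []
for i in range(20):
    y_shuffled = shuffle(y, random_state=i)
    X_tr, _, y_tr, _ = train_test_split(X, y_shuffled, test_size=0.2, random_state=0)
    m = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    m.fit(X_tr, y_tr)
    acc_scrambled.append(accuracy_score(y_tr, m.predict(X_tr)))
```

As with the regression case, the scrambled accuracies should hover near the majority-class baseline while the original model scores much higher.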
By visualizing model performance using histograms and scrambled data sets, we can gain insights into how our models are making predictions. This is an essential step in machine learning and data science, as it allows us to identify areas for improvement and refine our models to achieve better performance.
when you're building a machine learning model and your model is performing very well have you ever wondered whether the prediction or classification or regression performance is occurring by chance and so in this video we will be examining that doubt by exploring the concept of y-scrambling also known as y-shuffling or y-permutation and in plain english it means that the y column will be subjected to the shuffling of the values and this will result in the creation of fake x-y pairs meaning that the x and y pairs will not be valid because we're randomly moving the values from the y column to be in a random order and so what used to be the correct x-y pair will now be replaced with another y value and therefore the x-y pair that you see in the permutated data set will be wrong and so the permutated data set that we will be using for building machine learning models is expected to yield pretty poor performance and so without further ado we're starting right now okay so the data set that we're going to be using today is the solubility data set and actually i've also made a prior video on this channel and so i'm going to provide you the links in the video description okay so let's get started so the first thing that you want to do is you want to read in the data set but before that let's import pandas as pd and here we're going to be taking in the Delaney solubility data set directly from the github of Data Professor and then we're going to be assigning it to the dataset variable and here you're going to be seeing the five columns here where the four columns that you see here comprising MolLogP molecular weight number of rotatable bonds and aromatic proportion will constitute the x variables and the LogS will be the y variable so it's going to be the column that we're going to be predicting as a function of the four prior columns okay and so by performing y-scrambling right here we're going to be shuffling the order of LogS and by
shuffling let's think about like a deck of cards where you're shuffling the order of each card in a deck and so the values here will be shuffled like for example the value at position one could be moved to position five and then position five could be moved to position ten and position ten could be moved to position thirty right and position thirty-five could be shifted to position twenty-five so the ordering will be shuffled in a random manner okay so let me show you here so first let's take the x and y directly from the dataset data frame and so here to get x we're going to just simply drop the LogS column right here we're going to drop the LogS column and so all of here will be x okay so the first four columns will be x and the last column will be y and we are doing it right here all right and so the scrambling component is right here okay so we're going to take the y variable the y variable and then we're going to apply the sample function and as input argument we're going to use frac equal to 1
and so this means that we're going to apply this to the entire column and then we're going to have replace equal to false and then we're going to have the random state equal to zero okay so let's take a look let's compare the original and the y-scrambled model and so firstly here we're going to be building the original model x-y pair okay so we're going to be importing the necessary libraries so we're going to be importing train test split and this will be performing the data split and linear regression so we're going to be building a simple linear regression model and because the data set LogS is essentially a quantitative value we're going to be performing regression model building and so in order to measure the performance of the model we're going to use the r2 score okay so as previously shown above we're going to create the x and y variables directly from the dataset data frame so simply dropping the LogS column and then we get the x and if we select the last column here denoted by the index location of -1 we're going to assign the last column to the y variable and then we're going to be performing data splits using the train test split function where the input arguments will be x and y and so we're going to be using a split ratio of 80/20 and therefore we will assign a value of 0.2 for the test size meaning that 20 percent will go to the testing set and the remaining 80 percent will go to the training set and here for reproducibility we're going to set the seed number to zero and so in the original x-y pair we're not performing any form of y-shuffling and therefore we're leaving it blank here okay so you will see in just a moment that when we perform y-shuffling we're going to add another line of code right here and now we're going to be building the model and so we will be initiating the model by creating a variable called model and then we're going to use the linear regression function and then we're going to apply this model to build the model by using the fit function and
as input arguments we're going to be using X_train and y_train which represent the 80 percent of the data set and so after running this line of code we will have already built the model and now we're going to be applying that model to make a prediction using the predict function and as input argument we're going to be using X_train and so also the 80 percent subset and so after predictions have been made by the model on the X_train subset we're going to be assigning the predicted values into the y_train_pred variable and so the next step is to compute the r2 score or essentially the r squared value and so the r squared is the goodness of fit and it is the squared value of the pearson's correlation coefficient and in order to do that we're going to use the r2 score function from sklearn.metrics and so as input arguments we're going to be using y_train which is the actual value and then y_train_pred which is the predicted value and then we will be assigning the r2 score into the r2 variable and then finally we will be printing it out here and then let's run it and so the r squared value will yield a value of 0.7692 okay and so let's have a look at the plot of the predicted versus actual and this is also taken from the prior video so do check that video out and so here you're going to be seeing the experimental versus predicted so it's the actual value versus the predicted value and so you're going to be seeing pretty good prediction performance here okay and now let's build the y-scrambled model and so here we're going to be expecting that the y-scrambled model will perform worse than the original data set or the original model and so let's get started here so like the original model we're going to be importing the necessary libraries and so the first thing that you notice here is that we're creating an empty list for the r2 scores that we will be getting from the permutated runs or the y-scrambled models the performance of the y-scrambled models will be saved into this variable
okay and so the steps here are performed in the exact same manner as the original model so the code looks essentially the same and then all of here will look quite similar except for the addition of this line and also the addition of this line where we are going to be using a for loop in order to iterate through a thousand cycles where each iteration i will start from the value of zero up until the value of 999 which will complete the 1000 iterations at each iteration the random state value here will be changed and so upon changing the seed number or the random state it means that the shuffling will also change and so when the shuffling of the y column changes then we can also expect that the resulting model will also change and so at the end here we're going to be using the list here which is the empty list and then we're going to be appending the r2 score okay and so the r2 score is derived when we're using the r2 score function where we're using as input arguments the y_train and the y_train_pred as also mentioned earlier on which come from the application of the model that was trained using linear regression and using the training set or the 80 percent subset and so finally we will be taking a look at the list of the 1000 r squared values and so let's have a look all right and so here you can see that each line here represents a shuffled or a scrambled or a permutated model so we're going to have a thousand members in this particular list okay a thousand members okay and in order to visualize this let's make use of the histogram plot okay so we're going to import seaborn as sns and in order to customize the plot we're going to be importing matplotlib.pyplot as plt and we'll be setting the style to white color and then we're adjusting the image size so that it is 20 wide and two high and then here we're going to be creating the histogram plot using the histplot function where the data input argument will be the r2 train list so the list that you saw above
here and then we're going to be making use of the color sky blue and we're also going to show the kde overlaid on top of the histogram bars and here we're using a value of 10 for the bins okay so bins will essentially generate the bars that you will see and then we're also going to be adding a vertical line corresponding to the r squared value of the original model in comparison to the histogram bars from the scrambled models the 1000 scrambled models which will be binned as you will see in just a moment okay and finally we will be setting the limit of the x-axis to be within the range of zero and one okay so let's run this code okay so it seems that the r2 value is not showing right here at 0.76 and i think it's because we have overwritten the r2 value in the scrambled model here so why don't i rename this let's call it r2b and then we call here r2b and then for the original model let's call it r2a and let's use r2a okay let's run it again all right and so it's 0.7692 and let's run this again now we have r2b here and we append it to be r2b run it again okay because we have set the seed number it means that the result will be the same as previously and let's run this again so this should be r2a now okay there you go
scrambled model which is expected to perform quite poor and they're distributed near zero and the original model has a very high r2 score of 0.76 and so here you see this disparity of the expected poor performance of the model and the good performance of the original data sets and so there will be an expected different distribution if you're using a different data set so try this code on another data set and let me know in the comments how the distribution of the scrambled model whether they changed or not and definitely you can try this on a classification problem and instead of using the r2 score you could use the accuracy score or the mcc score okay so please do try it out and feel free to share with all of us here what kind of results that you're getting so a quick recap of the video so this video shows you how the models are making the prediction when we're comparing between the original data sets with the permutated or the shuffled data set where particularly the y column are subjected to permutation or shuffling where the values of the rows in the y column are shuffled where the ordering are changed and so now we can see the histogram bars of all of the 1000 models and they're shown in the histogram and the actual data sets or the original data set are shown by this vertical line and so this provides us a quick visualization that our model is not performing by chance and so there's not a random prediction and that they're actually making good prediction here if you're finding value in this video please give it a thumbs up subscribe if you haven't already hit on the notification bell in order to be notified of the next video and as always the best way to learn data science is to do data science and please enjoy the journeywhen you're building a machine learning model and your model is performing very good have you ever wondered whether the prediction or classification or regression is the performance occurring by chance and so in this video we will be 
examining that doubt by exploring the concept of why scrambling also known as y shuffling or y permutation and in plain english it means that the y column will be subjected to the shuffling of the values and this will result in the creation of fake xy pairs meaning that the x and y pairs will not be valid because we're randomly moving the values from the y column to be in a random order and so what used to be the correct xy pair will now be replaced with another y value and therefore the xy pair that you see in the permutated data set will be wrong and so the permutated data set that we will be using for building machine learning models is expected to yield pretty poor performance and so without further ado we're starting right now okay so the data set that we're going to be using today is the solubility data set and actually i've also made a prior video on this channel and so i'm going to provide you the links in the video description okay so let's get started so the first thing that you want to do is you want to read in the data sets but before that let's import pandas as pd and here we're going to be taking in the delani solubility data set directly from github of data professor and then we're going to be assigning it to the data set variable and here you're going to be seeing the five columns here where the four columns that you see here comprising of moloch p molecular weight non-rotatable bonds aromatic proportion these will constitute the x variable and the log s will be the y variable so it's going to be the column that we're going to be predicting as a function of the four prior columns okay and so by performing y scrambling right here we're going to be shuffling the order of log s and by shuffling let's think about like a deck of cards where you're shuffling the order of each card in a deck and so the values here will be shuffled like for example the value at position one could be moved to position five and then position five could be moved to position 
ten and position ten could be moved to position thirtieth right and precision thirty f could be shifted to position twenty five so the ordering will be shuffled in a random manner okay so let me show you here so first let's take the x and y directly from the data set data frame and so here to get x we're going to just simply drop the block s column right here we're going to drop the block s column and so all of here will be x okay so the first four columns will be x and the last column will be y and we are doing it right here all right and so the scrambling component is right here okay so we're going to take the y variable the y variable and then we're going to apply the sample function and as input argument we're going to use the frac equals to 1. and so this means that we're going to apply this to the entire column and then we're going to have replace equal to false and then we're going to have the random state equal to zero okay so let's take a look let's compare the original and the y scrambled model and so firstly here we're going to be building the original model x y pair okay so we're going to be importing the necessary libraries so we're going to be importing the train test split and this will be performing the data split linear regression so we're going to be building a simple linear regression model and because the data set log s is essentially a quantitative value we're going to be performing a regression model building and so in order to measure the performance of the model we're going to use the r2 score okay so as previously shown above we're going to create the x and y variables directly from the data set data frame so simply dropping the lock s column and then we get the x and if we select the last column here denoted by the index location of -1 we're going to assign the last column to the y variable and then we're going to be performing data splits using the train test split function where the input argument will be x and y and so we're going to be 
using a split ratio of 80 20 and therefore we will assign a value of 0.2 for the test size means that 20 will go to the testing set and the remaining 80 will go to the training set and here for reproducibility we're going to set the seed number to zero and so in the original x y pair we're not performing any form of y shuffling and therefore we're leaving it blank here okay so you will see in just a moment that when we perform y shuffling we're going to add another line of code right here and now we're going to be building the model and so we will be initiating the model by creating a variable called model and then we're going to use linear regression function and then we're going to apply this model to build the model by using the fit function and as input argument we're going to be using xtreme and y train which represents the 80 of the data sets and so after running this line of code we will have already built the model and now we're going to be applying that model to make a prediction using the predict function and as input argument we're going to be using x train and so also the 80 subset and so after predictions have been made by the model on the xtrain subset we're going to be assigning the predicted value into the y train thread variable and so the next step is to compute the r2 score or essentially the r squared value and so the r square is the goodness of fit and it is a squared value of the pearson's correlation coefficient and in order to do that we're going to use the r2 score function from the sklearn.metrics and so as input argument we're going to be using y train which is the actual value and then y train thread which is the predicted value and then we will be assigning the r2 score into the r2 variable and then finally we will be printing it out here and then let's run it and so the r squared value will yield a value of 0.7692 okay and so let's have a look at the plot of the predicted versus actual and this is also taken from the prior video so do 
check that video out and so here you're going to be seeing the experimental versus predicted so it's the actual value versus the predicted value and so you're going to be seeing pretty good prediction performance here okay and now let's build the y scrambled model and so here we're going to be expecting that the y scrambled model will perform worser than the original data set or the original model and so let's get started here so like the original model we're going to be importing the necessary libraries and so the first thing that you notice here is that we're creating an empty list for the r2 scores that we will be getting from the permutated runs or the y scrambled models the performance of the y scrambled models will be saved into this variable okay and so the steps here is performed in the exact same manner as the original model so the code looks essentially the same and then all of here will look quite similar except for the addition of this line and also the addition of this line where we are going to be using the for loop in order to iterate through a thousand cycle where each iteration i will start from the value of zero up until the value of 999 which will complete the 1000 iteration at each iteration the random state here value will be changed and so upon changing the seed number or the random states it means that the shuffling will also change and so when the shuffling of the y column changes then we can also expect that the resulting model will also change and so at the end here we're going to be using the list here which is the empty list and then we're going to be appending the r2 score okay and so the r2 score is derived when we're using the r2 score function where we're using as input argument the y train and the y train thread as also mentioned earlier on which comes from the application of the model that was trained using linear regression and using the training sets or the 80 subsets and so finally we will be taking a look at the list of the 
1000 r squared value and so let's have a look all right and so here you can see that each line here represents a shuffled or a scrambled or a permutated model so we're going to have a thousand member in this particular list okay a thousand members okay and in order to visualize this let's make use of the histogram plot okay so we're going to import seaborn as sns and in order to customize the plot we're going to be importing matplotlib dot pi plot as plt and we'll be setting the style to white color and then we're adjusting the image size so that it is 20 wide and two high and then here we're going to be creating the histogram plot using the hist plot function where the data input argument will be the r2 train list so the list that you have saw above here and then we're going to be making use of the color of sky blue and we're also going to show the kde overlaid on top of the histogram bars and here we're using a value of 10 for the bins okay so bins will essentially generate the bars that you will see and then we're also going to be adding a vertical line corresponding to the r squared value of the original model in comparison to the histogram bars from the scrambled models the 1000 scrambled models which will be binned as you will see in just a moment okay and finally we will be setting the limit of the x-axis to be within the range of zero and one okay so let's run this code okay so it seems that the r2 value is not showing right here at 0.76 and i think it's because we have overwritten the r2 value in the scrambled model here so why don't i rename this let's call it r2b and then we call here r2b and then for the original model let's call it r2a and let's use r2a okay let's run it again all right and so it's 0.7692 and let's run this again now we have r2b here and we append it to be r2b run it again okay because we have set the seat number it means that the result will be the same as previously and let's run this again so this should be r2a now okay there you go 
and so here r2a here will be taking the line here 0.76 which is the performance from the original model and the shuffled model or the scrambled model will be coming from here the r2 train list and so it's the histogram bars that you see right here okay so let me zoom it in for you let me change the x limit to be 0 and 0.1 and so here you see it right here okay so the number of bins that you see is 10 right now so actually there's like small bins right here okay so if we make it like five then you're gonna see fewer number of bars okay if we make it like 20 then the number of bars will be more and if we zoom out okay we barely see it because there's so many bars and the bars are pretty thin let's make it 10 again okay so there you have it you have the scrambled model which is expected to perform quite poor and they're distributed near zero and the original model has a very high r2 score of 0.76 and so here you see this disparity of the expected poor performance of the model and the good performance of the original data sets and so there will be an expected different distribution if you're using a different data set so try this code on another data set and let me know in the comments how the distribution of the scrambled model whether they changed or not and definitely you can try this on a classification problem and instead of using the r2 score you could use the accuracy score or the mcc score okay so please do try it out and feel free to share with all of us here what kind of results that you're getting so a quick recap of the video so this video shows you how the models are making the prediction when we're comparing between the original data sets with the permutated or the shuffled data set where particularly the y column are subjected to permutation or shuffling where the values of the rows in the y column are shuffled where the ordering are changed and so now we can see the histogram bars of all of the 1000 models and they're shown in the histogram and the 
actual data sets or the original data set are shown by this vertical line and so this provides us a quick visualization that our model is not performing by chance and so there's not a random prediction and that they're actually making good prediction here if you're finding value in this video please give it a thumbs up subscribe if you haven't already hit on the notification bell in order to be notified of the next video and as always the best way to learn data science is to do data science and please enjoy the journey\n"