Data Science for Computational Drug Discovery using Python (Part 2 with PyCaret)

The Importance of Model Interpretation in Data Science

One of the most significant challenges in data science is model interpretation. As we develop and train our models, it's easy to forget about the importance of understanding how they work and which features are contributing the most to their performance. This is where feature importance plots come in. A feature importance plot provides us with a visual representation of how each feature contributes to the model's predictions.

In this article, we'll delve into the world of feature importance plots and explore how they can help us better understand our models. We'll also discuss how the SHAP library can be used to create informative model-interpretation visualizations. Additionally, we'll work through a real-world example: interpreting a PyCaret regression model trained to predict aqueous solubility (logS) from molecular descriptors.
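For context, the interpretation plots discussed below come from a PyCaret regression model trained on that solubility data. The following sketch outlines the workflow under the assumption of PyCaret 2.x and the descriptor CSV generated in Part 2.1 of this series; the file name and column names are illustrative, so adjust them to your own data.

    import pandas as pd
    from pycaret.regression import setup, compare_models, create_model, tune_model

    # Delaney solubility data with four molecular descriptors (e.g. MolLogP, MolWt,
    # NumRotatableBonds, AromaticProportion) and the experimental target column logS.
    dataset = pd.read_csv('delaney_solubility_with_descriptors.csv')  # file from Part 2.1 (illustrative path)

    # 80/20 train/hold-out split; session_id fixes the random seed for reproducibility.
    setup(data=dataset, target='logS', train_size=0.8, session_id=7903, silent=True)

    # Cross-validate every available regressor and report the best one.
    best = compare_models()

    # In this series the Extra Trees regressor ('et') performed best, so we rebuild
    # it explicitly and tune its hyperparameters with MAE as the optimization metric.
    et = create_model('et')
    tuned_et = tune_model(et, n_iter=50, optimize='MAE')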

Feature Importance Plots: A Visual Representation

A feature importance plot is a graphical representation that ranks the features by how much they contribute to the model's predictions. On its own it simply ranks the features; paired with complementary views, such as a correlation plot, it also hints at how the features relate to one another. By examining these plots together, we can see which features matter most to our model and how they interact.
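In PyCaret, drawing such a plot for a trained model takes one call. A minimal sketch, assuming the tuned Extra Trees model (tuned_et) from the workflow sketch above:

    from pycaret.regression import plot_model

    # Bar chart ranking the molecular descriptors by their importance to the model.
    plot_model(tuned_et, plot='feature')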

In this example, PyCaret's feature importance plot shows that LogP is the most important descriptor, followed by molecular weight, aromatic proportion, and the number of rotatable bonds. However, what the plot does not show is how each feature contributes: whether a high LogP, for instance, pushes the predicted solubility up or down.

Using the SHAP Library for Model Interpretation

One of the strengths of the SHAP library is its ability to provide detailed, per-prediction insights into our models. Using SHAP, we can create a force plot that shows the push-and-pull effect each feature has on the base value of a prediction. This gives us a visual representation of how each feature contributes to an individual prediction.
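PyCaret exposes SHAP through its interpret_model function (available for tree-based models such as the Extra Trees regressor). A minimal sketch of the global summary view:

    from pycaret.regression import interpret_model

    # SHAP summary (beeswarm) plot: one point per sample and feature, coloured by the
    # feature's value; its horizontal position shows whether that value pushes the
    # predicted solubility above or below the base value.
    interpret_model(tuned_et)  # plot='summary' is the default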

In this example, the force plot shows that all four descriptors push the prediction below the base value of -6.72, i.e. each has a negative effect on the predicted output for this observation. Different models, and different observations, may use the features differently, with some features pushing the value higher and others pushing it lower.
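The per-observation force plot described above is exposed in PyCaret as the 'reason' plot. A sketch for a single sample; the observation index 0 is arbitrary:

    from pycaret.regression import interpret_model

    # Force ("reason") plot for one observation: features pushing the prediction above
    # the base value appear on one side, features pushing it below on the other.
    interpret_model(tuned_et, plot='reason', observation=0)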

Model Interpretation: Beyond Feature Importance Plots

While feature importance plots provide valuable insights into our models, they don't tell us everything we need to know. To truly understand how our models work, we need to dig deeper and examine other diagnostics, such as the correlation between features and the push-and-pull effect each feature has on the base value of our predictions.
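The push-and-pull on the base value has a precise meaning in SHAP: for any sample, the base (expected) value plus the sum of that sample's per-feature SHAP values reproduces the model's prediction. A small sketch using the shap library directly, assuming a fitted scikit-learn tree ensemble called model and a feature DataFrame X (both hypothetical names):

    import numpy as np
    import shap

    # TreeExplainer supports tree ensembles such as sklearn's ExtraTreesRegressor.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

    i = 0  # any row index
    base_value = float(np.ravel(explainer.expected_value)[0])
    reconstructed = base_value + shap_values[i].sum()

    # The reconstruction matches the model's prediction up to floating-point error.
    print(reconstructed, model.predict(X.iloc[[i]])[0])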

In this example, a correlation plot is used to examine how a descriptor's value relates to its effect on the prediction. The plot suggests that some of the descriptors interact with the model's output in non-trivial ways.
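In PyCaret this view is available as the 'correlation' interpretation plot, which is a SHAP dependence plot for a chosen feature. A minimal sketch; the feature name assumes the descriptor column labels from Part 2.1:

    from pycaret.regression import interpret_model

    # SHAP dependence ("correlation") plot: how one descriptor's value relates to its
    # contribution to the predicted solubility.
    interpret_model(tuned_et, plot='correlation', feature='MolLogP')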

Making Predictions with Our Model

Now that we've gained insights into our model's behavior, it's time to put the model to use. We use the tuned model to make predictions on the 20% hold-out test set that PyCaret set aside during setup, and the results are promising: the R² on the hold-out set is 0.8671, indicating that the model generalizes well to unseen compounds.
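Scoring the hold-out split (or any new compounds once their descriptors are computed) is a single call. A minimal sketch, assuming PyCaret 2.x, where the predicted value is returned in a column named Label:

    from pycaret.regression import predict_model

    # Evaluate the tuned model on the 20% hold-out split created by setup();
    # PyCaret prints the hold-out metrics and returns the scored data frame.
    holdout_pred = predict_model(tuned_et)
    print(holdout_pred[['logS', 'Label']].head())  # experimental vs predicted logS

    # The same call scores unseen data:
    # new_pred = predict_model(tuned_et, data=new_descriptors_df)  # hypothetical DataFrame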

Let's take a closer look at some of the predicted outputs and compare them to the experimental values. For example, one compound with a measured logS of -5.47 is predicted at -5.08, and another measured at -2.18 is predicted at -1.98. The discrepancies are small, and the predictions track the experimental values closely.

Conclusion

Model interpretation is a critical component of data science, and feature importance plots provide a valuable tool for gaining insights into our models. By examining feature importance plots and complementing them with the SHAP library, we can develop a deeper understanding of how our models work and which features contribute the most to their predictions. Examining other diagnostics, such as the correlation between features and the push-and-pull effect each feature has on the base value of a prediction, completes that picture.

In conclusion, model interpretation is an essential skill for data scientists, and feature importance plots are a powerful starting point. Combined with SHAP-based plots, they help us build models that are not only accurate but also understandable, which is what makes them useful in real-world applications such as solubility prediction.

"WEBVTTKind: captionsLanguage: enso a couple of weeks ago i've shown you how you could use the python library called the pi carat which is a low code machine learning library in python and the example data set that i've shown you was the iris data set and so perhaps you're wondering what if you have your own data set and you would like to use that for your machine learning model building in pi karen so the advantage of using pi caret is that it allows you to quickly prototype machine learning models meaning that you could quickly generate machine learning models based on several algorithms for example if you're using classification or if you're using regression then you could pretty much use all of the learning algorithms that are available from scikit learn and other packages such as xt boost or cat boost and so in my daytime job as an associate professor bioinformatics i also do a lot of research and as part of doing research we have to explore various machine learning algorithms and traditionally if we were to custom code machine learning pipelines in python using scikit-learn that would pretty much take us maybe a couple of hours but by using pi carrier this could pretty much be compressed in only a couple of minutes and so in today's video i'm going to show you how you could quickly generate machine learning models in pi caret and so the data set that we're going to be using today which is particularly based on the molecular solubility and so the data set was originally published in one of the journals by the american chemical society and so today we're going to reproduce that work and instead of using only linear regression we're also going to be using several machine learning algorithms provided by the automated pipeline of pi carrot and so the links to both of these videos that i have mentioned will be provided up and below in the description and so without further ado we're starting right now okay so in this tutorial i'm going to split the jupyter notebook file into two and the first one will be called part 2.1 and the second one will be called part 2.2 and so in part 2.1 we're going to be generating the molecular descriptor file which has also been covered in the previous video but today we're going to do that very rapidly and so if you would like to have a in-depth discussion or explanation of what each line of code is doing then you want to refer to that previous video and so please find the link up and below okay so let's get started we're going to install conda and the libraries and so this should take a couple of minutes and as previously shown in the private video we're using the dilani solubility dataset and the original paper is provided in this link and the original link to the data set is provided in this link as well in 2.1 and so i have already downloaded the data set onto the github of the data professor and so the links will be in here and so we could directly read it in using pandas and so thanks to boris for suggesting the use of the url directly in the read csv function of pandas okay let's have a look it's currently installing okay so we're installing conda python 3.7 and we're also installing rd kits and so rdkit will allow us to compute molecular descriptors and so this is a chem informatic python library and if you are curious about artikit this library is only available in python so if you're using r then you wanna give this package a try and and so perhaps it is one of the reasons for using python for chem informatics and vice versa there are packages only 
available for r and so to each their own so some library are in python some library are only exclusively on r and so that is why i'm using both languages okay so let's compute the descriptors okay we have to define this first read in the data set right here click the molecular descriptor okay so lengthier explanation will be provided in the previous video as mentioned so here we're just computing the descriptors and we're splitting the x and y matrices looking at the distribution and then combining it back in and then we're gonna write it out into a csv file and this csv file will be provided on the github of data professor so i have shown you this steps just in case that you would like to try this on your own chemical library okay so we have already created the file and i'm going to show you the link to this file which we are going to be using for part 2.2 okay so it's right here delani solubility with descriptors so the first one delany.csv let me show you so this will contain the raw data it's going to contain the name of the compounds the measured solubility the predicted solubility from the paper using the linear regression and the smiles notation and the smiles notation will be chemical information in the one dimension and so the descriptor calculating software in rd kit library will be using the smallest notation as the input okay and then it will generate a set of molecular descriptor which will be shown here so it's going to generate descriptors such as log p molecular weights number of rotatable bonds aromatic proportion and log s is the experimental values and so as noted previously in the prior video the aromatic proportion descriptor was generated using a custom function okay so this is the data set we're going to be using for part 2.2 all right so let's start by installing pi caret in a couple of seconds it will finish installation all right so let's load in the pi carrot library for regression and then we're gonna import pandas as pd and this is the link to the solubility data set i've mentioned along with the molecular descriptors that were calculated let's have a brief look okay so it's the same data set that i've previously mentioned so it has five columns and 1144 rows all right so let's start with the model building so the first step is okay so i have already run the above so why don't i delete this because i have already included here okay and then we're going to set up the model let's run it and so here we specify the name of the data set which is called data set here because we read it in as a data set data frame and then we're going to specify the target which is log s and so log s is right here the column called log s so this is the y variable that we will be predicting and the rest will be used as the x variable okay and notice that this is a simple pandas data frame and so we just read it indirectly from the csv file and then we're using it immediately in the setup function of pi caret and so here we're specifying the training set size to be eighty percent okay and we're making it silent equal to true so that we don't see any of the messages okay and let's proceed so the subsequent blocks of code here we'll be using the training set which is the 80 and then finally at the very end of this notebook i'm going to be using the trained model of the 80 in order to be testing it on the 20 okay so if you're new to machine learning and data science then i have created a simple visual guide to how to build a machine learning model so that article was published on medium 
and towards data science and so i'm going to provide you the link in the video description as well and so that article will be a gentle introduction to the field of machine learning and data science okay so let's continue hit on this cell compare models so as i mentioned previously we're going to be building several machine learning algorithms model that are provided by scikit-learn by catbooz by xgboost and so this is simply performed in only one line of code made possible by pi current and imagine if you were to code this manually all of these 21 models then it's gonna take you a couple of minutes if not minutes then a couple of hours at least a few hours okay so this is very conveniently done for you all right so let's have a look here so here we see that the best model is provided by the extra tree regression and this gives us a r square of 0.879 and let's compare that to the previous one that we have built let's go to code go to python and then it is the chem informatics predicting solubility and so this is from the previous video all right so let's have a look so the r square here is it's the r square on the test set right here r square on the training set is 77 0.77 and the one provided today is 0.879 a significant boost to the model performance and so we're going to continue further with the et regression okay and coming in in second place is random forest so as noted in one of the prior podcasts i've mentioned that random forest was one of my favorite learning algorithms and actually without using pie carrot i wouldn't have known that extra tree regression would have performed better than random forest and so this gave us fresh perspective in trying out new algorithms as well okay and so here we're going to continue to use the et algorithm which is the best performing one and it is abbreviated right here as et and so i'm going to define it et equals to create underscore model and then et let's run it okay and here we have a performance table showing all of the performance metric for the 10 cross validation and the mean value from the 10 cross validation is shown here along with the standard deviation so as you can see generally it is 0.8793 same as above left two in the model so by tuning the model we're going to optimize the parameters and let's see whether the performance will increase okay so this should take a couple of moments and so here we have set the number of iterations to be 50 and we're going to use the mean absolute error as the fitness function and so we're seeing that the r square is increasing in some of the fold okay and then from the 10 cross validation we saw that the performance increased slightly to 0.8854 and so it was previously eight seven nine three and it is now eight eight five four so this is the detail from the trained model and so for reproducibility you might wanna set the random state to be 7903 in order for it to give you the same results here okay so now comes the fun part let's have a look at the various plots for the models so let's have a look at the residuals and we're going to use the plot model function and the input argument here will be the name of the model et comma the residuals and so imagine creating this manually using seaborne or matpotlib so that might take you a couple of hours to do so okay and let's have a look at the prediction error plot okay so the scatter plot of the actual value and the predicted value cook's distance plots so for outlier detection recursive feature selection and in the background we're going to run the 
other one as well all right here so this is the recursive feature selection so it is shown here that out of the four molecule descriptors it is shown here that the use of only two feature could provide in excess of 0.85 for the r square and that the use of the remaining two descriptor will slightly improve the performance of the prediction okay so that's something interesting to see and let's have a look at the learning curve so the blue curve you see here are the training score and the green will be the cross validated score validation curve comparing the training score versus the cross validation score and the manifold learning plot using four features and the feature importance so we can see here that the log p is the highly ranked feature followed by the molecular weight aromatic proportion and rotatable bonds and so probably the two feature are log p and molecular weight from the above plot here right here number feature to be two and so these are the hyper parameters of the model number of estimators 100 random state 7903 and this is the hyper parameter of the tuned model so map step 40. number of estimator has been changed to 280 and the random state is the same at 7903 okay and here is the showing all the plots so you could click on each of these panel and it will show us but some of them were not working or maybe it's taking some time to run so let's continue model interpretation so the great thing i like about this package is the nice interpretation provided by the shape library and so here we're seeing the contribution of the features to the model and so as previously shown above generally the feature importance plot that we will be seeing will provide us only the information about which one was the most important and so an important point to note here is that whenever we make a feature importance plot we're going to see which feature are the most important for example we could see that moloch p provided the most variable importance followed by more weight aromatic proportion and number of rotatable bonds but what we're not seeing is that how are they contributing to the model for example if we have two classes class a and class b active compound and inactive compound and so we could see that log p is contributing the most but are they contributing the most to the active compound or are they contributing the most to the inactive compound okay and so with the shape package here we're gonna see the contribution of each feature looking at the shape value here okay whether it is bending toward the negative or whether it is tending toward the positive okay and let's have a look further okay this is the correlation plot and let's have a look here at the recent plot at the observation level and so this recent plot which is caught by pi turret and it is better known as the forest plot which is termed by the shape library and the plot will essentially describe the push and pull effect that each of the individual feature used to build the model has on the base value of the prediction so the base value of the prediction let's think of it as kind of like the y-intercept so y-intercept could be thought of as the base value for example if your equation is y equals to five x plus five and so the base value will be five and the coefficient value five x that will be the feature importance okay so for a simple linear regression there's no problem in interpreting at which direction does each feature has on the prediction of the model whether it has a positive effect or a negative effect which we 
could have a look at the coefficient values whether it's positive or negative or whether the value has high magnitude or low magnitude so high magnitude meaning it will have higher value for the coefficient and lower magnitude it will have lower value for the coefficient and so the force plot will beautifully show you that in this plot so the base value here is six so the base value here is minus 6.72 and we could see that all of the descriptor here are making the value lower okay and so different model will be using different features in different ways okay and so here we're going to see that the four feature are pushing the base value lower so it's having a negative effect toward the output value prediction and so for this particular model it is showing that all of the four descriptors are having a negative effect on the output value so it is pushing the value to be lower from the base value of -6.72 so for another data set using other algorithms it might be the case that some descriptors are pushing it higher some descriptors are pushing it lower okay okay so that's all for the testing on the 80 subset and now we're gonna use the 80 model and making a prediction on the hold out or the left out 20 subsets and so let's do that using the predict model code and so we're going to see that the performance on the test set or the 20 is 0.8671 and so let's have a look at some of the predicted output here so here are the label which is the predicted value and the log s are the experimental values okay so the prediction are pretty good right the actual value is minus 5.47 predicted to be minus 5.08 minus 2.18 predicted to be minus 1.9772 okay so if you're finding value in this video please give it a thumbs up subscribe if you haven't yet done so hit on the notification bell in order to be notified of the next video and as always the best way to learn data science is to do data science and please enjoy the journey thank you for watching please like subscribe and share and i'll see you in the next one but in the meantime please check out these videosso a couple of weeks ago i've shown you how you could use the python library called the pi carat which is a low code machine learning library in python and the example data set that i've shown you was the iris data set and so perhaps you're wondering what if you have your own data set and you would like to use that for your machine learning model building in pi karen so the advantage of using pi caret is that it allows you to quickly prototype machine learning models meaning that you could quickly generate machine learning models based on several algorithms for example if you're using classification or if you're using regression then you could pretty much use all of the learning algorithms that are available from scikit learn and other packages such as xt boost or cat boost and so in my daytime job as an associate professor bioinformatics i also do a lot of research and as part of doing research we have to explore various machine learning algorithms and traditionally if we were to custom code machine learning pipelines in python using scikit-learn that would pretty much take us maybe a couple of hours but by using pi carrier this could pretty much be compressed in only a couple of minutes and so in today's video i'm going to show you how you could quickly generate machine learning models in pi caret and so the data set that we're going to be using today which is particularly based on the molecular solubility and so the data set was originally published 