Bioinformatics Project from Scratch - Drug Discovery Part 5 (Compare Models)

**Part 1: Calculating Lipinski Descriptors**

In this part of the episode, we're going to discuss how you can calculate the Lipinski descriptors, which are used for evaluating the likelihood that a molecule is drug-like. These descriptors are an essential tool in chemoinformatics and provide valuable insights into the properties of molecules. To begin with, it's useful to understand what the Lipinski descriptors are and how they were developed. They were introduced by Christopher A. Lipinski, a medicinal chemist at Pfizer, whose 1997 analysis of orally active drugs produced the well-known "Rule of Five".

The Lipinski descriptors are based on the concept of "descriptors" in chemoinformatics: numerical values that can be calculated from molecular structures, used to describe the properties of molecules and to predict their potential biological activity. The Rule of Five itself was derived from an analysis of compounds that had advanced to clinical trials; it flags a molecule as likely to have poor oral absorption when it violates more than one of four simple cutoffs: molecular weight over 500 daltons, LogP over 5, more than 5 hydrogen-bond donors, or more than 10 hydrogen-bond acceptors.

To calculate the Lipinski descriptors, all we need are the molecular structures of the compounds, typically as SMILES strings; in this series, the compounds and their measured bioactivities come from the ChEMBL database. The descriptors can then be computed with an open-source cheminformatics toolkit such as RDKit, while the PaDEL-Descriptor software is used later in the series to compute molecular fingerprints.
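As a concrete illustration, Lipinski's Rule of Five can be checked in a few lines of plain Python once the four descriptor values are available (a toolkit such as RDKit can compute them from SMILES; the numeric values below are hypothetical stand-ins):

```python
def passes_rule_of_five(mw, logp, h_donors, h_acceptors):
    """Return True if a molecule violates at most one of Lipinski's rules.

    mw          -- molecular weight in daltons (rule: <= 500)
    logp        -- octanol/water partition coefficient (rule: <= 5)
    h_donors    -- hydrogen-bond donor count (rule: <= 5)
    h_acceptors -- hydrogen-bond acceptor count (rule: <= 10)
    """
    violations = sum([mw > 500, logp > 5, h_donors > 5, h_acceptors > 10])
    # Lipinski's original formulation tolerates a single violation.
    return violations <= 1

# Hypothetical descriptor values: a small polar molecule passes,
# a large lipophilic one (three violations) fails.
print(passes_rule_of_five(180.2, 1.3, 1, 4))
print(passes_rule_of_five(914.2, 7.5, 3, 13))
```

The single-violation tolerance is part of the original rule, so a function that simply requires all four cutoffs to hold would be stricter than Lipinski's formulation.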

After obtaining the Lipinski descriptors, we can perform some basic exploratory data analysis (EDA) on them. EDA involves visualizing and summarizing the distribution of the descriptors in order to gain insights into their properties and behavior, and can be done with simple box plots and scatter plots.

**Part 2: Exploratory Data Analysis**

In this part of the episode, we're going to discuss how you can perform exploratory data analysis (EDA) on the Lipinski descriptors. EDA is an essential step in chemoinformatics, as it allows us to understand the properties and behavior of molecular descriptors.

To begin with, we need to visualize the distribution of the Lipinski descriptors using box plots and scatter plots. These visualizations can help us identify patterns and trends in the data that might not be apparent from summary statistics such as the mean and standard deviation.

Box plots are particularly useful for visualizing the distribution of multiple variables, as they provide a sense of central tendency and variability. By plotting the box plots side by side, we can compare the distributions of different Lipinski descriptors to identify similarities and differences between them.

Scatter plots, on the other hand, visualize the relationship between two variables at a time and can help us understand how changes in one variable relate to another. For example, we might plot LogP (the measure of lipophilicity used in Lipinski's rules) against the molecular weight of each compound.

By analyzing the distribution and relationships between the Lipinski descriptors, we can gain insights into their properties and behavior that will be useful for downstream analysis and modeling tasks. This includes identifying outliers and extreme values, as well as understanding how different descriptors interact with each other.
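A minimal sketch of this kind of EDA with seaborn, assuming a pandas data frame with `MW` and `LogP` columns plus an activity `class` label (all column names and values here are hypothetical; adapt them to your own dataset):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data standing in for real Lipinski descriptors.
df = pd.DataFrame({
    "MW":    [180.2, 350.4, 512.7, 230.1, 480.9, 610.3],
    "LogP":  [1.3, 2.8, 5.6, 0.9, 4.2, 6.1],
    "class": ["active", "active", "inactive",
              "active", "inactive", "inactive"],
})

# Box plot: compare the MW distributions of actives vs inactives.
sns.boxplot(x="class", y="MW", data=df)
plt.savefig("boxplot_MW.png")
plt.clf()

# Scatter plot: relationship between molecular weight and LogP.
sns.scatterplot(x="MW", y="LogP", hue="class", data=df)
plt.savefig("scatter_MW_LogP.png")
```

The same two calls work unchanged on the real descriptor table; only the data frame and column names differ.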

**Part 3: Target Protein and Machine Learning**

In this part of the episode, we're going to discuss how you can use machine learning algorithms to predict biological activity based on molecular properties. We'll be using the acetylcholinesterase (AChE) protein as our target, which provides a larger dataset for modeling compared to some other targets.

To begin with, we need to calculate the molecular descriptors of the compounds in our dataset using the PaDEL-Descriptor software. PaDEL can compute both continuous descriptors (such as logP and polar surface area) and molecular fingerprints; in this series it generates 881 PubChem fingerprint features per compound, each a binary indicator of the presence or absence of a molecular substructure.

Once we have calculated the molecular descriptors, we can split our dataset into training and testing sets using an 80:20 ratio. This is essential for evaluating the performance of machine learning models, as it allows us to assess how well they generalize to unseen data.

Next, we'll use the pandas library to load the datasets and assign them to x and y variables. We'll also use the seaborn library to visualize the distribution of our molecular descriptors and understand their properties better.
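The loading, x/y assignment, and 80:20 split can be sketched as follows; the toy data frame below stands in for the real descriptor CSV (in the tutorial it would be read with `pd.read_csv`, with a file name of your choosing):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the real descriptor table.
df = pd.DataFrame({
    "feat1": range(10),
    "feat2": [v * 0.5 for v in range(10)],
    "pIC50": [5.1, 6.2, 4.8, 7.0, 5.5, 6.1, 4.9, 6.8, 5.3, 6.5],
})

x = df.drop("pIC50", axis=1)  # descriptor matrix
y = df["pIC50"]               # bioactivity target

# 80:20 train/test split; a fixed seed makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
print(x_train.shape, x_test.shape)
```

With 10 rows, the split yields 8 training and 2 test samples; on the real dataset the same call scales to thousands of compounds.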

**Part 4: Machine Learning Comparison**

In this part of the episode, we're going to compare several machine learning algorithms using the Lazy Predict library to predict biological activity based on molecular properties. The goal is to determine which algorithm performs best in predicting the biological activity of compounds with different molecular structures.

To begin with, we'll import the necessary libraries and load our dataset into x and y variables. We'll also split our dataset into training and testing sets using an 80:20 ratio.
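Before the split, the episode also removes low-variance features (this is what shrinks the feature set from 881 to 137 columns). A sketch with scikit-learn's `VarianceThreshold`; the `.8 * (1 - .8)` cutoff for binary fingerprint bits is an assumption here, chosen to drop bits that take the same value in more than 80% of compounds:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy binary fingerprint matrix: 6 compounds x 4 bits.
# Column 0 is constant and column 3 is nearly constant.
X = np.array([
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
])

# Keep only features whose variance exceeds .8 * (1 - .8) = 0.16,
# i.e. bits that actually vary across the dataset.
selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)
```

Here the constant and near-constant columns are dropped, leaving two informative bits; on the real fingerprint matrix the same line performs the 881-to-137 reduction.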

Next, we'll use the Lazy Predict library to build a model with every regression algorithm available in the library (39 in this run), including Random Forest, Gradient Boosting, LightGBM, and XGBoost.

For each model, we'll evaluate its performance using metrics such as R-squared values and RMSE. We'll also calculate the training time for each model to understand how computationally intensive they are.
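For a single model, the same two metrics can be computed directly with scikit-learn (the true and predicted values below are made up for illustration):

```python
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical observed and predicted pIC50 values.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.1, 5.9, 7.2, 7.8]

r2 = r2_score(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5  # square root of MSE

print(round(r2, 3), round(rmse, 3))
```

R-squared near 1 and a small RMSE (here 0.98 and about 0.158) indicate predictions close to the observed values; LazyPredict reports the same metrics for every model it fits.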

Finally, we'll visualize our results using bar plots and scatter plots to compare the performance of different machine learning algorithms. By comparing these results, we can determine which algorithm performs best in predicting biological activity based on molecular properties.
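A minimal sketch of the comparison plot, assuming a `models` data frame shaped like the one LazyPredict returns (the algorithm names match real estimators, but the R-squared values here are made up):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up results table in the shape LazyPredict produces.
models = pd.DataFrame(
    {"R-Squared": [0.57, 0.56, 0.56, 0.52]},
    index=["LGBMRegressor", "HistGradientBoostingRegressor",
           "RandomForestRegressor", "XGBRegressor"],
)

# Horizontal bar plot: one bar per algorithm, longer bar = better fit.
sns.barplot(x=models["R-Squared"], y=models.index.tolist())
plt.xlabel("R-squared (test set)")
plt.tight_layout()
plt.savefig("model_comparison.png")
```

The same pattern works for the RMSE and training-time columns; just swap the column passed to `x`.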

By completing this episode, you should now have a good understanding of how to calculate Lipinski descriptors, perform exploratory data analysis, use machine learning algorithms to predict biological activity, and compare the performance of different models using Lazy Predict.

"WEBVTTKind: captionsLanguage: enwelcome back to part 5 of the bioinformatics project from scratch series where i show you how you could build your own computational drug discovery model using the machine learning algorithm in today's episode i will be showing you how you could compare several machine learning algorithms for building regression models of the acetylcholine esterase inhibitors and today we're going to be using a lazy and efficient way of building several machine learning algorithms and this was shown in a recent video using the lazy predict python library and so we're going to be using that for today's tutorial and before proceeding further let's do a quick recap so in part one i have shown you how you could collect original data set in biology that you could use in your own data science project particularly i have demonstrated to you how you could download and pre-process the biological activity data from the chambo database and the dataset is comprised of compounds and molecules that have been biologically tested for their activity toward the target organism or protein of interest then in part two i have shown to you how you could calculate the lapinski descriptors which are descriptors used for evaluating the likelihood of being a drug-like molecule and then i've shown to you how you could perform some basic exploratory data analysis on these lipinski descriptors particularly the eda are based on making simple box plot and scatter plot in order to visualize the differences of the active and inactive subset of the compound in part 3 i have made some changes to the target protein and then we're using the acetylcholine esterase as it provides a larger data set to work with and so in this part we have already computed the molecular descriptors using the paddle descriptor software and then we prepared the data set comprising of the x and y data frames and then we use that to build a prediction model in the subsequent parts which is part four where we 
use the descriptors generated from part three in order to build a regression model using the random forest algorithm and now to today's episode let's get started so here we're going to be comparing several machine learning algorithms using the lazy predict library and so the first thing that you need to do is install the lazy predict and so in a prior video i've shown you how you could use the lazy predict to do a quick and rapid model building of classification and also regression models in just a few lines of code and so let's start by installing the library okay and so we have already installed it and then we're going to be importing the necessary libraries and so here we're using the pandas seaborne and also the second learn library specifically we're importing the train test split function and then we're going to be importing the lazy predict and also the lazy regressor function and so now we're going to be loading up the data set and we're going to be directly downloading it from the github of data professor and so the links is here wget to download it and now we're going to be reading in the file and then we're going to be assigning it to the df data frame then we're going to be splitting it up into the x and y variables and let's take a look at the dimension of the x variable and so here we see that it has a total of 4 695 rolls or the number of compounds in the data set and it has a total of 881 descriptors or the features or the number of columns and so the first thing that we need to do is we're going to be removing the low variance features so those that have low variance and let's take a look at the dimension of the data set again and so we have a reduced subset from 881 to be 137 variables now we're going to be performing a data split using the 80 20 ratio all right now comes the fun part so as you can see here we're going to be building more than 20 machine learning models and so we're using only two lines of code here so the first one is like any 
other scikit-learn functions for building the model is to assign the machine learning algorithm into a classifier variable and then we're going to be assigning the results from the prediction after we built the model and then we're assigning it to the train and test variables so the train and test variables here will be containing the performance of the model's prediction and so let's build the model so here it has 39 models 39 machine learning algorithms so this might take some time because the data is relatively big at almost 5 000 rows and so it should be noted here that model building is using default parameters for all of the 39 algorithms used and so if you want to perform hyper parameter optimization that will be a topic for another video right and so models have been built and let's have a look at the train okay so lg bm is the best model here so from our prior tutorials random forest was used for the model building and so here it had slightly better performance let's have a look at the test set lgbm regressor random forest also at third place here but the thing is they're roughly the same okay 0.57 and 0.56 let's have a look at the data visualization of the model performance and so the bar plot of the r-squared values is provided here and we're going to have a look at the rmse values here and then we're also going to have look at the calculation time provided here so the longer the bars become the longer it takes to build the model all right and so congratulations we have already built several machine learning models for comparison and if you're finding value in this video please give it a thumbs up subscribe if you haven't yet done so hit on the notification bell in order to be notified of the next video and as always the best way to learn data science is to do data science and please enjoy the journeywelcome back to part 5 of the bioinformatics project from scratch series where i show you how you could build your own computational drug discovery model 
using the machine learning algorithm in today's episode i will be showing you how you could compare several machine learning algorithms for building regression models of the acetylcholine esterase inhibitors and today we're going to be using a lazy and efficient way of building several machine learning algorithms and this was shown in a recent video using the lazy predict python library and so we're going to be using that for today's tutorial and before proceeding further let's do a quick recap so in part one i have shown you how you could collect original data set in biology that you could use in your own data science project particularly i have demonstrated to you how you could download and pre-process the biological activity data from the chambo database and the dataset is comprised of compounds and molecules that have been biologically tested for their activity toward the target organism or protein of interest then in part two i have shown to you how you could calculate the lapinski descriptors which are descriptors used for evaluating the likelihood of being a drug-like molecule and then i've shown to you how you could perform some basic exploratory data analysis on these lipinski descriptors particularly the eda are based on making simple box plot and scatter plot in order to visualize the differences of the active and inactive subset of the compound in part 3 i have made some changes to the target protein and then we're using the acetylcholine esterase as it provides a larger data set to work with and so in this part we have already computed the molecular descriptors using the paddle descriptor software and then we prepared the data set comprising of the x and y data frames and then we use that to build a prediction model in the subsequent parts which is part four where we use the descriptors generated from part three in order to build a regression model using the random forest algorithm and now to today's episode let's get started so here we're going to be 
comparing several machine learning algorithms using the lazy predict library and so the first thing that you need to do is install the lazy predict and so in a prior video i've shown you how you could use the lazy predict to do a quick and rapid model building of classification and also regression models in just a few lines of code and so let's start by installing the library okay and so we have already installed it and then we're going to be importing the necessary libraries and so here we're using the pandas seaborne and also the second learn library specifically we're importing the train test split function and then we're going to be importing the lazy predict and also the lazy regressor function and so now we're going to be loading up the data set and we're going to be directly downloading it from the github of data professor and so the links is here wget to download it and now we're going to be reading in the file and then we're going to be assigning it to the df data frame then we're going to be splitting it up into the x and y variables and let's take a look at the dimension of the x variable and so here we see that it has a total of 4 695 rolls or the number of compounds in the data set and it has a total of 881 descriptors or the features or the number of columns and so the first thing that we need to do is we're going to be removing the low variance features so those that have low variance and let's take a look at the dimension of the data set again and so we have a reduced subset from 881 to be 137 variables now we're going to be performing a data split using the 80 20 ratio all right now comes the fun part so as you can see here we're going to be building more than 20 machine learning models and so we're using only two lines of code here so the first one is like any other scikit-learn functions for building the model is to assign the machine learning algorithm into a classifier variable and then we're going to be assigning the results from the 
prediction after we built the model and then we're assigning it to the train and test variables so the train and test variables here will be containing the performance of the model's prediction and so let's build the model so here it has 39 models 39 machine learning algorithms so this might take some time because the data is relatively big at almost 5 000 rows and so it should be noted here that model building is using default parameters for all of the 39 algorithms used and so if you want to perform hyper parameter optimization that will be a topic for another video right and so models have been built and let's have a look at the train okay so lg bm is the best model here so from our prior tutorials random forest was used for the model building and so here it had slightly better performance let's have a look at the test set lgbm regressor random forest also at third place here but the thing is they're roughly the same okay 0.57 and 0.56 let's have a look at the data visualization of the model performance and so the bar plot of the r-squared values is provided here and we're going to have a look at the rmse values here and then we're also going to have look at the calculation time provided here so the longer the bars become the longer it takes to build the model all right and so congratulations we have already built several machine learning models for comparison and if you're finding value in this video please give it a thumbs up subscribe if you haven't yet done so hit on the notification bell in order to be notified of the next video and as always the best way to learn data science is to do data science and please enjoy the journey\n"