Bioinformatics Project from Scratch - Drug Discovery Part 3 (Dataset Preparation)
**Navigating the Descriptors Output CSV**
As we work with molecular descriptors, it's essential to have a clear understanding of what each column represents and how to manipulate them effectively. In this article, we'll delve into the details of navigating the descriptors output CSV file.
When working with large datasets, it's common to encounter files that are difficult to read or understand at first glance. The descriptors output CSV file is one such example. With a whopping 4695 rows and columns, it can be daunting to know where to start. However, by breaking down the process into manageable steps, we can unlock the secrets of this data and make sense of its contents.
**Stopping and Downloading the File**
To begin with, let's stop the current script and download the computed file directly from GitHub. We'll head over to the Data Professor GitHub page, navigate to the data repository, and scroll down to find the scripter's output dot CSV file. Clicking on it will allow us to right-click on the download button and save it as a CSV file. By changing the save type to all files and saving it to our desktop, we can access the file easily.
**Loading the File into Python**
Once we've downloaded the file, let's load it into our Python environment using pandas. We'll import the necessary libraries, including pandas and matplotlib, to perform data manipulation and visualization. With the file loaded, we can now focus on cleaning and preparing the data for analysis.
**Preparing the Data**
The first step in preparing the data is to delete any unnecessary columns or rows that don't add value to our analysis. In this case, we'll drop the first column, which represents the molecular name. By using the pandas `drop` function, we can remove this column and assign it back to our DataFrame for further manipulation.
Next, let's create a separate data matrix for the Y-values, which represent the PSC50 values. We'll use the PhD 50 column directly from the f3 data frame and assign it to a new variable in our DataFrame. By doing so, we can keep our code organized and focused on the task at hand.
**Combining X and Y Data**
Now that we have our X and Y data matrices prepared, let's combine them into a single data frame. This will allow us to perform further analysis and modeling using our molecular descriptors. We'll use pandas' `concat` function to concatenate our X and Y data frames, ensuring that they're properly aligned for analysis.
**Outputting the Data Frame**
Finally, we'll output our combined data frame into a CSV file, which we can upload to GitHub for sharing with others. By using pandas' `to_csv` function, we can specify a clear and descriptive name for our file, making it easy to identify and understand its contents.
The resulting CSV file will contain all the molecular descriptors, including Pupkin fingerprints, as well as the corresponding PSC50 values. This will enable us to perform model building and analysis using our molecular descriptors, providing valuable insights into their relationships with biological activity.
**Conclusion**
In this article, we've explored the process of navigating the descriptors output CSV file, from stopping and downloading the file to preparing and analyzing its contents. By breaking down the process into manageable steps and using pandas to manipulate and analyze the data, we can unlock the secrets of our molecular descriptors and make sense of their relationships with biological activity. Whether you're a seasoned data scientist or just starting out, this article provides valuable insights into working with large datasets and performing effective data analysis.
"WEBVTTKind: captionsLanguage: enwelcome back if you new here my name is Tina Natasha Hamad and I'm an associate professor of bioinformatics and this is the data professor YouTube channel okay so in this video I'm going to continue with the part three of the bioinformatics project series where I go through how you can implement a bioinformatics project from scratch so a short recap in part one I showed you how you can retrieve the bioactivity data directly from the Tembo database followed by a quick data pre-processing in part two I've shown you how you can calculate the Lipinsky descriptor and perform exploratory data analysis and this video is part three where I'm going to show you how you can calculate molecular descriptors followed by preparing the data set that we will be using for the next part which is part four and in part four we're going to do some model building and so without further ado let's get started okay so the first thing that you want to do is head over to the github of the data professor and click on the code repository scroll down and click on Python scroll down and notice that I've created three additional files which has the prefix of acetylcholine esterase which is the name of the target protein that our research group have previously published and the great thing about this target protein is that there are an abundance of bio activity data and therefore it will be a great starting point for model building so essentially what I have done is changed the name of the target protein in part 1 and then perform all of the code cells and did the same thing with part 2 by taking in the output from part 1 and then I finally exported the files from part two and then usage for part 3 which is today and so the export of data from part 1 and part 2 will be provided on the github of the data professor so it will be provided here in the data directory so you'll be noticing that there are six additional files containing the name acetylcholine esterase 0 1 until 0 6 okay so let's download the part three assets of choline esterase so you could click on it and then you could right click on the raw link and then save it into your computer because I already have it in my Google collab so I'm gonna use this one okay so let's begin so you're going to see here that in this part three we will be calculating molecular descriptors and then we're going to prepare the data set which will be used for the next part part four and in part four we're going to perform some model building so we are going to need to download the paddle descriptor software which is provided on the github of data professor and I'm going to provide the link to the original website of the developers of the paddle descriptors along with the link to the original research paper okay so the paddle dot zip file has been downloaded along with the paddle SH file which is the shell script file containing instructions on how to run the paddle calculation because here we're gonna use paddle to calculate the molecular descriptors okay so we're gonna unzip the folder okay and so we're going to download the acetylcholine esterase 0-4 file which is containing the Piz fifty along with the three bio activity class all right and so we're gonna import pandas as PD and we're going to import the CSV file and assign it the DF three variable name so let's have a look at this data frame okay so we're gonna select only the canonical smiles column along with the molecule temple ID column and we're gonna put it in the selection variable and then we're going to subset the data by using the f3 bracket selection which contains the name of the precise columns that we want it and then we're gonna assign the name of DF three underscore selection and then we're gonna save it as molecule SMI let's run it let's have a look at the file using the bash okay so it contains the smiles notation here and the name of the molecule so a quick recap these smiles notation represents the chemical information that pertain to the chemical structure so the C here represents the carbon atom the O represents the oxygen the N represents 9 to 10 okay and so let's continue and so in this line we're gonna see how many lines of molecules do we have and we have 4695 which matches this number here for 695 so we wanted to check that all of the rows are coming in the molecule SMI file ok and so we're gonna perform the descriptor calculation by running bash paddle each okay so maybe you're wondering what is inside the paddle file let's have a look paddle dot Sh so it contains the instruction so we're gonna use Java and then we're gonna use 1 gigabyte of memory and we're gonna use this option because we don't have a display on the Google code lab so we're gonna use the java.awt headless equal to true and then we're gonna specify the dash jar function because we're gonna use the pedo descriptor dot jar file and then here we're gonna remove the salt and the salts are the sodium and the chloride which are in the chemical structure and so this program will automatically remove all salts and also small organic acid from the chemical structure so if that sounded gibberish then so it essentially means that we are cleaning the chemical structure so that there are no impurity okay and here are the other options pertaining to how we also clean the chemical structure and then this option tells the program that we're gonna compute the molecular fingerprint and the fingerprint type will be pop chem fingerprint okay and so finally it will output the descriptors into the file called descriptors output CSV and let's run in some molecules are taking five seconds some are 0.6 okay somewhere feel very quick 0.3 because we have 4695 it's going to take you about 18 to 19 minutes to complete so why don't we just stop this and then we can download directly the computed file which is here descriptors output CSV so why don't you head over to the data professor github click on the data repository and then you will see this page and then scroll down and find the scripters output dot CSV click on it and then right click on the download button and save link as and then download it into your computer so we're gonna save it into the desktop and I'm gonna change the save as type to be all files and I'm gonna change txt to be CSV so it's automatically opening it for me okay so we're gonna go back to this notebook and then as you see it's barely up to about 200 so I'm gonna stop this if it allows me to alright so it stopped so let's see if there is any generated file so it has generated some output and so I'm going to delete this because I'm gonna upload the completed version so here I'm going to click on the upload and then this is the desktop and I'm gonna click on the descriptors output dot CSV okay and it's currently uploading wait one moment so let's list the files so okay so it's uploading notice the file science is increasing that's a good sign wait one moment it's fairly big file almost hang in there alright so it's finished so let's have a look again so it's about eight point three megabytes and so we're gonna read the descriptors output into DF three and their score EPS and so here we're gonna prepare the X&Y data matrices and the FS data matrix will comprise of the molecular descriptors which are the Pupkin fingerprints so let's have a look and so we're gonna delete the first column here the name because we want only the molecular features so let's drop it using the dot drop function and the name of the column that we want it to be dropped and then we see that the name column has been dropped and then we reassign it back to DF three underscore X alright so now let's create the Y data matrix and here we're gonna take the PhD 50 column directly from the f3 data frame which is the initially loaded data frame and then we're gonna assign it to the DF 3 underscore Y and then here we're gonna combine x and y together so this really depends but for portability we're gonna combine it and then we're gonna output it into a CSV file which we will be uploading to the github and then we're going to use that for part okay and so here we're gonna output this data set three data frame into a CSV file and notice that the name here is fairly long and the purpose for having such a long name is to allow less confusion and to allow us to easily see what is the purpose of this file and so the first segment of this name is the name of the target protein which is the acetylcholine esterase zero six is just the sequential order number and so bioactivity data three class PhD fifty essentially tells us that it contains the bio activity data information along with three categorical class comprising of active inactive and intermediate and it also contains the PSC fifty values and in the last segment here is pop chem FP signifying that it contains the Pupkin fingerprints so this will become handy when we have more than one fingerprint type and so paddle allows you to compute more than ten different fingerprint types and so let's make it your homework to try to compute other different fingerprint types so you want to play around with the options and see what other molecular fingerprints are available and then you could rename the file accordingly all right so let's see if we have already written out the file and it's right here okay and then we're gonna write out the file so let's run the code cell here and let's save it into our computer all right so it's finished and it's gonna open up for us to see all right let's see so these are the fingerprints of pop chem and the last column is the P ic50 so we're gonna use this file to perform model building in the next episode so please stay tuned to that so support this channel by smashing the like button subscribe if you haven't yet done so and click on the notification bell in order to be notified of the next video and if you have come this far in the video please give yourself a big clap and comment down below that you have watched until the end and big kudos to you guys and as always the best way to learn data science is to do data science and please enjoy the journey thank you for watching please like subscribe and share and I'll see you in the next one but in the meantime please check out these videoswelcome back if you new here my name is Tina Natasha Hamad and I'm an associate professor of bioinformatics and this is the data professor YouTube channel okay so in this video I'm going to continue with the part three of the bioinformatics project series where I go through how you can implement a bioinformatics project from scratch so a short recap in part one I showed you how you can retrieve the bioactivity data directly from the Tembo database followed by a quick data pre-processing in part two I've shown you how you can calculate the Lipinsky descriptor and perform exploratory data analysis and this video is part three where I'm going to show you how you can calculate molecular descriptors followed by preparing the data set that we will be using for the next part which is part four and in part four we're going to do some model building and so without further ado let's get started okay so the first thing that you want to do is head over to the github of the data professor and click on the code repository scroll down and click on Python scroll down and notice that I've created three additional files which has the prefix of acetylcholine esterase which is the name of the target protein that our research group have previously published and the great thing about this target protein is that there are an abundance of bio activity data and therefore it will be a great starting point for model building so essentially what I have done is changed the name of the target protein in part 1 and then perform all of the code cells and did the same thing with part 2 by taking in the output from part 1 and then I finally exported the files from part two and then usage for part 3 which is today and so the export of data from part 1 and part 2 will be provided on the github of the data professor so it will be provided here in the data directory so you'll be noticing that there are six additional files containing the name acetylcholine esterase 0 1 until 0 6 okay so let's download the part three assets of choline esterase so you could click on it and then you could right click on the raw link and then save it into your computer because I already have it in my Google collab so I'm gonna use this one okay so let's begin so you're going to see here that in this part three we will be calculating molecular descriptors and then we're going to prepare the data set which will be used for the next part part four and in part four we're going to perform some model building so we are going to need to download the paddle descriptor software which is provided on the github of data professor and I'm going to provide the link to the original website of the developers of the paddle descriptors along with the link to the original research paper okay so the paddle dot zip file has been downloaded along with the paddle SH file which is the shell script file containing instructions on how to run the paddle calculation because here we're gonna use paddle to calculate the molecular descriptors okay so we're gonna unzip the folder okay and so we're going to download the acetylcholine esterase 0-4 file which is containing the Piz fifty along with the three bio activity class all right and so we're gonna import pandas as PD and we're going to import the CSV file and assign it the DF three variable name so let's have a look at this data frame okay so we're gonna select only the canonical smiles column along with the molecule temple ID column and we're gonna put it in the selection variable and then we're going to subset the data by using the f3 bracket selection which contains the name of the precise columns that we want it and then we're gonna assign the name of DF three underscore selection and then we're gonna save it as molecule SMI let's run it let's have a look at the file using the bash okay so it contains the smiles notation here and the name of the molecule so a quick recap these smiles notation represents the chemical information that pertain to the chemical structure so the C here represents the carbon atom the O represents the oxygen the N represents 9 to 10 okay and so let's continue and so in this line we're gonna see how many lines of molecules do we have and we have 4695 which matches this number here for 695 so we wanted to check that all of the rows are coming in the molecule SMI file ok and so we're gonna perform the descriptor calculation by running bash paddle each okay so maybe you're wondering what is inside the paddle file let's have a look paddle dot Sh so it contains the instruction so we're gonna use Java and then we're gonna use 1 gigabyte of memory and we're gonna use this option because we don't have a display on the Google code lab so we're gonna use the java.awt headless equal to true and then we're gonna specify the dash jar function because we're gonna use the pedo descriptor dot jar file and then here we're gonna remove the salt and the salts are the sodium and the chloride which are in the chemical structure and so this program will automatically remove all salts and also small organic acid from the chemical structure so if that sounded gibberish then so it essentially means that we are cleaning the chemical structure so that there are no impurity okay and here are the other options pertaining to how we also clean the chemical structure and then this option tells the program that we're gonna compute the molecular fingerprint and the fingerprint type will be pop chem fingerprint okay and so finally it will output the descriptors into the file called descriptors output CSV and let's run in some molecules are taking five seconds some are 0.6 okay somewhere feel very quick 0.3 because we have 4695 it's going to take you about 18 to 19 minutes to complete so why don't we just stop this and then we can download directly the computed file which is here descriptors output CSV so why don't you head over to the data professor github click on the data repository and then you will see this page and then scroll down and find the scripters output dot CSV click on it and then right click on the download button and save link as and then download it into your computer so we're gonna save it into the desktop and I'm gonna change the save as type to be all files and I'm gonna change txt to be CSV so it's automatically opening it for me okay so we're gonna go back to this notebook and then as you see it's barely up to about 200 so I'm gonna stop this if it allows me to alright so it stopped so let's see if there is any generated file so it has generated some output and so I'm going to delete this because I'm gonna upload the completed version so here I'm going to click on the upload and then this is the desktop and I'm gonna click on the descriptors output dot CSV okay and it's currently uploading wait one moment so let's list the files so okay so it's uploading notice the file science is increasing that's a good sign wait one moment it's fairly big file almost hang in there alright so it's finished so let's have a look again so it's about eight point three megabytes and so we're gonna read the descriptors output into DF three and their score EPS and so here we're gonna prepare the X&Y data matrices and the FS data matrix will comprise of the molecular descriptors which are the Pupkin fingerprints so let's have a look and so we're gonna delete the first column here the name because we want only the molecular features so let's drop it using the dot drop function and the name of the column that we want it to be dropped and then we see that the name column has been dropped and then we reassign it back to DF three underscore X alright so now let's create the Y data matrix and here we're gonna take the PhD 50 column directly from the f3 data frame which is the initially loaded data frame and then we're gonna assign it to the DF 3 underscore Y and then here we're gonna combine x and y together so this really depends but for portability we're gonna combine it and then we're gonna output it into a CSV file which we will be uploading to the github and then we're going to use that for part okay and so here we're gonna output this data set three data frame into a CSV file and notice that the name here is fairly long and the purpose for having such a long name is to allow less confusion and to allow us to easily see what is the purpose of this file and so the first segment of this name is the name of the target protein which is the acetylcholine esterase zero six is just the sequential order number and so bioactivity data three class PhD fifty essentially tells us that it contains the bio activity data information along with three categorical class comprising of active inactive and intermediate and it also contains the PSC fifty values and in the last segment here is pop chem FP signifying that it contains the Pupkin fingerprints so this will become handy when we have more than one fingerprint type and so paddle allows you to compute more than ten different fingerprint types and so let's make it your homework to try to compute other different fingerprint types so you want to play around with the options and see what other molecular fingerprints are available and then you could rename the file accordingly all right so let's see if we have already written out the file and it's right here okay and then we're gonna write out the file so let's run the code cell here and let's save it into our computer all right so it's finished and it's gonna open up for us to see all right let's see so these are the fingerprints of pop chem and the last column is the P ic50 so we're gonna use this file to perform model building in the next episode so please stay tuned to that so support this channel by smashing the like button subscribe if you haven't yet done so and click on the notification bell in order to be notified of the next video and if you have come this far in the video please give yourself a big clap and comment down below that you have watched until the end and big kudos to you guys and as always the best way to learn data science is to do data science and please enjoy the journey thank you for watching please like subscribe and share and I'll see you in the next one but in the meantime please check out these videos\n"