How to build a Bioinformatics web app (Molecular Descriptor Calculator) in Python | Streamlit #21

The Molecular Descriptor Calculator Web Application

So if you see an error such as the one that you saw just a moment ago, it means that the app cannot find the molecule ID or the SMILES column, so make sure to specify the proper names in these two text boxes in order for the web app to work properly. All right, number three is on lines 59 until 74. It is the selection of the molecular fingerprints and, as mentioned already, we are able to select 12 different fingerprints. Right here, "Set number of molecules to compute" first has to read in the input file. So we're going to read it in, then determine the shape, and then take the first value.

When we take the shape of df0 here, which reads in the CSV file, we want to figure out how many rows it has, so index zero retrieves the number of rows in the CSV file. In this example it has 4,695 rows, and that is displayed as the maximum value. So the all_mol variable has a value of 4,695, and therefore in this slider function we specify the maximum value to be all_mol, which is 4,695. The minimum value is 10, the step size is 10, and the default value is 10. So when we load the web app for the first time, the slider shows the default value of 10, right here.
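
The slider settings described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the helper names count_rows and slider_bounds are made up for this example, the row count is taken with the standard csv module instead of a pandas shape, and Streamlit is imported only inside the last function so that the two pure helpers can be tried without it.

```python
import csv
import io


def count_rows(csv_text: str) -> int:
    """Number of data rows in a CSV (header excluded), the analogue of df0.shape[0]."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if row)


def slider_bounds(n_rows: int, step: int = 10) -> dict:
    """Settings narrated in the video: minimum 10, maximum = row count, default 10, step 10."""
    return {"min_value": step, "max_value": n_rows, "value": step, "step": step}


def molecule_slider(csv_text: str) -> int:
    """Render the sidebar slider and return the chosen number of molecules."""
    import streamlit as st  # imported lazily; only this function needs Streamlit

    all_mol = count_rows(csv_text)
    return st.sidebar.slider("Set number of molecules to compute", **slider_bounds(all_mol))
```

With the 4,695-row example file, slider_bounds would give a maximum of 4695. One thing this sketch does not handle is an input with fewer than 10 rows, where the minimum would exceed the maximum.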

If you slide it a bit, you can also click on it and then use the arrow keys to select the number of molecules to use. Then just upload your data or press the example dataset button here. On lines 82 until 97 we have the if condition, and on lines 99 until 117 we have the else condition.

Line number 82 evaluates whether we have already uploaded a file, the CSV file here in number one. If we haven't uploaded a file, let me show you, the app displays the else branch. So if we haven't yet uploaded a file, we automatically go to the else condition, and the first thing that we see is line number 100, which is st.info.

It shows a blue box here, "Awaiting CSV file to be uploaded", and then you see a button here, the st.button "Press to use Example Dataset". Notice that this if is nested inside the else. So if the button is clicked, it triggers the statements underneath it.

It reads in the example file, formats it for the PaDEL-Descriptor calculation, and performs the calculation. After the calculation it reads in the descriptors, and finally it provides them as a CSV file to download and prints out the details of the descriptors. On the other hand, we can upload a file. Let me show you here: not this one, but if we go to the streamlit directory and then to the moldesc folder, we can upload this example file.

Just note that this file and the example file that we used here are the same thing, so it has 4,695 rows. I just drag and drop it here, as you see, and then you see the name of the file appearing here and right here. The selected fingerprint is AtomPairs2D, the number of molecules to be calculated is 10, and we have generated 780 fingerprints.
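
The two branches walked through above can be sketched like this. It is a hedged outline of the control flow only: pick_source, run_app, and the file name example_input.csv are hypothetical names chosen for this example, and the widget labels are reconstructed from the narration.

```python
from typing import Any, Optional

EXAMPLE_CSV = "example_input.csv"  # hypothetical name; the video does not spell out the file name


def pick_source(uploaded_file: Optional[Any], example_clicked: bool) -> Optional[Any]:
    """Return what to process: the upload, the example file, or nothing yet."""
    if uploaded_file is not None:  # "if" branch: a CSV file was uploaded
        return uploaded_file
    if example_clicked:            # the button's if, nested inside the "else" branch
        return EXAMPLE_CSV
    return None                    # still waiting for input


def run_app() -> None:
    """Streamlit wiring; Streamlit is imported lazily so pick_source can be tried alone."""
    import streamlit as st

    uploaded_file = st.sidebar.file_uploader("Upload your input CSV file", type=["csv"])
    example_clicked = False
    if uploaded_file is None:
        st.info("Awaiting CSV file to be uploaded.")
        example_clicked = st.button("Press to use Example Dataset")

    source = pick_source(uploaded_file, example_clicked)
    if source is not None:
        st.write("Input selected:", getattr(source, "name", source))
        # Next steps in the video: read the file, format it for PaDEL-Descriptor,
        # run the calculation, show the descriptors, and offer them as a CSV download.
```

Keeping the decision in a small pure function is a choice made for this sketch, so the branching can be checked without starting Streamlit.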

So again, if a file is uploaded, the app runs the if block here; otherwise, if no CSV file is uploaded, it runs the else branch and you see the blue box saying that it is awaiting the CSV file. This web application allows those who are interested in bioinformatics to calculate molecular properties for compounds of interest, which they can use in drug discovery projects. In particular, they can use this application to calculate molecular fingerprints of molecules and then use those descriptors or fingerprints to build machine learning models.

Such models can predict the biological activity of compounds, or chemical properties such as the melting point or the boiling point of a molecule, based on the chemical structure. All right, there you have it: the molecular descriptor calculator. I hope that this video was helpful to you. Please support the channel by smashing the like button, subscribing if you haven't already, and sharing the video with your friends, and don't forget to hit the notification bell so that you'll be notified of the next video. As always, the best way to learn data science is to do data science, and please enjoy the journey.

In this video we're going to build a bioinformatics-related web application. In particular, we will build a molecular descriptor application that calculates the molecular properties of molecules, which can afterwards be used with machine learning to build predictive models of biological and chemical properties. These are useful for computational drug discovery and other applications. So, without further ado, we're starting right now.

The first thing that you want to do is fire up your terminal and go to the directory that contains the files. I have it on the desktop, in the streamlit directory, in the moldesc folder. Let's have a look at the contents. Here we have the PaDEL-Descriptor folder, the example CSV file that contains the SMILES notation we need in order to calculate the molecular descriptors, and app.py, which is the web application. I'll first activate conda; my environment is called dataprofessor, so make sure to activate your own environment. Then, in order to run the app, type streamlit run app.py.

This is what the application looks like. You have the sidebar here, which lets you upload your file and specify the names of the columns from which we retrieve the data. From the input file we retrieve two particular columns: the molecule ChEMBL ID and the canonical SMILES.

Let me show you the moldesc directory. Here you have the folder that contains the PaDEL-Descriptor software, and this folder is downloaded directly from this link. The main part of the software is the JAR file, and the parameter files are the XML files. There is an XML file for each type of molecular descriptor that we want to calculate; in our case, we specify 12 different types of molecular fingerprints. If you take a look at an XML file, you'll see that the values are all false, and they are true only for the descriptor that we want to calculate. So in the PubChem fingerprint file the only value that is true is the PubChem fingerprint and the rest are false. It's the same for the substructure fingerprint count: everything has a value of false except the substructure fingerprint count, which has a value of true.

app.py is the Streamlit app that we're showing you today, and the file here with the long name is the input file. This file was taken from the Bioinformatics from Scratch series, and it is also available on the Data Professor GitHub in the data folder. We retrieve two columns from it: the molecule ChEMBL ID, which contains the chemical identification number of the molecule, and the canonical SMILES notation, which contains information on the chemical structure. So these are the SMILES notations, and they describe the chemical structure of each of these molecules. Each row represents a unique molecule, unless there are duplicates, which we would have to remove. In today's video I'll show you how you can convert each of these entries into molecular fingerprints that we can use for predictive model building.

Now that we have covered the second part of the sidebar, the third part lets the user configure which molecular fingerprint they want to calculate. We have a choice of 12 different fingerprints: AtomPairs2D, AtomPairs2DCount, CDK, CDK extended, CDK graph only, EState, KlekotaRoth, KlekotaRothCount, MACCS, PubChem, Substructure, and SubstructureCount. Each is accompanied by the corresponding XML file that I talked about previously, so selecting any of these options triggers the use of its XML file.

Let's see the application in action by pressing "Press to use Example Dataset". Upon loading the example dataset, we see the first 10 lines here. The first column is the molecule ID. The second column is the canonical SMILES, which tells us about the chemical structure. The third column is the bioactivity class, which is based on the IC50 values: values of less than 1,000 are active, values greater than 10,000 are inactive, and values between 1,000 and 10,000 are considered intermediate. You could use this class as a y variable for multi-class classification. The other columns are the descriptors pertaining to Lipinski's rule of five, which essentially are parameters that determine whether each compound is drug-like or not. pIC50 is the negative logarithmic transformation of the initial IC50 values, and the reason for transforming IC50 into pIC50 is to give it a more uniform distribution.

From this initial data frame we select two columns, the canonical SMILES and the molecule ChEMBL ID. Notice that we reorder the columns a bit: the canonical SMILES becomes the first column, followed by the molecule ChEMBL ID, because the PaDEL-Descriptor software reads the input file in this particular format. After writing out this two-column data frame into a CSV file, we use it as input to the PaDEL-Descriptor software, which generates the molecular descriptors, in our case the molecular fingerprints.

As you can see, we have selected Substructure and specified that we want to compute the molecular fingerprint for 10 molecules, so the first 10 molecules are subjected to the fingerprint calculation. Feel free to change this value, and the descriptors will be adjusted accordingly. Here we set it to 70, so the first 70 molecules are used for the calculation. The details are shown here as well: the selected fingerprint is Substructure, we have 70 molecules, and we have a total of 307 descriptors. You can download this as a CSV file, and here is the generated CSV file, which we can subsequently use for model building. Don't forget to drop the first column from your data frame when you're preparing it for model building with scikit-learn. Notice also that there is no pIC50 included here, so you will have to specify the pIC50 as the y variable, and all of these 307 fingerprints will be the X variables. That is how you would use it for machine learning model building.

All right, now that we have a high-level understanding of how this molecular descriptor web application works, let's look under the hood at the code. Let's open up the app.py file. The first five lines import the necessary libraries. streamlit lets us build the web application, pandas lets us display the data frame right here, and subprocess lets us perform the molecular fingerprint calculation using PaDEL-Descriptor, because normally we would have to run it on the bash command line. You can see that we're using java, because the PaDEL-Descriptor software is a JAR file, which is a Java file. We specify the parameters to use two gigabytes of RAM for the calculation, along with the parameters needed to run it from the command line. We also specify additional parameters, such as removing salt from the molecular structure, standardizing the nitro functional group, and using fingerprints. The descriptor types are specified here with %s, and %s is replaced by the selected fingerprint. So in the drop-down, if we select Substructure, then the substructure XML file is put here; if we select another one, like EState, then the EState XML file is put here. Recall that the PaDEL-Descriptor folder has several XML files, such as PubChem, Substructure, or EState, so if we select EState, then the EState fingerprint XML file replaces the %s right here.

So we have already discussed this line, line number 10. Lines 8 through 24 are a custom function that performs the molecular descriptor calculation. The precise command that we would normally run on the command line is this one, and we assign it to the bash command variable. In order to run it from Python, we use the subprocess library and its Popen function, pass the bash command as the input argument, and pipe the result into output and error. After that we specify the header "Calculated molecular descriptors", and underneath it we display the calculated descriptor file. That file is written by the PaDEL command: after running it we get descriptors_output.csv, which is the name of the output generated once the fingerprint calculation is complete. We read it in, assign it to the desc variable, and print it out using the st.write function, and that gives us the data frame here.

To generate the Download CSV link we use the st.markdown function, and as its input argument we use the custom function called filedownload, offering the descriptors as the download. We also set the option unsafe_allow_html to True, which allows us to display HTML inside the st.markdown function. filedownload is a custom function that we define on lines 27 until 31. It uses the base64 library to encode and decode the CSV data and make it available as a file download. If we click here, we get the CSV file, which I already opened for you a few minutes ago. This is the computed CSV file that I mentioned earlier, which you can use for further machine learning model building. All right, now let's hop on to line number 34.
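
The calculation function and the download link described above might look roughly like the sketch below. It is reconstructed from the narration, not copied from app.py, so several details are assumptions: the folder name ./PaDEL-Descriptor, the JVM and PaDEL-Descriptor flags (check them against the documentation shipped with your copy), the download file name, and the function names build_padel_command, run_padel, and csv_download_link.

```python
import base64
import subprocess

PADEL_DIR = "./PaDEL-Descriptor"  # assumed folder layout


def build_padel_command(fingerprint_xml: str, output_csv: str = "descriptors_output.csv") -> list:
    """Fill the selected fingerprint XML file into the command and split it into arguments."""
    bash_command = (
        "java -Xms2G -Xmx2G -Djava.awt.headless=true "  # 2 GB of RAM, no GUI (assumed flags)
        "-jar %s/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints "
        "-descriptortypes %s/%s -dir ./ -file %s"
    ) % (PADEL_DIR, PADEL_DIR, fingerprint_xml, output_csv)
    return bash_command.split()  # note: this breaks if any path contains spaces


def run_padel(fingerprint_xml: str):
    """Run the calculation and return (output, error) captured from the pipes."""
    process = subprocess.Popen(
        build_padel_command(fingerprint_xml),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return process.communicate()


def csv_download_link(csv_text: str, filename: str = "descriptor_output.csv") -> str:
    """Embed CSV text in an HTML link as a base64 data URI."""
    b64 = base64.b64encode(csv_text.encode()).decode()  # bytes -> base64 -> str
    return f'<a href="data:text/csv;base64,{b64}" download="{filename}">Download CSV File</a>'
```

In the app, the returned HTML string would be passed to st.markdown with unsafe_allow_html=True, as described above. run_padel needs Java and the PaDEL-Descriptor folder on disk, while the other two functions can be tried on their own.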

So lines 34 until 46 are the header of the web app, right here. Line 35 is here, and the one hash sign that you see means it is an H1 heading. An H1 heading in HTML has a very big font size, so it's pretty much a header, similar to using st.header. Then we describe the application a bit on line 37, which corresponds to this text: this app allows you to calculate descriptors of molecules, also known as molecular descriptors, that you can use for computational drug discovery projects such as the construction of quantitative structure-activity relationship (QSAR) or quantitative structure-property relationship (QSPR) models. This particular app lets you calculate 12 different molecular fingerprints, and the fingerprint list is provided here. Then we have the credits on lines 31 until 44: the app is built in Python and Streamlit, the descriptors are calculated using PaDEL-Descriptor, and a full account of the PaDEL-Descriptor software is provided by this research article, published in the Journal of Computational Chemistry back in 2011.
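
As a small illustration of the single hash sign mentioned above, here is a hedged sketch of a Markdown header block passed to st.markdown. The title wording and the heading_level helper are assumptions made for this example; only the idea that one leading "#" produces an H1 heading comes from the video.

```python
APP_HEADER = """
# Molecular Descriptor Calculator App

This app allows you to calculate descriptors of molecules (aka molecular descriptors)
that you can use for computational drug discovery projects, such as the construction of
QSAR/QSPR models.
"""


def heading_level(markdown_line: str) -> int:
    """Count the leading '#' characters: 1 means H1, 2 means H2, 0 means not a heading."""
    stripped = markdown_line.lstrip()
    hashes = len(stripped) - len(stripped.lstrip("#"))
    # A Markdown heading needs a space between the hashes and the text.
    return hashes if stripped[hashes:hashes + 1] == " " else 0


def render_header() -> None:
    """Show the header block in the app; Streamlit is imported only when this runs."""
    import streamlit as st

    st.markdown(APP_HEADER)
```
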

All right, let's look further: from line 49 on is the sidebar. Line 49 makes the header here, "1. Upload your CSV data". Notice that I use the with statement in order to structure the code nicely: it is followed by a colon and then everything underneath is indented. It's the same in the subsequent blocks, so that we know this is part of number one, this is part of number two, and this block is part of number three. Otherwise, if nothing were indented and everything were at the same level, it would be a bit difficult to see visually.

So number one lets the user upload their CSV data. We have the uploaded file variable, which makes use of the file_uploader function. As input arguments we specify the string "Upload your input CSV data" and the type, which we set to csv. We also provide a link in Markdown, "Example CSV data", pointing to the Data Professor GitHub. It's the data file that I showed you a few minutes ago, the same file that we also provide in this repo.

Let's proceed. Lines 55 until 57 are number two, "Enter column names for molecule ID and SMILES". As I mentioned, from the CSV data, or from the example dataset, we read in two columns from the input file: the first is the molecule ChEMBL ID and the second is the canonical SMILES. Let's say that you have your own file and the column name is something different, maybe you just call it ChEMBL ID. Then you'll have to modify the name here so that the web app is able to recognize that column and select it when making the subset, formatting it for the subsequent molecular fingerprint calculation. Do the same for the SMILES column: if you call it something else, update the name here as well.

That was lines 55 until 57, where we make use of a simple text_input function to display each text box. The first string argument is the label shown above the text box, and the second string is the default value. If you don't want to put anything there, you just make it an empty pair of quotation marks and the box will be blank. But when it's blank and you click on the example, the data is incomplete and the app doesn't know what to do, because it cannot find the molecule ID. So we'll leave the default in so that it works.
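
The column selection described above, including what goes wrong when a text box is left blank, can be sketched as follows. This is an illustration under stated assumptions: the default column names molecule_chembl_id and canonical_smiles are reconstructed from the narration, the helper names and widget labels are invented for this example, and the standard csv module stands in for the pandas code used in the video.

```python
import csv
import io

ID_COL = "molecule_chembl_id"    # assumed default, reconstructed from the narration
SMILES_COL = "canonical_smiles"  # assumed default, reconstructed from the narration


def select_columns(csv_text, id_col=ID_COL, smiles_col=SMILES_COL, n_rows=10):
    """Return (SMILES, ID) pairs for the first n_rows molecules, SMILES first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in (id_col, smiles_col) if name not in header]
    if missing:
        # This is the situation behind the error shown when a text box is left blank.
        raise KeyError("column(s) not found in the input file: %r" % missing)
    pairs = []
    for index, row in enumerate(reader):
        if index >= n_rows:
            break
        pairs.append((row[smiles_col], row[id_col]))
    return pairs


def sidebar_column_names():
    """Text boxes with default values; Streamlit is imported only when this runs."""
    import streamlit as st

    st.sidebar.header("2. Enter column names")
    id_col = st.sidebar.text_input("Enter column name for Molecule ID", ID_COL)
    smiles_col = st.sidebar.text_input("Enter column name for SMILES", SMILES_COL)
    return id_col, smiles_col
```

Raising a clear error when a column is missing is a design choice of this sketch; the video only shows that the app fails when the names do not match.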
so it will be the selection of the molecular fingerprints and so as mentioned already we are able to select 12 different fingerprints okay and right here set number of molecules to compute will allow us to first read in the input file okay so we're going to read it in and then we're going to determine the shape and then we're going to take the first value and so when we run the shape function on the df0 here which is reading in the csv file we will want to figure out how many rows does it have so the number zero will allow us to retrieve the number of rows in the csv file and so here in this example it has 4 695 rows and then it will be displaying that as a maximum value okay so all mole here variable here all mo variable will have a value of 4695 and therefore in this slider function we will be specifying the max value to be all mo which is 4695 and the minimum value here is 10 and then we have a step size of 10 okay and we have a default value of 10 okay so when we load the web app for the first time it will have a value of 10 okay which is default it's right here value of 10. 
and if you kind of slide it a bit you could use the arrow key to help you just click on it and then use the arrow key okay in order to select the number of molecules to use and then just upload your data or press on the example data sent here and so on lines 82 until 97 we will have the if condition and on lines 99 until 117 we will have the else condition and so line number 82 will evaluate whether we have already uploaded a file the csv file here in number one have we uploaded a file if we haven't uploaded a file let me show you it will be displaying the else okay so if we haven't yet uploaded a file we will be automatically going to the else condition and then the first thing that we're going to be seeing is line number 100 which is st info and then it will have a blue box here a waiting for csv file to be uploaded and then you will see a button here the stdot button press to use example data set and so you're gonna see that if it's embedded in else right and so if the button here is clicked if the buttons is clicked then it will trigger the following statements underneath here and so it will be reading in the file the example file it will be formatting it for paddle descriptor calculation and then it will perform the calculation and after performing the calculation it will read in the descriptor and then finally it will provide it as a csv file to be downloaded and printing out the details of the descriptors okay and on the other hand if we upload the file let me show you here not this one but if we go to streamlets and then we go to the mo desk folder and this example file if we upload it and so just a point note that this file and the example file that we use here are the same thing so it has 4695 rolls all right so i just drag and drop here as you see and then you see the name of the file appearing here and right here so the selected fingerprint is atom2d and the number of molecules to be calculated since 10 and we have generated 780 fingerprints okay and so 
again if a file is uploaded it will perform the if paragraph here otherwise if there is no csv file uploaded it will be displaying the else condition and you will be seeing the blue box saying that it is awaiting for the csv file to be uploaded and so this web application will allow those who are interested in bioinformatics to be able to calculate molecular properties for compounds of interest which they could use for drug discovery projects particularly they could use this application to calculate molecular fingerprints of molecule and then use that descriptor or fingerprint to build machine learning models in order to predict biological activity of compounds or also the chemical properties as well such as the melting point of a molecule or the boiling point of a molecule based on the chemical structure all right and there you have it the molecular descriptor calculator and i hope that this video was helpful to you and please support the channel by smashing the like button subscribing if you haven't already share the video to your friends and don't forget to hit on the notification bell so that you'll be notified of the next video and as always the best way to learn data science is to do data science and please enjoy the journeyin this video we're going to be building a bioinformatics related web application particularly we will be building a molecular descriptor application that will allow us to calculate the molecular properties of molecules and afterwards we will be using machine learning to build predictive models that will allow you to predict biological and chemical properties that will be useful for computational drug discovery and other applications and so without further ado we're starting right now the first thing that you want to do is fire up your terminal and then go to the directory that contains the file and i have it in the desktop in the streamlit directory in the modest folder and let's have a look at the contents and so here we can see that we 
have the paddle folder we have the example csv file that will contain some smiles notation which we will be needing in order to calculate the molecular descriptor and the app.py is the web application let me activate the application so i'll first activate conda and my environment is called data professor and so make sure to activate your own environment and then in order to run the app you will type streamlet run app.py okay and so this is how the application will look like and then you have the sidebar here which will allow you to upload your file allowing you to specify the name of the columns from which we will be retrieving the data from and so here from the input file we will be retrieving two particular columns the molecule sample id and also the canonical smiles so let me show you let me add the modest directory and so here you have the folder which contains the pedal descriptor software and this paddle descriptor folder here is directly downloaded from this link and so the main part of the software will be here the jar file and then the parameter files will be here the xml files so there will be the xml files for each of the molecular descriptor that we want to calculate so in our case here we're going to specify the 12 different types of molecular fingerprints and so if you take a look at the xml file you're going to be seeing here that the value here they're all false and then they're only going to be true for only the descriptor that we want to calculate so in here the pubchem fingerprint file the only value that will be true will be pubchem fingerprints and the rest will have a value of false and same thing for substructure fingerprint count everything will have a value of false except for substructure fingerprint counts which will have a value of true okay and the app.py is the streamlit app that we're going to be showing you today and then the file here with the long name will be the input file and so this file was taken from the bioinformatics from 
scratch series and this file is also available on the github of data professor in the data folder and so here you're going to be seeing that we're going to be retrieving the two columns which includes the molecule chamber id which will contain the chemical identification number of the molecule and then the canonical smiles notation and so the smile sensation will contain information on the chemical structure okay so these are the smile citation and so they will be describing the chemical structure of each of these molecules so one line here each row or line will represent a molecule a unique molecule a lesser duplicate which we will have to remove and so in today's video i'll be showing you how you could convert each of these entry into molecular fingerprints which we could use for predictive model building okay and so now that we have covered the second part of the sidebar here and the third part will be to allow the user to configure which molecular fingerprint that they want to calculate and so here we have a choice of 12 different fingerprints so we have atom pairs 2d atom pair to the count cdk cdk extended cdk graph only e state click code rock click code account max pubchem substructure and substructure accounts and so they will be accompanied by the corresponding xml file that i have previously talked about and so selecting any of these options here will activate the use or trigger the use of each of these xml files okay and let's have a look if we press on the press to use example data set here we're going to be seeing the application in action and so here upon loading up the example data set we're going to be seeing the first 10 lines here and so you're going to be seeing that the first column is the molecule id second column here is the canonical smiles so this tells us about the chemical structure the third column is the class or the bioactivity class and this is based on the ic50 values or values having actually 50 of less than a thousand will be active 
a value greater than 10 000 will be inactive and values between a thousand and ten thousand will be considered to be intermediate and so you could actually use this class as a y variable which you could use for multi-class classification and the other columns here are the descriptors pertaining to the lipinski rule of five and so what that essentially mean is they are parameters that will determine whether each of the compound whether they are drug-like or not and psc50 here is the negative logarithmic transformed value from the initial ic50 values and the reason for transforming it into a pic50 or the negative log transformed version is in order to make it have a more uniform distribution okay so from this initial data frame that you see here we're going to be selecting two columns the canonical smiles and the molecule tempo id and notice here that we reorganize a bit the ordering of the columns and so here we call the canonical smiles to be the first column followed by the molecule chamber id and the reason is because the paddle descriptor software will read in the input file in this particular format okay and after writing out this two column data frame into a csv file we will be using it as input to the paddle descriptor software and the paddle descriptor software will be generating the molecular descriptors in our case the molecular fingerprints and so as you will see here we have selected substructure and we have specified here that we want to compute the molecular fingerprint for 10 molecules and so the first 10 molecules will be subjected to molecular fingerprint calculation so you could feel free to modify this value to other values and the descriptors will be adjusted accordingly so here we specify it to be 70 molecule so the first 70 molecules will be used for molecular fingerprint calculation as you will see here and we also specify the details here as well so the selected fingerprint is substructure and we have 70 molecules and we have a total of 307 
descriptors and so you could download this as a csv file and then we have it here the generated csv file and then we could subsequently use this file for model building but don't forget to delete the first column or drop the first column from your data frame when you're preparing it for model building using scikit learn and notice here that there is no psd50 included here and so you will also have to specify that the psc 50 will be the y variable so all of this 307 fingerprints here will be the x variable so that will be how you will use it for machine learning model building all right and so now that we have a high level understanding of how this molecular descriptor web application works let's have a look at under the hood what does the code looks like so let's open up the app.py file all right so the first five lines here will be importing the necessary libraries that we will be using so streamlit will allow us to build this particular web application pandas will allow us to display the data frame right here subprocess will allow us to perform the molecular fingerprint calculation using paddle because normally we will have to run it in the bash command line and so here you can see that we're using java because the title descriptor software is a jarl file which is a java file and we specify the parameters here to be using two gigabytes of ram for the calculation and in order to make it run in the command line we use this parameters here and we also specify additional parameters as well such as to remove the salt from the molecular structure to standardize the nitrile functional group and to use the fingerprint and the fingerprint will be specified here meaning the descriptor types will be specified here in percent s and percent s will be here the selected fingerprints so in the drop down here that you see if we select substructure then the substructure.xml file will be put here if we select another one like estate then the estate.xml file will be here so recall 
earlier, in the PaDEL-Descriptor folder, we have several XML files, like PubChem, Substructure, or EState. So if we select EState, then the EState fingerprint XML file will be selected, and it will replace the %s right here.

Okay, so we have already discussed line number 10. Lines 8 through 24 are a custom function that performs the molecular descriptor calculation. The precise command that we would normally run in the command line is this one, and we assign it to the bashCommand variable. Then, in order to run it from a Python process, we use the subprocess library and its Popen function, passing the bash command as an input argument and piping the result into output and error here.

After that we specify the header, "Calculated molecular descriptors", which is displayed right here, and underneath it we display the calculated descriptor file. That file is written by the command: after running the PaDEL command we get descriptors_output.csv as output, which is right here. This is the name of the output file that is generated once the fingerprint calculation is complete, so we have to read it in, which we do here, and assign it to the desc variable. Then we print it out using the st.write function, and that gives us the data frame here.

In order to generate this download CSV link we use the st.markdown function, and as input argument we use the custom function called filedownload, offering the descriptors as the download. We also specify the option unsafe_allow_html to be True, which allows us to display the HTML inside the
st.markdown function. The filedownload function is a custom function that we define right here on lines 27 until 31. It uses the base64 library to encode and decode the CSV data and make it available as a file download. So if we click here, you're going to see that we get the CSV file. I already opened it for you a few minutes ago, and here it is again: this is the computed CSV file that I mentioned earlier, which you could use for further machine learning model building.

All right, now let's hop on to line number 34. Lines 34 until 46 are the header of the web app right here. Line 35 is here, and the one hashtag that you see means that it will be an h1 heading, which in HTML means a very big font size. So it's pretty much a header, and it's going to be the same as using st.header. Then we describe the application a bit on line 37, which corresponds to this text here: this app allows you to calculate descriptors of molecules, also known as molecular descriptors, that you could use for computational drug discovery projects such as the construction of quantitative structure-activity relationship (QSAR) or quantitative structure-property relationship (QSPR) models. This particular app allows you to calculate 12 different molecular fingerprints, and the fingerprint list is provided here.

Then we have the credits on lines 31 until 44. The credits say that the app is built in Python and Streamlit and that the descriptors are calculated using PaDEL-Descriptor. A full descriptive account of the PaDEL-Descriptor software is provided in this research article, published in the Journal of Computational Chemistry back in 2011.
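As a rough sketch of the two helpers just described, the PaDEL command can be assembled as an argument list and the CSV text turned into a base64 download link. The function names, the default paths, and the example XML file name below are assumptions for illustration, not the exact contents of app.py; only the standard library is used so the snippet runs on its own.

```python
import base64


def build_padel_command(fingerprint_xml,
                        jar_path="./PaDEL-Descriptor/PaDEL-Descriptor.jar",
                        xml_dir="./PaDEL-Descriptor",
                        output_csv="descriptors_output.csv"):
    """Return the PaDEL call as an argument list suitable for subprocess.Popen.

    The paths are assumed defaults; adjust them to your own folder layout.
    """
    return [
        "java", "-Xms2G", "-Xmx2G",        # give the JVM 2 GB of memory
        "-Djava.awt.headless=true",        # run without a GUI
        "-jar", jar_path,
        "-removesalt",                     # strip salts from the structures
        "-standardizenitro",               # standardize nitro groups
        "-fingerprints",                   # compute fingerprints
        "-descriptortypes", f"{xml_dir}/{fingerprint_xml}",  # selected fingerprint
        "-dir", "./",
        "-file", output_csv,
    ]


def csv_download_link(csv_text, filename="descriptor.csv"):
    """Encode CSV text as base64 and wrap it in an HTML download link."""
    b64 = base64.b64encode(csv_text.encode()).decode()
    return (f'<a href="data:file/csv;base64,{b64}" '
            f'download="{filename}">Download CSV File</a>')
```

In the app, the list would be passed to subprocess.Popen and the link string to st.markdown with unsafe_allow_html=True. Building the command as a list avoids having to split a formatted string on spaces.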
All right, let's have a look further, starting at line 49, which is going to be the sidebar. Line 49 makes the header here, number one, "Upload your CSV data". Notice that I use the with statement here in order to structure the code in a beautiful format: it is followed by the colon, and then everything is indented underneath. It's the same in the subsequent lines as well, so that we know that this is part of number one, this is part of number two, and this block here is part of number three. Otherwise, if nothing is indented and everything is at the same level, it might be a bit difficult to see visually.

So number one allows the user to upload their CSV data. We have the uploaded_file variable here, which makes use of the file_uploader function. As input argument we specify the string "Upload your input CSV data", and the type we specify to be csv. We also provide a link in Markdown, "Example CSV data", pointing to the GitHub of Data Professor. It's the data file that I showed you a few minutes ago; let me show you again right now. It's this file, the same file that we also provide in this repo.

All right, let's proceed further. Lines 55 until 57 are right here, number two, "Enter column names for molecule ID and SMILES". As I mentioned, from the CSV data or from the example dataset we read in two columns from the input file: the first column is the molecule ChEMBL ID and the second column is the canonical SMILES. Let's say that you have your own file and the column name is slightly different, maybe you just call it ChEMBL ID. Then you'll have to modify the name here so that the web app is able to recognize that particular column and select it when making the subset here, so that it can format it for subsequent
molecular fingerprint calculation. You do the same for the SMILES as well: if you call that column something else, feel free to update the name here.

That was lines 55 until 57, where we make use of a simple text_input function in order to display the text box. The first string given as input argument is the label shown before the text box, and the second string is the default value to put in the box. If you don't want to put anything there, you just pass an empty pair of quotation marks and the box will be blank. But when it's blank and you click on the example, the information is not complete, so the app doesn't know what to do: it cannot find the molecule ID. So we'll leave the default in so that it works. If you see an error such as the one that you saw just a moment ago, it means that the app cannot find the molecule ID or the SMILES column, so make sure to specify the proper names in these two text boxes in order for the web app to work properly.

All right, number three is here, on lines 59 until 74.
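To make the column-name requirement concrete, here is a minimal sketch of selecting the two columns by the names the user typed in, with SMILES first and the molecule ID second, and raising a clear error when a name is not found. The app itself works with pandas; this version uses the plain csv module so it runs on its own. The function name and the underscore spellings molecule_chembl_id and canonical_smiles are assumptions.

```python
import csv
import io


def select_smiles_and_id(csv_text,
                         id_col="molecule_chembl_id",
                         smiles_col="canonical_smiles"):
    """Return (smiles, molecule_id) pairs in the column order PaDEL expects.

    Raises KeyError if either column name is missing from the header row.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in (smiles_col, id_col) if name not in header]
    if missing:
        raise KeyError("Column(s) not found in input file: " + ", ".join(missing))
    # SMILES goes first, followed by the molecule ID
    return [(row[smiles_col], row[id_col]) for row in reader]


sample = (
    "molecule_chembl_id,canonical_smiles,pIC50\n"
    "CHEMBL1,CCO,5.0\n"
    "CHEMBL2,c1ccccc1,6.1\n"
)
pairs = select_smiles_and_id(sample)
```

Calling the function with a column name that is not in the file, for example id_col="chembl_id" on the sample above, raises the KeyError, which mirrors the error the web app shows when the text boxes do not match the file.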
It is the selection of the molecular fingerprints, and as mentioned already we are able to select from 12 different fingerprints. Right here, "Set number of molecules to compute" first reads in the input file: we read it in, then we determine the shape, and then we take the first value. When we take the shape of df0 here, which is the CSV file that was read in, we want to figure out how many rows it has, so index zero retrieves the number of rows in the CSV file. In this example it has 4,695 rows, and that is displayed as the maximum value. So the all_mol variable here has a value of 4695, and in this slider function we specify the max value to be all_mol, which is 4695. The minimum value here is 10, we have a step size of 10, and we have a default value of 10. So when we load the web app for the first time, it will have a value of 10, which is the default; it's right here, a value of 10.
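The slider settings described above can be sketched as a small helper that derives the bounds from the row count, plus a slice that keeps only the first N molecules. The helper names are hypothetical and the spelling all_mol is an assumption; the dictionary keys match the keyword arguments of Streamlit's slider.

```python
def slider_settings(n_rows, minimum=10, step=10, default=10):
    """Keyword arguments that could be passed to st.sidebar.slider."""
    return {
        "min_value": minimum,   # smallest number of molecules to compute
        "max_value": n_rows,    # all rows in the uploaded CSV file
        "value": default,       # value shown when the app first loads
        "step": step,
    }


def first_molecules(rows, n):
    """Keep only the first n molecules for fingerprint calculation."""
    return rows[:n]


all_mol = 4695                          # row count from the example file
settings = slider_settings(all_mol)
subset = first_molecules(list(range(all_mol)), 70)
```

In the app the row count would come from the first element of the data frame's shape rather than a hard-coded number.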
And if you slide it a bit, you could use the arrow keys to help you: just click on the slider and then use the arrow keys to select the number of molecules to use. Then just upload your data or press on the example dataset here.

On lines 82 until 97 we have the if condition, and on lines 99 until 117 we have the else condition. Line number 82 evaluates whether we have already uploaded a file, the CSV file here in number one. If we haven't uploaded a file, let me show you, it will be displaying the else. So if we haven't yet uploaded a file, we automatically go to the else condition, and the first thing that we see is line number 100, which is st.info. It shows a blue box here, "Awaiting CSV file to be uploaded", and then you see a button here, the st.button, "Press to use Example Dataset". You can see that this if is embedded in the else. If the button is clicked, it triggers the statements underneath: it reads in the example file, formats it for the PaDEL-Descriptor calculation, performs the calculation, reads in the resulting descriptors, and finally provides them as a CSV file to be downloaded and prints out the details of the descriptors.

On the other hand, if we upload a file, let me show you here. Not this one, but if we go to the streamlit folder, then the moldesc folder, and upload this example file. Just a point to note: this file and the example file that we use here are the same thing, so it has 4,695 rows. All right, so I just drag and drop it here, as you see, and then you see the name of the file appearing here. The selected fingerprint is AtomPairs2D, the number of molecules to be calculated is 10, and we have generated 780 fingerprints. So,
again, if a file is uploaded the app runs the if block here; otherwise, if no CSV file has been uploaded, it shows the else condition and you see the blue box saying that it is awaiting the CSV file to be uploaded.

So this web application allows those who are interested in bioinformatics to calculate molecular properties for compounds of interest, which they could use for drug discovery projects. In particular, they could use this application to calculate molecular fingerprints of molecules and then use those descriptors or fingerprints to build machine learning models that predict the biological activity of compounds, or chemical properties such as the melting point or the boiling point of a molecule, based on the chemical structure.

All right, and there you have it, the molecular descriptor calculator. I hope that this video was helpful to you. Please support the channel by smashing the like button, subscribing if you haven't already, and sharing the video with your friends, and don't forget to hit the notification bell so that you'll be notified of the next video. As always, the best way to learn data science is to do data science, and please enjoy the journey.
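For reference, the if/else flow walked through above (uploaded file, otherwise the info box plus the example-data button) can be summarised as a small decision function. This is only a sketch of the branching logic with hypothetical names and return labels; the real app calls Streamlit functions in each branch.

```python
def decide_action(uploaded_file, example_clicked):
    """Mirror the app's branching: an upload wins, else the example button, else wait."""
    if uploaded_file is not None:
        # "if" branch: compute fingerprints for the uploaded CSV file
        return "compute_uploaded"
    # "else" branch: the info box is shown and the example button is offered
    if example_clicked:
        return "compute_example"
    return "await_upload"
```

With no upload and no click the function returns "await_upload", which corresponds to the blue info box being the only thing shown.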
