Building a Bioinformatics Web App in Python | Streamlit #7

Building a Simple Bioinformatics Web Application for Drug Discovery using Streamlit and Google Colab

In this article, we will explore how to build a simple bioinformatics web application for drug discovery using Streamlit and Google Colab. We will cover the various components of the application, including data preprocessing, molecular descriptor calculation, model building, and prediction.

First, let's start with the code snippet that demonstrates the application:

```python
# Import necessary libraries
import pickle

import pandas as pd
import streamlit as st
from rdkit import Chem
from rdkit.Chem import Descriptors

# Load the model that was trained and pickled in Google Colab
with open('solubility_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Descriptor names, in the column order used in the walkthrough
DESCRIPTOR_NAMES = ['MolLogP', 'MolWt', 'NumRotatableBonds', 'AromaticProportion']


def aromatic_proportion(mol):
    """Return the fraction of heavy atoms that are aromatic."""
    heavy_atoms = Descriptors.HeavyAtomCount(mol)
    aromatic_atoms = sum(atom.GetIsAromatic() for atom in mol.GetAtoms())
    return aromatic_atoms / heavy_atoms if heavy_atoms else 0.0


def generate_smiles(text):
    """Split the input text into one SMILES string per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def compute_descriptors(smiles_list):
    """Compute the four descriptors for each SMILES string as a DataFrame."""
    rows = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # RDKit returns None when it cannot parse the SMILES
            raise ValueError(f'Could not parse SMILES: {smiles}')
        rows.append([
            Descriptors.MolLogP(mol),
            Descriptors.MolWt(mol),
            Descriptors.NumRotatableBonds(mol),
            aromatic_proportion(mol),
        ])
    return pd.DataFrame(rows, columns=DESCRIPTOR_NAMES)


def predict(descriptors):
    """Return one predicted LogS value per row of the descriptor DataFrame."""
    return model.predict(descriptors)


# Create the Streamlit web application
st.title('Bioinformatics Web Application for Drug Discovery')

# Input field: one SMILES string per line (three example molecules by default)
smiles_input = st.text_area('Enter SMILES notation', 'CCCCC\nCCC\nCN', height=200)

# Button to submit the input and make predictions
if st.button('Submit'):
    smiles_list = generate_smiles(smiles_input)
    if not smiles_list:
        # Display a message if no input is entered
        st.write('Please enter some SMILES notation to make predictions.')
    else:
        try:
            descriptors = compute_descriptors(smiles_list)
        except ValueError as err:
            st.error(str(err))
        else:
            # Show the computed descriptors, then the predictions
            st.write(descriptors)
            st.write('Predicted LogS values:')
            for i, pred in enumerate(predict(descriptors)):
                st.write(f'Compound {i + 1}: {pred:.4f}')
```

The application consists of several components:

1. **Data Preprocessing**: The code defines two functions. `generate_smiles` splits the text entered by the user into one SMILES string per line, and `compute_descriptors` converts each molecule into the molecular descriptors the model expects, collected with the pandas library. In the video walkthrough these are four descriptors: MolLogP, molecular weight, number of rotatable bonds, and aromatic proportion.

2. **Model Building**: The model is built separately, in a Google Colab notebook, as a scikit-learn linear regression trained on the molecular descriptors, and is saved to `solubility_model.pkl`. The application loads this file with the `pickle.load` function. The loaded model takes molecular descriptors as input and predicts LogS values.

3. **Prediction**: The application defines a function, `predict`, which passes the computed molecular descriptors to the loaded model and returns one predicted LogS value per input molecule.

4. **Streamlit Web Application**: The application creates a Streamlit web interface with an input field to enter SMILES notation, a button to submit the input, and a section to display the results.
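
The save-then-load round trip described in step 2 can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the tutorial's code: a plain dict of coefficients stands in for the scikit-learn model, `predict_logs` is a hypothetical helper, and the coefficient values are those commonly quoted for Delaney's ESOL equation, used here only for illustration (the model refitted in the video has its own values, and the descriptor values below are made up).

```python
import os
import pickle
import tempfile

# Stand-in for the trained model: the coefficients of a linear equation.
# Illustrative values only (commonly quoted ESOL coefficients), not the video's fit.
model = {
    'intercept': 0.16,
    'coef': {
        'MolLogP': -0.63,
        'MolWt': -0.0062,
        'NumRotatableBonds': 0.066,
        'AromaticProportion': -0.74,
    },
}


def predict_logs(model, descriptors):
    """Evaluate LogS = intercept + sum(coef_i * descriptor_i) for one molecule."""
    return model['intercept'] + sum(
        model['coef'][name] * value for name, value in descriptors.items()
    )


# Save the model to disk (the notebook side of the workflow) ...
path = os.path.join(tempfile.mkdtemp(), 'solubility_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# ... and load it back (what the web application does at start-up).
with open(path, 'rb') as f:
    loaded_model = pickle.load(f)

# Hypothetical descriptor values for a single molecule.
example = {
    'MolLogP': 1.0,
    'MolWt': 100.0,
    'NumRotatableBonds': 2,
    'AromaticProportion': 0.0,
}
print(round(predict_logs(loaded_model, example), 3))
```

Two practical points apply to the real model file as well: only unpickle files from a source you trust, and, as the video notes, load the file with the same scikit-learn version that saved it to avoid version warnings.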

When the user enters one or more SMILES strings and clicks the "Submit" button, the application splits the input into individual molecules and computes the molecular descriptors for each one. It then makes predictions using the loaded model and displays the predicted LogS value for each compound.

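The application can be used to estimate LogS values for small, drug-like molecules similar to those the model was trained on. According to the video walkthrough, the pre-built model is a linear regression fitted to the Delaney solubility dataset, with a reported R-squared of 0.77 and a mean squared error of 1.01, so its predictions are rough estimates. Peptides, proteins, and other compounds far outside the training data should not be expected to give reliable results.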
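
The video walkthrough judges the model by its mean squared error and R-squared, using scikit-learn's `mean_squared_error` and `r2_score`. As a sketch of what those two numbers measure, the definitions can be written out in plain Python. The helper names `mse` and `r_squared` and the sample values are made up for illustration and are not the tutorial's data.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up experimental and predicted LogS values, for illustration only.
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]

print(f'MSE = {mse(y_true, y_pred):.3f}')
print(f'R2  = {r_squared(y_true, y_pred):.3f}')
```

An R-squared close to 1 means the predictions track the experimental values closely, while the mean squared error is expressed in squared LogS units.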

One of the advantages of this application is that it accepts multiple lines of SMILES notation, with each line representing a separate molecule, so several compounds can be scored in a single submission. Additionally, the application can be modified to use a different machine learning model or to incorporate additional descriptors alongside the molecular weight and MolLogP values it already uses, provided the model is retrained on the same set of descriptors.
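
The one-molecule-per-line convention can be sketched in a few lines of plain Python. This is a minimal sketch: `split_smiles` is a hypothetical helper name, and skipping blank lines is an assumption about how stray empty lines should be treated.

```python
def split_smiles(text):
    """Return one SMILES string per non-empty line of the input text."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Three example molecules with a stray blank line and a trailing newline.
example_input = 'CCCCC\nCCC\n\nCN\n'
molecules = split_smiles(example_input)
print(molecules)  # ['CCCCC', 'CCC', 'CN']
```

Because the helper always returns a list, a single SMILES string and a multi-line input can be handled by the same downstream code.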

Overall, this application demonstrates how bioinformatics and machine learning can be combined to predict LogS values for small molecules. It provides a simple and user-friendly interface for users to input molecules and view the predicted LogS values, making it a useful starting point for researchers in the field of drug discovery.

"WEBVTTKind: captionsLanguage: enwelcome back to the data professor youtube channel my name is shannon mod and i'm an associate professor of bioinformatics in this video i'm going to show you how you can build a bioinformatics web application and without further ado we're starting right now so in prior episodes i have shown you how you could use the streamlet library in python to build simple web application ranging from a simple financial web application where you could check the stock price a simple web application where you could predict the boston housing price a penguin species prediction web application and so in today's episode we're going to talk about how you could build a simple bioinformatics web application and it's going to be based on the prior tutorial videos that are mentioned in this channel so the bioinformatics web application that we're going to be building today will be an extension of a tutorial series where i have shown you how you could build a molecular solubility prediction model using machine learning where particularly we are applying machine learning and python to the field of computational drug discovery and if you think of it in the grand scheme of things it is part of the bioinformatics research area and so this video will focus more on the aspect of actually building the web application and if you're interested in how to build the prediction model on the molecular solubility that we will be using today let me refer you to the prior tutorial videos on this channel and the links will be provided in the video description and also the pen comments of this video okay and so let's get started shall we so we're going to go to the streamlet directory and then it's going to be located in the solubility folder and so all of these are the actual files that we're going to be using today and so actually i think i have the prediction model on google colab let me download that okay it's right here and so this will be a concise version of the 
contents of a prior video where i have shown you how you could build machine learning models of the molecular solubility data set and so as i mentioned already i will provide the links to this video and so let's connect to this and let's have a look so let me clear all of the outputs and then it's already connected and so we're going to import pandas as pd we're going to be downloading the calculated descriptors directly from the github of the data professor and so this has already been computed as mentioned in the prior video and so these are the computed descriptors moloch p mole weight number of rotatable bonds aromatic proportion so these four are the computed descriptor from the prior video that i have mentioned and the log s is the y variable that we will be computing or predicting so these four variables are the x variables okay and so we're going to separate this data frame into two sets of variables one is the x variable so we'll be dropping the log s column and then we're going to have the last column as the index number is indicating here minus one which is log s okay so the x variables will be containing all of the columns except for log s and the y variable will be containing only the log s okay and so we're going to be building the linear regression model here and so we're going to import the linear model from sklearn and we're going to also compute the performance metric by importing mean squared error and the r2 score function from the sklearn.metric so let's run that and then we're going to run the linear regression and then we're going to assign this to model and then we're going to perform the model.fit using the x variables in order to predict the y so we will be building a training model here x not defined okay so i haven't yet run this so let me do that first all right so let's do the model building okay so the model has already been built and then we're going to be performing the prediction all right prediction has been made and it is 
assigned to the y pred variable and the prediction is made using the model dot predict and then using x as the input argument and then we're going to print the model performance here and these four values are the regression coefficient values for each of the four input variables of the x comprising of molar p molecular weight number of rotatable bonds aromatic proportion so the regression coefficient will represent the magnitude that each of these four variables are contributing to the prediction of the y so the greater the magnitude the greater the influence of that particular variable and so the y-intercept is right here the mean squared error 1.01 the r-squared value 0.77 and then let's print out the model equation okay so this is the equation and then we're going to visualize the scatter plot of the actual versus the predicted okay so let's run the scatter plot here all right so this is the experimental versus the predicted log s value so as we have seen here from the r squared we're getting pretty good correlation between the experimental and the predicted log s values and finally we're going to pick up the object here so essentially we're going to save the model into a file called solubility model.pkl and so we're gonna import this into the streamlit web application and so we have already run it and then we could simply click on this corner here and then click on the download file okay and it's essentially the same thing that we have right here but it should be noted that the version of scikit-learn on google collab and on the local version that i have installed will be different and it will be giving some warning values so it might be worthwhile to copy all of the code here into a file and then run it locally so that it's going to be using the same version of the scikit-learn okay so let's think of that as a homework for you guys and let's get started in building our web application now so let's head over to the streamlight folder solubility and let's have a 
look at the contents here so the delani solubility here so these are the input descriptors comprising of the x and y variables that we have been using just a moment ago to build the prediction model so i'm going to provide you these files in the github that is dedicated to this tutorial and also the jupiter notebook that we have seen just a moment ago for building the prediction model of this solubility data set and so that jupiter notebook will be using this input file here but also in the code it is downloading directly from the github of the data professor so actually we don't actually need this as well so i i can just delete that and then i'm just going to provide you the julian notebook and okay so a total of three files will be used for this web application so the first one is the logo so this is the logo that we will be using for the web application so i have drawn this in the ipad using the good notes application so here it is called molecular solubility prediction app so given input molecule the machine learning model will be predicting the log s value okay so let's fire up the atom editor let's have a look at the contents of solubility app dot py okay so let's see so there's a total of about 110 lines of code and so notice that i have also included several lines of code that are essentially the comments here and so these are just for ease of reading having a look at what each block of codes are doing all right so if deleting the comments it will be probably just under 100 lines of code so let's take a look at the code here okay so the first block of code that we will be using are essentially importing libraries that are needed here and so we'll be importing the numpy and so numpy will be used for the descriptor calculation we'll be using pandas because we need to read in the data set and actually we will be using it to prepare the data frames as well in the generate code here the function that we use for computing the molecular descriptors so full detail 
as i have mentioned will be provided in a prior video that i have shown you in a step-by-step manner and the hero of this tutorial is the streamlet library so this is the basis of this web application and we'll be importing pickle in order to allow us to save the machine learning model and then importing it in to this web application and then we're going to be using the image function from pil in order to display the logo that i have shown you just a moment ago and here we'll be using the chem and descriptors function of the rd kits will be for computing the molecular descriptors and the molecular descriptors are essentially allowing us to describe molecules in terms of their physical chemical properties and so for ease of usage we have already created the custom function for calculating the molecular descriptors and they are provided here from lines here 12 which are the comments until lines 57 and so we will be having two custom function aromatic proportion and the generate function and let's take a look at the web application so let me run the web application right now conda activate dp you don't need to do this if you have already installed all of the libraries up for your python directly in the command line but if you use conda then you want to activate the environment that is dedicated for running your data science projects so i highly recommend installing conda and then creating a specialized environment for your data science projects so that will help us to maintain the library dependencies otherwise if you have several projects on your computer and then when you upgrade one library it might downgrade other libraries or other dependencies as a result and then it might make some of your prior data science project not workable so i highly recommend to use conda and specialized environment for managing your data science project okay so we have already activated dp here and we're going to go to the directories streamlets solubility okay so we have three files 
here and we're going to run it streamlit run solubility app dot py okay and now this is the web application okay so the left hand side here is the input parameters which is the smiles input will represent the chemical information of the input molecule so each line represents a different molecule so we have three lines here as an example and you could replace this with your own data and then we will be able to predict the value of the solubility as a function of the input smiles notation here so let me repeat this again this portion the smiles notation here each line represents a single molecule so you're seeing here three lines so it represents three molecules and each smiles notation will tell us what is the atomic composition of the molecule and so here we see that the first line of code here cccc it has five carbon atoms and the second one has three carbon atoms and third one is a carbon atom and a nitrogen atom okay and we can even search for smiles of a molecule of our interest and let's search for a molecule of our interest let's search for aspirin and let's click on it and let's find smiles i'm going to search for it command f and then search for smiles and we have it here in 2.1.4 canonical smiles so we're going to copy that so this is the smiles notation and so i'm going to copy and then i'm going to paste it here and then after we paste it here we have to press command and enter in order to apply this and then note that the predicted value here will be updated so let's try it command enter because i'm on a mac so i have to do command enter and so on windows is probably ctrl enter and so here we have the input smiles notation which is right here the same thing on the left and then below here we have the computed molecular descriptors which are the four physical chemical properties here comprising of the molar p molecular weight number of rotatable bonds aromatic proportion so these four variables are computed using the rd kids library let me go back here 
rd kit right here rd kit rd kit okay so using the chem and descriptors function and using the custom function that we have written to compute the aromatic proportion which is right here aromatic proportion and also the mole log p which is the first column here molecular weight which is the second column here number of rotatable bonds which is right here and the aromatic proportion is a custom function because rd kit does not compute that property and so we will have to create our own function in order to compute the aromatic proportion and so why did we use these four variables here how did we know that we have to compute these four descriptors it's because the original work by john dilani he used these four variables for building his prediction model so if you would like for more information please have a look at this original research paper and so here finally we have the predicted log s value and we have it predicted to be minus 2.09 so this is the relative solubility value okay so there we have it a simple web application for predicting the molecular solubility values so let's have a look further into the code so image this line of code here 63 line 63 is this image it is for displaying the image by essentially assigning it opening it and then assigning it to the image variable and then using the st.image in order to actually display the image and then use column width to be true in order to expand image to fit the column width and then here we're going to write out molecular solubility prediction app right here we can even modify this to be like web web app save it and then rerun it and then you're going to see that the name is updated to be web app and then these are the descriptors so this is in markdown format so markdown is allowing us to format text to have links italic or bold face as well for example if we want to make the text bold we're going to be using double asterisk before and after okay for highlighting solubility lock s we're going to add the 
double asterisk before and after if we want this to be bold and also in italic we need three of them click on always rerun so you see that it is in italic okay and also bold and what if we have only one italic what happens here so a single asterisk will make it in a italic form and if we have two it becomes bold okay and if we want to add links to it then we have the bracket here to define the boundaries of the text that we want to be linked and then in parenthesis we're going to add the link the url here and so you can click on this and it will take us to the original paper okay and this is the original paper it was published in 2004 okay so this block of code here will be reading in the input features which is the smiles notation so sc.sidebar header will be the header here the heading user input features and then read smiles input so this is the example input that we will be using and notice here that we use 5c let's rerun this so 5c is right here and then notice that when we have a backslash n it becomes a new line and then we have three c which is right here and then we have a backslash n which is a new line and then it becomes cn okay which is the third line here okay ccc here ctc here and then cn and cn okay and this is the input text of the smiles notation okay so the text box that we see here comes from this text area function and the input argument include the smiles input which is the name of this right here and then the smiles underscore input will be the example smiles notation that we have used here so if we change this we change the first one from c to n and then this will be changed right it becomes n now okay and then we're going to add a dummy first item here in order to allow us to simply read in multiple lines of smiles notation and then later we're going to be skipping that dummy first item and so we're going to do the same for multiple times on line number 91 96 and also when we make the prediction so this will allow us to simply create a data 
frame of the smiles notation and also for generating the data frame of the computed molecular descriptors for a single input parameter so imagine if we have only a single molecule okay this works okay but if we don't have the dummy items let me show you which is at three places lines number 87 96 110. so let's save it so without the dummy let's see what happens here okay and when we try to make a prediction we get an error right here let's you okay so this is having okay it's skipping here okay i have to hide this one too 91 as well so actually it is 87 91 96 and also 110. okay let's rerun it so this will work right oh okay so i think we're skipping oh yeah we cannot okay we have to display x here otherwise nothing will be shown and then we have to display smiles here otherwise nothing will be shown see nothing is shown here and then we need to show the prediction so we need to type in prediction okay so instead of slicing from the second value onwards we're going to have it printing all of it okay now it should work all right it works now so notice that it works if it has multiple lines for the smiles notation here but what if it has only one okay only one command enter and now we have an error here okay so for ease of handling a single input parameters we're going to use the dummy item as i have mentioned right here we're going to be reading the second value onwards okay now it works right we have a single value here and we make a prediction it works okay all right let's continue so i have already mentioned that we have the input smiles notation here and the computed molecular descriptors is provided here in this block of code so very simple here so all of the hard work is done here in the custom function and so we're going to be using only the generate because generate will be using the aromatic proportion right here aromatic proportion function and then we're going to be using the generate function right here line number 95 and then the input argument will be 
the smiles and the smiles here will be what it will be the input smiles that we have okay so these will be the input right so it will be split according to the backslash n so each line of code here each line of the smiles notation will represent a unique molecule okay and then each molecule will be computed for its molecular descriptors and then it will be shown here in the x variable so if we recall y equal to f of x therefore x will be the input parameters for predicting the y value which is the log s okay and so the model has already been pre-built in the google collab just a moment ago and we are loading in the model here using the pickle.load function and the solubility model dot p k l and then we're going to make use of this loaded model here load model dot predict and then using the input x which we have already computed here right here right so there's three major component so the input smiles here is here these are the example right but if we change this to something else like aspirin right then this whole thing will be here we'll be here okay we'll be like that okay and then it will enter here and then it will be subjected to descriptor calculation using the generate function and then finally we have the x variables and then the x variable here will be used as the input argument when we want to make a prediction okay and so finally here we're going to be using the computed molecular descriptors that are contained within the x variables and then we're going to use the loaded model that we have already predicted which we have already built on the google code lab and then we save it out using the pico object and then we're loading it back in into this streamlit web application and then we're going to be applying this built model here load model dot predict and then the input argument will be x which are the computed molecular descriptors and finally here st dot header is corresponding to right here predicted lock s values and then the actual prediction will 
be shown right here prediction and it's going to be showing right here okay so we can have multiple lines let's say that we have multiple on input files so here let's say that we have multiple molecules let's change this to something else so we are going to get a different molecule so we're going to see that the prediction value also changes right if we want to change a carbon to the nitrogen what will happen here right so the log s value changes here and so you're going to see that it influences the molecular weight and also the molar p but not the number of rotatable bonds and the aromatic proportion so changing a carbon to a it's essentially changing one atom and so it will be influencing two descriptors here all right so congratulations you have now built a very simple bioinformatics web application particularly for drug discovery and so if you're finding value in this video please give it a thumbs up subscribe if you haven't yet done so hit on the notification bell in order to be notified of the next video and as always the best way to learn data science is to do data science and please enjoy the journey thank you for watching please like subscribe and share and i'll see you in the next one but in the meantime please check out these videoswelcome back to the data professor youtube channel my name is shannon mod and i'm an associate professor of bioinformatics in this video i'm going to show you how you can build a bioinformatics web application and without further ado we're starting right now so in prior episodes i have shown you how you could use the streamlet library in python to build simple web application ranging from a simple financial web application where you could check the stock price a simple web application where you could predict the boston housing price a penguin species prediction web application and so in today's episode we're going to talk about how you could build a simple bioinformatics web application and it's going to be based on the prior 
tutorial videos that are mentioned in this channel so the bioinformatics web application that we're going to be building today will be an extension of a tutorial series where i have shown you how you could build a molecular solubility prediction model using machine learning where particularly we are applying machine learning and python to the field of computational drug discovery and if you think of it in the grand scheme of things it is part of the bioinformatics research area and so this video will focus more on the aspect of actually building the web application and if you're interested in how to build the prediction model on the molecular solubility that we will be using today let me refer you to the prior tutorial videos on this channel and the links will be provided in the video description and also the pen comments of this video okay and so let's get started shall we so we're going to go to the streamlet directory and then it's going to be located in the solubility folder and so all of these are the actual files that we're going to be using today and so actually i think i have the prediction model on google colab let me download that okay it's right here and so this will be a concise version of the contents of a prior video where i have shown you how you could build machine learning models of the molecular solubility data set and so as i mentioned already i will provide the links to this video and so let's connect to this and let's have a look so let me clear all of the outputs and then it's already connected and so we're going to import pandas as pd we're going to be downloading the calculated descriptors directly from the github of the data professor and so this has already been computed as mentioned in the prior video and so these are the computed descriptors moloch p mole weight number of rotatable bonds aromatic proportion so these four are the computed descriptor from the prior video that i have mentioned and the log s is the y variable that we will be 
computing or predicting so these four variables are the x variables okay and so we're going to separate this data frame into two sets of variables one is the x variable so we'll be dropping the log s column and then we're going to have the last column as the index number is indicating here minus one which is log s okay so the x variables will be containing all of the columns except for log s and the y variable will be containing only the log s okay and so we're going to be building the linear regression model here and so we're going to import the linear model from sklearn and we're going to also compute the performance metric by importing mean squared error and the r2 score function from the sklearn.metric so let's run that and then we're going to run the linear regression and then we're going to assign this to model and then we're going to perform the model.fit using the x variables in order to predict the y so we will be building a training model here x not defined okay so i haven't yet run this so let me do that first all right so let's do the model building okay so the model has already been built and then we're going to be performing the prediction all right prediction has been made and it is assigned to the y pred variable and the prediction is made using the model dot predict and then using x as the input argument and then we're going to print the model performance here and these four values are the regression coefficient values for each of the four input variables of the x comprising of molar p molecular weight number of rotatable bonds aromatic proportion so the regression coefficient will represent the magnitude that each of these four variables are contributing to the prediction of the y so the greater the magnitude the greater the influence of that particular variable and so the y-intercept is right here the mean squared error 1.01 the r-squared value 0.77 and then let's print out the model equation okay so this is the equation and then we're going to 
visualize the scatter plot of the actual versus the predicted okay so let's run the scatter plot here all right so this is the experimental versus the predicted log s value so as we have seen here from the r squared we're getting pretty good correlation between the experimental and the predicted log s values and finally we're going to pick up the object here so essentially we're going to save the model into a file called solubility model.pkl and so we're gonna import this into the streamlit web application and so we have already run it and then we could simply click on this corner here and then click on the download file okay and it's essentially the same thing that we have right here but it should be noted that the version of scikit-learn on google collab and on the local version that i have installed will be different and it will be giving some warning values so it might be worthwhile to copy all of the code here into a file and then run it locally so that it's going to be using the same version of the scikit-learn okay so let's think of that as a homework for you guys and let's get started in building our web application now so let's head over to the streamlight folder solubility and let's have a look at the contents here so the delani solubility here so these are the input descriptors comprising of the x and y variables that we have been using just a moment ago to build the prediction model so i'm going to provide you these files in the github that is dedicated to this tutorial and also the jupiter notebook that we have seen just a moment ago for building the prediction model of this solubility data set and so that jupiter notebook will be using this input file here but also in the code it is downloading directly from the github of the data professor so actually we don't actually need this as well so i i can just delete that and then i'm just going to provide you the julian notebook and okay so a total of three files will be used for this web application so the 
first one is the logo. This is the logo that we will be using for the web application; I drew it on the iPad using the GoodNotes application. The app is called Molecular Solubility Prediction App: given an input molecule, the machine learning model will predict the LogS value.

Okay, so let's fire up the Atom editor and have a look at the contents of solubility-app.py. There's a total of about 110 lines of code, and notice that I have also included several lines that are just comments. These are for ease of reading, so you can see what each block of code is doing. If we deleted the comments, it would probably be just under 100 lines of code.

So let's take a look at the code. The first block imports the libraries that are needed. We'll be importing NumPy, which will be used for the descriptor calculation. We'll be using pandas because we need to read in the data set, and we will also be using it to prepare the data frames in the generate code here, the function that we use for computing the molecular descriptors. Full details, as I have mentioned, are provided in a prior video where I showed this step by step. The hero of this tutorial is the Streamlit library, which is the basis of this web application. We'll be importing pickle, which allows us to save the machine learning model and then load it into this web application. Then we're going to use the Image module from PIL to display the logo that I showed you just a moment ago. And here we'll be using the Chem and Descriptors modules of RDKit for computing the molecular descriptors. The molecular descriptors essentially allow us to describe molecules in terms of their physicochemical properties. For ease of use, we have already created the custom
functions for calculating the molecular descriptors, and they are provided here from line 12, which is a comment, until line 57. We have two custom functions: the aromatic proportion function and the generate function.

Let's take a look at the web application, so let me run it right now: conda activate dp. You don't need to do this if you have already installed all of the libraries for your Python directly from the command line, but if you use conda, then you want to activate the environment that is dedicated to running your data science projects. I highly recommend installing conda and then creating a specialized environment for your data science projects, because that helps to maintain the library dependencies. Otherwise, if you have several projects on your computer, upgrading one library might downgrade other libraries or dependencies as a result, and that might make some of your prior data science projects unworkable. So I highly recommend using conda and a specialized environment for managing your data science projects.

Okay, so we have already activated dp here, and we're going to go to the Streamlit solubility directory. We have three files here, and we're going to run it: streamlit run solubility-app.py. And now this is the web application.

The left-hand side here holds the input parameters, which is the SMILES input that represents the chemical information of the input molecule. Each line represents a different molecule, so we have three lines here as an example, and you could replace these with your own data. Then we will be able to predict the solubility value as a function of the input SMILES notation. Let me repeat this again: in the SMILES notation here, each line represents a single molecule, so the three lines you're seeing represent three molecules, and each SMILES notation tells us the atomic composition of the molecule. And so here we see
that the first line here, CCCCC, has five carbon atoms, the second one has three carbon atoms, and the third one is a carbon atom and a nitrogen atom.

We can even search for the SMILES of a molecule of our interest. Let's search for aspirin, click on it, and find the SMILES. I'm going to search the page with Command-F for "smiles", and we have it here in section 2.1.4, Canonical SMILES. This is the SMILES notation, so I'm going to copy it and paste it here. After we paste it, we have to press Command and Enter in order to apply it, and then note that the predicted value will be updated. So let's try it: Command-Enter, because I'm on a Mac; on Windows it is probably Ctrl-Enter.

And so here we have the input SMILES notation, the same thing as on the left, and below it we have the computed molecular descriptors, which are the four physicochemical properties: MolLogP, molecular weight, number of rotatable bonds, and aromatic proportion. These four variables are computed using the RDKit library. Let me go back here: RDKit, right here. So we use the Chem and Descriptors modules together with the custom function that we have written to compute the aromatic proportion, which is right here. MolLogP is the first column, molecular weight is the second column, the number of rotatable bonds is right here, and the aromatic proportion comes from a custom function because RDKit does not compute that property, so we have to create our own function to compute it.

So why did we use these four variables? How did we know that we have to compute these four descriptors? It's because in the original work by John Delaney, he used these four variables for building his prediction model. So if you
would like more information, please have a look at the original research paper. And finally here we have the predicted LogS value, which is predicted to be -2.09. This is the relative solubility value. Okay, so there we have it: a simple web application for predicting molecular solubility values.

Let's have a look further into the code. Line 63 is the image: it displays the logo by opening the file, assigning it to the image variable, and then using `st.image` to actually display it, with `use_column_width` set to `True` in order to expand the image to fit the column width.

Then here we're going to write out "Molecular Solubility Prediction App". We can even modify this to be "Web App", save it, and rerun it, and you're going to see that the name is updated to "Web App". And then this is the description, which is in Markdown format. Markdown allows us to format text to have links, italics, or boldface. For example, if we want to make the text bold, we use a double asterisk before and after. So to highlight solubility LogS, we add the double asterisk before and after. If we want it to be bold and also italic, we need three of them. Click on "Always rerun", and you see that it is in italic and also bold. And what if we have only one asterisk? A single asterisk makes it italic, and if we have two it becomes bold. If we want to add a link, we use square brackets to define the boundaries of the text that we want to be linked, and then in parentheses we add the URL. You can click on this and it will take us to the original paper, which was published in 2004.

Okay, so this block of code here reads in the input features, which is the SMILES notation. So `st.sidebar.header` will be the
header here, the heading "User Input Features". Then we read the SMILES input. This is the example input that we will be using, and notice that we use five Cs. Let's rerun this: the five Cs are right here. Notice that when we have a backslash n, it becomes a new line, and then we have three Cs, which is right here, then another backslash n, which is a new line, and then CN, which is the third line. So CCC here and CCC here, CN here and CN here. This is the input text of the SMILES notation.

The text box that we see here comes from the `text_area` function. The input arguments include "SMILES input", which is the label right here, and then `smiles_input`, which is the example SMILES notation that we used here. So if we change the first one from C to N, then this will be changed as well: it becomes N now.

Then we're going to add a dummy first item here in order to allow us to simply read in multiple lines of SMILES notation, and later we're going to skip that dummy first item. We do the same skipping multiple times, on lines 91 and 96 and also when we make the prediction. This allows us to create a data frame of the SMILES notation, and also to generate the data frame of the computed molecular descriptors, even for a single input. So imagine we have only a single molecule: with the dummy item, this works.

But what if we don't have the dummy item? Let me show you. It is at three places: lines 87, 96, and 110. So let's save it without the dummy and see what happens. When we try to make a prediction, we get an error right here. Okay, so it's still skipping here; I have to comment out this one too, on line 91. So actually it is lines 87, 91, 96, and also 110.
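The dummy-item pattern described above can be sketched without Streamlit or RDKit. This is only an illustration under stated assumptions: `parse_smiles` and `describe` are hypothetical helper names (the app does this inline), the dummy value `"C"` is an assumption, and `describe` is a stand-in that returns one placeholder row per SMILES string rather than real descriptors.

```python
# Minimal sketch of the "dummy first item" pattern, assuming the input is
# newline-separated SMILES text as typed into the sidebar text box.
# parse_smiles and describe are hypothetical names used for illustration.

def parse_smiles(raw_text, dummy="C"):
    """Prepend a dummy entry, then split the text into one SMILES per line."""
    return (dummy + "\n" + raw_text).split("\n")


def describe(smiles_list):
    """Stand-in for the descriptor step: one placeholder row per SMILES."""
    return [[len(s)] for s in smiles_list]


raw = "CCCCC\nCCC\nCN"          # example input: three molecules
smiles = parse_smiles(raw)      # the list now starts with the dummy entry
rows = describe(smiles)         # one row per entry, dummy included

# Skip the dummy wherever results are shown, as the app does with [1:].
shown_smiles = smiles[1:]
shown_rows = rows[1:]
print(shown_smiles)

# A single molecule still gives a two-entry list before slicing.
single = parse_smiles("CN")
print(single[1:])
```

The design choice here is that every displayed result is sliced the same way, so the dummy never reaches the user; forgetting one of the slices would show an extra row.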
Okay, let's rerun it. So this will work, right? Oh, okay, nothing is shown. We have to display X here, otherwise nothing will be shown, and we have to display SMILES here, otherwise nothing will be shown. And then we need to show the prediction, so we need to type in prediction. So instead of slicing from the second value onwards, we're going to have it print all of it. Okay, now it should work. All right, it works now.

So notice that it works if there are multiple lines of SMILES notation. But what if there is only one? Only one, Command-Enter, and now we have an error here. So for ease of handling a single input, we're going to use the dummy item as I have mentioned, and read from the second value onwards. Now it works: we have a single value here, we make a prediction, and it works.

All right, let's continue. I have already mentioned that we have the input SMILES notation here, and the computed molecular descriptors are produced in this block of code. It's very simple, because all of the hard work is done in the custom functions. We only need to call generate, because generate itself uses the aromatic proportion function right here. We call the generate function on line 95, and the input argument is the SMILES, which is the input SMILES that we have. It is split according to the backslash n, so each line of the SMILES notation represents a unique molecule. Then the molecular descriptors are computed for each molecule and shown here in the X variable. If we recall y = f(x), then X holds the input parameters for predicting the y value, which is the LogS. And so the model has already been pre-built in Google Colab
just a moment ago, and we are loading the model here using the `pickle.load` function and solubility_model.pkl. Then we make use of this loaded model, `load_model.predict`, with the input X that we have already computed right here.

So there are three major components. The input SMILES is here; these are the examples, but if we change this to something else, like aspirin, then that whole string will be here instead. It then enters here and is subjected to descriptor calculation using the generate function, and finally we have the X variables, which are used as the input argument when we want to make a prediction.

And so finally, we use the computed molecular descriptors that are contained within the X variables together with the loaded model, which we have already built on Google Colab, saved out using pickle, and loaded back into this Streamlit web application. We apply this built model here, `load_model.predict`, and the input argument is X, the computed molecular descriptors. Finally, `st.header` corresponds to "Predicted LogS values" right here, and the actual prediction is shown right below it.

We can have multiple lines, so let's say that we have multiple molecules. Let's change this to something else so we get a different molecule, and we're going to see that the prediction value also changes. If we change a carbon to a nitrogen, what will happen? The LogS value changes, and you're going to see that it influences the molecular weight and also the MolLogP, but not the number of rotatable
bonds and the aromatic proportion. So changing a carbon to a nitrogen is essentially changing one atom, and so it influences two descriptors here.

All right, so congratulations, you have now built a very simple bioinformatics web application, particularly for drug discovery. If you're finding value in this video, please give it a thumbs up, subscribe if you haven't yet done so, and hit the notification bell in order to be notified of the next video. As always, the best way to learn data science is to do data science, and please enjoy the journey. Thank you for watching, and I'll see you in the next one.
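As a footnote to the carbon-to-nitrogen swap discussed above, the change in molecular weight can be checked by hand. The sketch below is plain arithmetic, not the app's RDKit code: it assumes the edit turns CCCCC (pentane, C5H12) into NCCCC (butylamine, C4H11N), uses rounded standard atomic weights, and the names `ATOMIC_WEIGHT` and `molecular_weight` are hypothetical. RDKit's own values may differ in the last decimals.

```python
# Hand calculation of molecular weight before and after swapping one
# carbon for a nitrogen. Atomic weights are rounded standard values.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007}


def molecular_weight(formula):
    """Sum atomic weights over a dict mapping element symbol to atom count."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in formula.items())


pentane = {"C": 5, "H": 12}             # SMILES: CCCCC
butylamine = {"C": 4, "H": 11, "N": 1}  # SMILES: NCCCC

mw_pentane = molecular_weight(pentane)
mw_butylamine = molecular_weight(butylamine)
print(round(mw_pentane, 3), round(mw_butylamine, 3))

# Neither molecule has aromatic atoms, so the aromatic proportion
# (aromatic atoms divided by heavy atoms) is zero in both cases.
heavy_atoms = 5
aromatic_proportion = 0 / heavy_atoms
print(aromatic_proportion)
```

The molecular weight shifts by about one mass unit while the aromatic proportion stays at zero, which is consistent with only some of the four descriptors changing in the demo.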