Building a Simple Bioinformatics Web Application for Drug Discovery using Streamlit and Google Colab
In this article, we will explore how to build a simple bioinformatics web application for drug discovery using Streamlit and Google Colab. We will cover the various components of the application, including data preprocessing, molecular descriptor calculation, model building, and prediction.
First, let's start with the code snippet that demonstrates the application:
```python
# Import necessary libraries
import pickle

import pandas as pd
import streamlit as st
from rdkit import Chem
from rdkit.Chem import Descriptors

# Load the pre-built model that was trained and pickled in Google Colab
with open('solubility_model.pkl', 'rb') as f:
    model = pickle.load(f)


# Define a function to compute molecular descriptors for one SMILES string
def compute_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None when it cannot parse the SMILES string
        return None
    heavy_atoms = mol.GetNumHeavyAtoms()
    aromatic_atoms = sum(atom.GetIsAromatic() for atom in mol.GetAtoms())
    # Assumption: this descriptor set is only illustrative. The names and
    # order must match the features the pickled model was trained on.
    return {
        'NumHeavyAtoms': heavy_atoms,
        'NumRotatableBonds': Descriptors.NumRotatableBonds(mol),
        'AromaticProportion': aromatic_atoms / heavy_atoms if heavy_atoms else 0.0,
    }


# Define a function to split multi-line input into a list of SMILES strings
def generate_smiles(smiles):
    # One SMILES string per line; whitespace and blank lines are dropped
    return [line.strip() for line in smiles.splitlines() if line.strip()]


# Load the dataset from a file (expects a 'smiles' column)
df = pd.read_csv('data.csv')
# Preprocess the data by cleaning the SMILES column and computing descriptors
df['smiles'] = df['smiles'].str.strip()
df['descriptors'] = df['smiles'].apply(compute_descriptors)


# Define a function to make predictions using the loaded model
def predict(descriptors):
    # One row per compound, one column per descriptor
    x = pd.DataFrame(descriptors)
    return model.predict(x)


# Create the Streamlit web application
st.title('Bioinformatics Web Application for Drug Discovery')
# Input field to enter SMILES notation, one compound per line
smiles_input = st.text_area('Enter SMILES notation (one compound per line)', height=200)
# Button to submit the input and make predictions
if st.button('Submit') and smiles_input.strip():
    # Split the input into SMILES strings and compute descriptors for each
    smiles_list = generate_smiles(smiles_input)
    results = [(s, compute_descriptors(s)) for s in smiles_list]
    valid = [(s, d) for s, d in results if d is not None]
    # Report any lines that RDKit could not parse
    for s, d in results:
        if d is None:
            st.write(f'Could not parse SMILES: `{s}`')
    if valid:
        # Make predictions using the loaded model
        y_pred = predict([d for _, d in valid])
        # Display the results
        st.write('Predicted logS values:')
        for (s, _), pred in zip(valid, y_pred):
            st.write(f'`{s}`: {pred:.4f}')
# Display a message if no input is entered
if not smiles_input.strip():
    st.write('Please enter some SMILES notation to make predictions.')
```
The application consists of several components:
1. **Data Preprocessing**: The code defines two functions. `generate_smiles` splits the input text into a list of SMILES strings, one per line, and `compute_descriptors` parses a SMILES string with RDKit and computes its molecular descriptors. The descriptor computation is also applied to each molecule in the dataset using pandas.
2. **Model Building**: The model itself is built and trained beforehand in Google Colab and saved as `solubility_model.pkl`; the application only loads it with `pickle.load`. The model takes molecular descriptors as input and predicts logS values, where logS is the base-10 logarithm of aqueous solubility.
3. **Prediction**: The `predict` function takes the computed molecular descriptors as input and returns the model's predicted logS value for each molecule passed to it.
4. **Streamlit Web Application**: The application creates a Streamlit web interface with a text area for entering SMILES notation, a button to submit the input, and a section that displays the results.
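The hand-off between Colab and the app in step 2 is a pickle round trip. The sketch below illustrates it using only the standard library. The model here is a hypothetical linear model stored as a plain dict, and its coefficients and descriptor names are made up for illustration; in practice the pickled object would be whatever estimator was trained in Colab, which the article does not name.

```python
import os
import pickle
import tempfile

# Hypothetical linear model stored as plain data; the numbers are made up
# for illustration and are not fitted to any solubility data.
trained = {
    'intercept': 0.5,
    'coefficients': {'NumRotatableBonds': -0.1, 'AromaticProportion': -1.0},
}


def save_model(model, path):
    # Colab side: serialize the trained model to a file
    with open(path, 'wb') as f:
        pickle.dump(model, f)


def load_model(path):
    # App side: restore the model; only unpickle files from a trusted source
    with open(path, 'rb') as f:
        return pickle.load(f)


def predict_one(model, descriptors):
    # Linear model: intercept + sum(coefficient * descriptor value)
    return model['intercept'] + sum(
        weight * descriptors[name]
        for name, weight in model['coefficients'].items()
    )


model_path = os.path.join(tempfile.mkdtemp(), 'solubility_model.pkl')
save_model(trained, model_path)
restored = load_model(model_path)
value = predict_one(restored, {'NumRotatableBonds': 2, 'AromaticProportion': 0.5})
```

Because `pickle.load` can execute arbitrary code while deserializing, the app should only load model files that come from a trusted source.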
When the user enters one or more SMILES strings and clicks the "Submit" button, the application splits the input into individual molecules and computes the molecular descriptors for each one. It then passes the descriptors to the loaded model and displays the predicted logS value for each compound.
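The submit flow can be kept separate from the Streamlit widgets, which makes it easy to test. The following is a minimal sketch under stated assumptions: `predict_batch`, `stub_describe`, and `StubModel` are hypothetical names, and the stubs stand in for the RDKit descriptor function and the pickled model so that the example runs without either.

```python
def predict_batch(smiles_list, describe, model):
    """Return (smiles, prediction) pairs; prediction is None if parsing failed."""
    described = [(s, describe(s)) for s in smiles_list]
    valid_rows = [d for _, d in described if d is not None]
    # Predict once for all valid rows, then hand values back in input order
    predictions = iter(model.predict(valid_rows)) if valid_rows else iter(())
    return [
        (s, next(predictions) if d is not None else None)
        for s, d in described
    ]


def stub_describe(smiles):
    # Placeholder for a real descriptor function: rejects strings with '?'
    # and otherwise returns the string length as a single fake descriptor
    return None if '?' in smiles else [len(smiles)]


class StubModel:
    # Placeholder for the loaded model: any object with a predict() method
    def predict(self, rows):
        return [-float(row[0]) for row in rows]


results = predict_batch(['CCO', 'not?smiles', 'c1ccccc1'], stub_describe, StubModel())
```

In the app, the same function would be called with the real descriptor function and the loaded model, and the Streamlit code would only format `results` for display.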
The application can be used to predict logS values for any compound that can be written as a SMILES string. A model of this kind is only as reliable as its training data, however: predictions are most trustworthy for molecules that resemble the training compounds, and large molecules such as peptides and proteins are likely to fall outside that range unless the training set included them.
One advantage of this application is that it accepts multiple lines of SMILES notation, one molecule per line, so several compounds can be scored in a single submission. A complex molecule with multiple rings or functional groups is still written as a single SMILES string. Additionally, the application can be modified to use different machine learning models or to incorporate additional features, such as molecular weight or logP values.
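Handling multi-line input comes down to splitting the text area's contents into one entry per line. The sketch below shows one way to do that, including Windows line endings, blank lines, and stray whitespace. The function name `split_smiles_input` and the optional `deduplicate` flag are assumptions added for illustration.

```python
def split_smiles_input(text, deduplicate=False):
    """Split pasted text into a list of SMILES strings, one per line."""
    # splitlines() handles both '\n' and '\r\n' line endings
    entries = [line.strip() for line in text.splitlines()]
    # Drop lines that were empty or contained only whitespace
    entries = [entry for entry in entries if entry]
    if deduplicate:
        # dict.fromkeys keeps the first occurrence and preserves input order
        entries = list(dict.fromkeys(entries))
    return entries


raw = 'CCO\r\n\r\n  c1ccccc1  \nCCO\n'
all_entries = split_smiles_input(raw)
unique_entries = split_smiles_input(raw, deduplicate=True)
```

Note that this only separates the lines; whether each entry is a valid SMILES string still has to be checked when it is parsed.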
Overall, this application demonstrates how bioinformatics and machine learning can be combined to predict logS values from SMILES input. It provides a simple interface for entering molecules and viewing the predicted values, which makes it a useful starting point for researchers in drug discovery.