Call for Participation in the Open Bioinformatics Research Project

**Open Crowdsourcing Research Project: Computational Drug Discovery with Beta-Lactamase Dataset**

---

### Introduction to the Project

Roughly three months ago, a community post asked whether anyone would be interested in joining an open crowdsourced research paper. Out of 555 votes, about 84 people expressed interest, and over 100 posted their intention to join the initiative. Today the original dataset is being shared with the community. The project focuses on computational drug discovery, specifically on molecules that have been experimentally tested for their ability to bind (or not bind) to a protein called beta-lactamase.

The dataset has already been downloaded from the ChEMBL database and shared as a zip file on GitHub under the repository name "beta-lactamase." The goal is to perform a quick exploratory data analysis (EDA) of this dataset and explore ways to push the project forward. This initiative aims to create one of the first open datasets announced on YouTube, inviting contributions from machine learning experts, statisticians, biologists, and chemists.

---

### Accessing the Dataset

To get started with the dataset, follow these steps:

1. **Download the Zip File**: Grab the zip archive from the GitHub repository (using `wget`, for example).

2. **Unzip the File**: The archive is approximately 1.35 MB and contains 136 CSV files once unzipped.

The dataset consists of 71,973 rows across nine columns, including molecule ChEMBL IDs, canonical SMILES notation (a one-dimensional string representation of chemical structure), target protein names, and bioactivity values. The SMILES notation can be converted into molecular fingerprints using tools such as PaDEL-Descriptor or PaDEL-Py (`padelpy`, a Python wrapper around PaDEL) for further analysis.
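As described above, the 136 CSV files can be read out of the zip archive and concatenated into a single DataFrame. Below is a minimal sketch using Python's `zipfile` library and pandas; a small in-memory archive stands in for the real one, and the file and column names are illustrative, not the dataset's actual schema.

```python
import io
import zipfile
import pandas as pd

# Build a small in-memory zip with two CSV files to stand in for the
# real archive (file names and columns here are placeholders).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("part1.csv", "molecule_chembl_id,pchembl_value\nCHEMBL1,6.2\n")
    zf.writestr("part2.csv", "molecule_chembl_id,pchembl_value\nCHEMBL2,4.8\n")
buf.seek(0)

# Read every CSV inside the zip and concatenate into one DataFrame,
# mirroring the loading approach described above.
with zipfile.ZipFile(buf) as zf:
    frames = [pd.read_csv(zf.open(name))
              for name in zf.namelist() if name.endswith(".csv")]
df = pd.concat(frames, ignore_index=True)
print(df.shape)  # → (2, 2)
```

The same loop applied to the repository's zip file should yield the full 71,973-row DataFrame.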

---

### Exploring the Dataset

The dataset includes the following key columns:

1. **Molecule ChEMBL ID**: A unique identifier for each molecule.

2. **Canonical SMILES**: A string representation of the chemical structure of the molecule.

3. **Target Pref Name**: The name of the target protein (e.g., beta-lactamase).

4. **BAO Label**: The BioAssay Ontology label, describing the assay format used to generate the data.

5. **Standard Relation**: Indicates whether the bioactivity value was reported as an exact measurement (`=`) or only as a bound (e.g., `>10,000` when the assay's measurable limit was reached). Rows without an equals sign are typically dropped, since the exact value is unknown.

6. **Standard Type / Standard Value**: The bioactivity measurement type (e.g., IC50, Ki, potency) and its numeric value, typically reported in nanomolar units.

The pChEMBL value combines IC50 and Ki measurements into a single column by applying a negative base-10 logarithmic transformation to the molar concentration. This value will serve as the target variable (y) for machine learning models, with molecular fingerprints serving as features (X).
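The transformation itself is just the negative log10 of the concentration in molar units. A small sketch, assuming input values reported in nanomolar (the helper name is ours, not part of any library):

```python
import math

def pchembl_from_nM(value_nM: float) -> float:
    """Negative log10 of a bioactivity value (e.g. IC50 or Ki) given in nM."""
    return -math.log10(value_nM * 1e-9)  # convert nM → M, then -log10

print(pchembl_from_nM(100))  # an IC50 of 100 nM → 7.0
```

Note how the log scale compresses a wide range of potencies: a tenfold drop in IC50 raises the value by exactly one unit.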

---

### Handling Missing Data and Duplicates

The dataset contains 71,973 rows, of which approximately 64,000 have valid pChEMBL values. Some molecules have duplicate entries with slightly different bioactivity values. To address this:

1. **Aggregate Duplicate Entries**: Group the data by molecule ChEMBL ID to identify duplicates.

2. **Filter Based on Standard Deviation**: Retain only those molecules where the standard deviation of the pChEMBL values is less than two, then merge each molecule's duplicates into a single row by taking their mean.

This preprocessing step ensures that the final dataset is clean and ready for analysis.
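The group–filter–merge steps above can be sketched in pandas. The column names `molecule_chembl_id` and `pchembl_value` are assumptions about the dataset's schema, and the toy values echo the duplicated-entry example from the video:

```python
import pandas as pd

# Toy data with one duplicated ChEMBL ID (column names are assumptions).
df = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL777", "CHEMBL777", "CHEMBL42"],
    "pchembl_value": [8.05, 7.16, 5.50],
})

# Per-molecule mean and spread; single-entry molecules get NaN std,
# which we treat as zero (no disagreement to worry about).
stats = df.groupby("molecule_chembl_id")["pchembl_value"].agg(["mean", "std"])
stats["std"] = stats["std"].fillna(0)

# Keep molecules whose measurements agree (std < 2), averaged to one row.
clean = stats.loc[stats["std"] < 2, "mean"].rename("pchembl_value").reset_index()
print(clean)
```

Molecules whose replicate measurements disagree by more than the cutoff would simply disappear from `clean`, which is the deletion behavior described above.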

---

### Exploring Target Proteins

The dataset contains data from multiple target proteins, with one protein (beta-lactamase) accounting for nearly 80% of the entries. This imbalance raises interesting questions about how to handle such distributions when building predictive models. Potential approaches include:

1. **Building Separate Models**: Creating individual models for each target protein.

2. **Creating Unified Models**: Developing a single model that covers all target proteins, known as a proteochemometric model.

These considerations highlight the importance of careful data stratification and modeling strategies.
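The skew can be quantified directly with a `value_counts` on the target column. Here a toy series stands in for the real data, and the column name `target_pref_name` is an assumption about the schema:

```python
import pandas as pd

# Hypothetical stand-in for the dataset's target-protein column,
# skewed 80/20 to mimic the imbalance described above.
targets = pd.Series(
    ["Beta-lactamase (dominant)"] * 8 + ["Beta-lactamase (other)"] * 2,
    name="target_pref_name",
)

counts = targets.value_counts()
fractions = counts / counts.sum()
print(fractions)  # the dominant target accounts for 0.8 of the rows
```

Run against the real DataFrame, the same two lines reveal how many rows each target protein contributes before any stratification decision is made.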

---

### Performing Exploratory Data Analysis (EDA)

The dataset provides ample opportunities for EDA. Some suggested analyses include:

1. **Stratifying by Bioactivity Classes**: Categorizing molecules as inactive (pChEMBL < 5), active (pChEMBL > 6), or intermediate (between 5 and 6).

2. **Comparing Molecular Properties**: Analyzing molecular weights, solubility, and other properties across different bioactivity classes.

3. **Visualizing Distributions**: Creating histograms of pChEMBL values to understand their distribution.

Using tools such as PaDEL-Descriptor or PaDEL-Py, you can convert SMILES notation into molecular fingerprints and explore these properties in depth.
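The stratification step can be sketched as a simple threshold function. The cutoffs of 5 and 6 follow the scheme suggested in the video and are adjustable:

```python
import pandas as pd

def bioactivity_class(pchembl: float) -> str:
    # Thresholds from the scheme above: <5 inactive, >6 active, else intermediate.
    if pchembl < 5:
        return "inactive"
    if pchembl > 6:
        return "active"
    return "intermediate"

values = pd.Series([4.2, 5.5, 7.1], name="pchembl_value")
labels = values.apply(bioactivity_class)
print(labels.tolist())  # → ['inactive', 'intermediate', 'active']
```

With these labels in hand, molecular weight, solubility, and other properties can be compared across the three groups.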

---

### Building Machine Learning Models

The dataset is ideal for building machine learning models to predict bioactivity. Potential approaches include:

1. **Regression Models**: Predicting pChEMBL values from molecular descriptors.

2. **Classification Models**: Classifying molecules as active, inactive, or intermediate (multi-class classification) or simply binary classification (active vs. inactive).

Innovative approaches like converting molecules into graphs for deep learning or representing them as text for LSTM-based models can also be explored.
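As one possible starting point, a random-forest regression can be sketched with scikit-learn. The binary "fingerprints" and targets below are synthetic stand-ins, not the real dataset; in the actual project the features would come from PaDEL descriptors and the targets from the pChEMBL column:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: random 64-bit "fingerprints" and a target that
# depends on the first 8 bits, loosely mimicking fingerprint → pChEMBL.
X = rng.integers(0, 2, size=(200, 64))
y = X[:, :8].sum(axis=1) + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))  # R² on held-out data
```

Swapping the synthetic arrays for real fingerprints and pChEMBL values, and the regressor for a classifier, covers both modeling scenarios above.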

---

### Contributing to the Project

Contributions are welcome from anyone interested in participating in this open research project. You can:

1. **Share Your Work**: Upload your Jupyter notebooks, code, and models as pull requests (PRs) on GitHub.

2. **Collaborate on Writing the Paper**: Discuss methods for co-authoring the paper using tools like Google Docs, Overleaf, or GitHub.

3. **Provide Feedback**: Share ideas and feedback in the comments section or via PRs.

The dataset will also be shared on Kaggle, offering another platform for contributions.

---

### Conclusion

This open crowdsourcing research project invites participants to contribute their expertise in machine learning, statistics, biology, or chemistry. By working together, the community can explore innovative approaches to predictive modeling and advance computational drug discovery. Whether you're a seasoned researcher or just starting out, your contribution can make a meaningful impact on this initiative.

---

### Links and Resources

- **GitHub Repository**: [beta-lactamase](https://github.com/username/beta-lactamase)

- **Related Video on Computational Drug Discovery 101**: [Link to Video]

Thank you for your interest in this project. Your creativity and contributions are essential to its success!

"WEBVTTKind: captionsLanguage: enso about three months ago i dropped the question in the community post whether any of you would be interested in joining an open crowdsourcing research paper with me and so out of 555 votes about 84 of you express your interest and over a hundred of you have posted your intention or your interest in joining this particular open crowd sourcing research paper initiative with me where i'll be sharing with you an original data set and so today is the day that we're going to share the original data sets and so the data set is going to be in the domain of computational drug discovery particularly we're going to go to this database and then we're going to download the data set or several data sets and the data set will be for molecules that have been experimentally tested to be able to bind or not bind to this protein called the beta lactamase and so i have already downloaded all of the data set from this database and i've shared this as a zip file on the github called beta lactamase which is the name of the repo and then today we're going to perform a quick eda of this particular data set and then i'm going to provide some general ideas on how you could help to push this particular project forward and so this is going to be probably one of the first open data set to be announced on youtube and i'm very excited to have all of you participate and contribute your machine learning statistics or any other expertise that you could bring to this project and so i'm not sure whether there's going to be a lot of people participating but if you're interested in sharing your jupiter notebook your code your models please feel free to do a pr and i'll try my best to read all of them and those who contributed the most will be considered as the co-author of this particular research project and then finally we're going to write a paper together and so for those of you who are biologists or chemists you could also pitch in so i'll have to figure out again 
how we're going to write the paper together whether on google docs on overleaf or even right inside github or you could also provide some ideas in the comments down below and so let's get started and have a look at the data set so go to this jupiter notebook so i'm going to upload it into the github repo shortly after i record this particular tutorial and so let's start by downloading the zip file called betalactomy's channel29.sip which is in the github repo right here and so we're using wgat to download it and then after downloading it we're going to unzip it so it's about 1.35 megabytes in size and now we're going to unzip the file all right and so let's see how many files there are so let me count it and i'm going to pipe it into the wc command and so we have 136 csv files and so remember this code so i post this particular code in twitter here and so it's actually coming from this particular jupiter notebook and so it receives quite a lot of attention here and so it's just one way of repurposing this particular content and so let's run this so what this code will do is that it's going to use the zip file library in order to retrieve all of the csv files that are inside the zip file and so it will perform a query of the list of the csv files inside the zip file and then in a for loop we're going to one by one retrieve the name from all of the zip file contents that are csv and then it's going to open it and it's going to use the pd.read csv in pandas in order to read it and then for each of the data frames that are created from the 136 csv file we're going to use pd.concat in order to combine it into a single data frame and so let's do it oh see what happens okay i have to rename the file name here let's do it again all right and so as you can see it read in all of the contents of the zip file and let's display the data frame which is stored in the df variable so this is the particular data set that we're going to work on together as a community so you can see 
here that there are 71 973 rows spanning nine columns and so let's have a look at each of the columns here we have the molecule sample id so this column is the unique identifier of the molecule so each molecule will have a unique campbell id and then it's going to have a corresponding canonical smile so this is a one-dimensional representation of the chemical information particularly the chemical structure so it is a one-dimensional representation of the chemical structure and you could also check out a prior video that i've made where you could convert the smiles notation into 2d molecular fingerprint representation which is essentially a molecular descriptor and so molecular descriptors are quantitative or qualitative representation of each molecule and so we are able to convert each molecule by taking the smiles and then using a program called paddle or paddle pi which is a python library which wraps the paddle program and so what it will do is that it's going to take the smiles notation here and it's going to convert that into a set of molecular fingerprints which we can use to build a machine learning model in order to predict the bioactivity okay so i'm going to tell you that in just a moment which column is the bioactivity let's have a look further and so the standard relation column is essentially telling us whether the data that we have got from the original research paper whether they reported it as a finite value or whether it is a greater than a particular value like for example in some papers they might report a value of greater than ten thousand because ten thousand is the maximum limit from which the equipment can measure and so they cannot obtain a finite value therefore they say greater than ten thousand so normally in a typical research project we would delete those values because we're not confident which exact value they are okay so if you would like to keep the rows you might want to keep only those that have a equal sign okay and so this is 
the bioactivity value the standard value okay so the standard value here the values that you see is the bioactivity value and the bioactivity value will be represented in different types of bioactivity types okay so here you can see that there's the kcat km there's the inhibition and if you look further in here in this column you're going to see 50 or ki okay so typically we could decide on whether to use one of these bioactivity types and normally i would use ic50 or ki and in most situations the number of compounds containing ic50 values are usually the greatest and so therefore i would just use ic50 because it normally accounts for the most data however in order to maximize the usage of all of the data it might be a great idea to use the p chamber value and so the p chamber value normally it comes from the ic50 and ki okay so it combines the value of ic50 and ki into this particular column and also apply a negative logarithmic transformation to it and so values will be converted into the log scale and so for the machine learning model building you're going to use this as your y variable okay so if you want to build the machine learning model which is going to be a part of this research project then you want to take the canonical smiles you want to calculate the molecular fingerprints or molecular descriptor using a program such as paddle or pedal pi and then given those molecular descriptors you're going to use all of the descriptor as the x variables and then for the y variable you're going to use the p chamber value okay so that is one scenario you're going to use the p chamber value as the y variable or you could also compare it to another model where you use the bioactivity type to be ic50 but then you're going to use the standard unit okay so you might need to filter i see 50 values and then you're going to use the standard unit of nanomolar here and then you're going to use the standard value that are displayed here for the standard type of ic50 okay so 
that we could compare the models between predicting p chamber value as the y or predicting the ic50 values okay and so in the column here target preference name this is the name of the protein okay so these are the names of the protein and the bao label it means the bio activity ontology label and so you can see here that it essentially describes the assay format of this particular data and so you can see that this particular data was obtained from an assay format or this was obtained from a single protein format and so when you're building a model you might want to separate the data on the basis of the different target proteins so all of the data here comprising of 71 000 rolls it's going to be a mixture of different types of target protein which i'm going to show you in just a moment and so it might be a great idea to filter out the unique protein type and build a individual model for the different target protein however there is also another approach where you could theoretically use all of the data from all target proteins and build a single unified predictive model and so that predictive model is called a proteo chemometric model and actually i've talked about that in my two hour and a half long video about computational drug discovery 101 and so i'm going to provide you the link to that lecture video in the video description okay so let's continue further let's have a look at the bar plot where i created the representation of the missing data or the non-missing data for the p chamber column so as i mentioned earlier that we're going to use the p chamber column as the y variable okay so we want to see how many of the rows contain missing values and how many are usable and so we could see that out of 71 000 about 64 000 contains values okay so it is right here so 64 000 contains the p chamber values and so you're going to notice that when we display only the p chamber value you're going to see the ic50 and also the ki okay however you're going to note here that 
some data entry are redundant meaning that they are duplicates so here you can see that there's molecule chamber 777 and also chamber 777 here and it you're gonna see that there's a discrepancy in the value right where one of the chamber 777 has a standard value or ic50 value of 9 and it also has a value of 70. however if you convert it into the log scale then the value differ slightly between 8.05 and 7.16 but still there is some variants that are present in these two data entry and so what we would normally do is we're going to have to aggregate the molecule chamber id so that they become unique and then we're going to remove the chamber id which contain a standard deviation above two and so if you calculate the standard deviation value for the p chamber value here for molecule chamber seven seven seven if the standard deviation value is greater than two you could delete this particular molecule from the data sets okay however if the standard deviation is less than two we're going to keep it but then we're going to merge it into a single row okay so if it has a duplicate here but you're gonna see that it has many duplicates right there's one two three four okay so if the standard deviation is less than two we're going to keep molecule chamber 777 however we are going to merge it into a single row and then in order to do that we could calculate the mean value or the average of that and so you could calculate the average of the p chamber if it has a standard deviation less than two but if it has a standard deviation greater than 2 we're going to delete the entire chamber 777 from our data set okay so let's proceed further so as i mentioned previously you could aggregate the data by performing this group by function right and then it's going to group the data by the molecule symbol id and so this will become unique and so in order to figure out which one contains multiple values meaning that which one has redundant values you could also calculate the standard 
deviation and if you see that it has a standard deviation it means that there are duplicates in here for this particular tempo id but if it's nan it means that there is no duplicate and so if there's no duplicate we keep it but if they are duplicate we're going to apply the less than two standard deviation as the cutoff and so after you have applied the standard deviation of two has to cut off to keep the redundant molecule but then a single copy of that you're going to merge that into your entire data set where there is a single value like for this one when you see that there's a nand value here there's a single value so there's unique and so we're going to keep that anyway and so once you merge the unique value from the redundant value if that's not confusing to you then we're going to merge that both data together okay so these are just some testing code that i've created um so feel free to modify this okay we're gonna skip over to this one to run it so for this one i've just printed out the number of unique chamber id versus the total number of temple id and so here you can see that the first printed value is the number of unique chamber id and so for this one we selected the molecule temple id and then we applied the unique function to it and for this one the total number of chamber id we applied the link function to the df and so we got a total of 71 000 rows and then we also tested whether there are any missing jumbo id and the result says no there's no missing data here all right let's see what is this one okay so i guess i just tested the code and so you could feel free to delete this one as well to run this code here right i think i wanted to make a plot comparing the unique and the redundant data so please feel free to help me complete this particular code cell and if you want you could help to do a pr to the github repo alright and so earlier on i talked about the different target proteins that are in this particular data set and so if we do a top 50 
you can see that the x-axis lists the name of the target protein and the y-axis lists the number of rows for each target protein so you're going to see that the first one called beta-lactamines mc it contains more than 60 000 data samples and so recall that there are 71 000 data samples originally and so this one accounted for nearly 80 roughly of the entire data set and so it might also be a good idea to compare this particular data set with the rest okay however there are other issues to consider like for example the first target protein contains nearly 80 of the entire data set whereas the others contain a smaller portion so it might be interesting to investigate how this imbalance could be handled like for example you might have a unique approach that could handle such imbalance of the data sets and so i would love to see all of the creative solutions that you come up with right so let's have a look at this one for this one we printed out the top bioactivity units so we're going to visualize it so you're going to see that roughly 60 000 contains data as the potency and only a couple thousand here contains the ic50 value and the ki and the kd okay and so this is from the bioactivity units column let's have a look here for the bio assay anthology the bao label column so roughly yeah 60 000 comes from the assay format so i'm guessing that this one comes from mc the one with the most data right so 60 thousand corresponding and also here assay format potency and app c it's probably the same data set okay so look here what is this and so for this one we've created a histogram for the p chamber column and so you can see the distribution of the values all right and so the challenge here is to create a jupiter notebook where you perform an exploratory data analysis of this data set and what would be useful is you could apply the paddle software or the paddle pi library in python and i've actually created a video about that so you could find a link in the video 
description and so you could convert the smiles notation into molecular fingerprints and then you could perform exploratory data analysis on that and some ideas that you could explore is to do stratification of the data set by the bioactivity class meaning that you take the p chamber value and then you apply a threshold like for example let me jot it down here so if you take the p chamber values and then you could convert this into the qualitative label like for example if the p chamber is less than five you could call this to be you call it as an inactive molecule however if the p chamber values is greater than six you're going to call it active molecule which means that it has good activity however if they are between five and six you're going to call it intermediate and so you could perform exploratory data analysis by comparing the three groups here inactive active and intermediates and for each of the groups here you could compare the molecular fingerprints that are created you could also compare the distribution of the histogram of the page chamber values you could compare the molecular weights for each of the group you might be able to discover whether active molecule prefers molecule that are big or molecules that are small or whether inactive molecule are big or small or you could also explore other parameters of the molecule whether the solubility is polar or a polar for the active or inactive molecule or even the intermediate as well and so please feel free to use your creativity in performing the exploratory data analysis and you could create a jupiter notebook and upload it to the github you could share the link to me as well in the comments down below you could also do a pr or a pool request and share your code or solution or comments so i might also upload this particular data set to kaggle and i'll provide you the link to that also in the video description and if you would like to you could also contribute the jupiter notebook over at kaggle as well 
and aside from doing eda analysis you could also build machine learning models predicting the p chamber value you could make a regression model predicting the p chamber value or you could make a classification model classifying whether the molecule will be inactive active or intermediate in a multi-class classification or it could be a binary classification where you compare between active and inactive okay and so aside from using paddle as the molecular descriptors you could also explore the use of other representation of the molecule for example you could convert it into a molecular graph and then you could apply deep learning to build the model you could make it into a image and then you could build a convolutional neural network you could also represent the molecule as a string of text and then you could use long short term memory lstm to build the model and so as you can see there's so many approaches that you could do with this particular data set and i would love to see all of the solutions that you guys and gals come up with and we'll see how far we can get this open research project going and so i might also create some videos in the future to showcase some of the solutions that you come up with which might also be some guidelines or ideas or inspiration for other participants of this open research project as well and so finally when we write this paper and get this into a manuscript form contributors who contribute the most to this project will be considered as a co-author however it's going to be merit-based and i hope that this will be a fun initiative and if you like it and you would like to help to promote this please give it a like please share the video and subscribe to the channel for future updates and videos and as always the best way to learn data science is to do data science and please enjoy the journeyso about three months ago i dropped the question in the community post whether any of you would be interested in joining an open crowdsourcing 
research paper with me and so out of 555 votes about 84 of you express your interest and over a hundred of you have posted your intention or your interest in joining this particular open crowd sourcing research paper initiative with me where i'll be sharing with you an original data set and so today is the day that we're going to share the original data sets and so the data set is going to be in the domain of computational drug discovery particularly we're going to go to this database and then we're going to download the data set or several data sets and the data set will be for molecules that have been experimentally tested to be able to bind or not bind to this protein called the beta lactamase and so i have already downloaded all of the data set from this database and i've shared this as a zip file on the github called beta lactamase which is the name of the repo and then today we're going to perform a quick eda of this particular data set and then i'm going to provide some general ideas on how you could help to push this particular project forward and so this is going to be probably one of the first open data set to be announced on youtube and i'm very excited to have all of you participate and contribute your machine learning statistics or any other expertise that you could bring to this project and so i'm not sure whether there's going to be a lot of people participating but if you're interested in sharing your jupiter notebook your code your models please feel free to do a pr and i'll try my best to read all of them and those who contributed the most will be considered as the co-author of this particular research project and then finally we're going to write a paper together and so for those of you who are biologists or chemists you could also pitch in so i'll have to figure out again how we're going to write the paper together whether on google docs on overleaf or even right inside github or you could also provide some ideas in the comments down below and 
so let's get started and have a look at the data set so go to this jupiter notebook so i'm going to upload it into the github repo shortly after i record this particular tutorial and so let's start by downloading the zip file called betalactomy's channel29.sip which is in the github repo right here and so we're using wgat to download it and then after downloading it we're going to unzip it so it's about 1.35 megabytes in size and now we're going to unzip the file all right and so let's see how many files there are so let me count it and i'm going to pipe it into the wc command and so we have 136 csv files and so remember this code so i post this particular code in twitter here and so it's actually coming from this particular jupiter notebook and so it receives quite a lot of attention here and so it's just one way of repurposing this particular content and so let's run this so what this code will do is that it's going to use the zip file library in order to retrieve all of the csv files that are inside the zip file and so it will perform a query of the list of the csv files inside the zip file and then in a for loop we're going to one by one retrieve the name from all of the zip file contents that are csv and then it's going to open it and it's going to use the pd.read csv in pandas in order to read it and then for each of the data frames that are created from the 136 csv file we're going to use pd.concat in order to combine it into a single data frame and so let's do it oh see what happens okay i have to rename the file name here let's do it again all right and so as you can see it read in all of the contents of the zip file and let's display the data frame which is stored in the df variable so this is the particular data set that we're going to work on together as a community so you can see here that there are 71 973 rows spanning nine columns and so let's have a look at each of the columns here we have the molecule sample id so this column is the unique 
identifier of the molecule so each molecule will have a unique campbell id and then it's going to have a corresponding canonical smile so this is a one-dimensional representation of the chemical information particularly the chemical structure so it is a one-dimensional representation of the chemical structure and you could also check out a prior video that i've made where you could convert the smiles notation into 2d molecular fingerprint representation which is essentially a molecular descriptor and so molecular descriptors are quantitative or qualitative representation of each molecule and so we are able to convert each molecule by taking the smiles and then using a program called paddle or paddle pi which is a python library which wraps the paddle program and so what it will do is that it's going to take the smiles notation here and it's going to convert that into a set of molecular fingerprints which we can use to build a machine learning model in order to predict the bioactivity okay so i'm going to tell you that in just a moment which column is the bioactivity let's have a look further and so the standard relation column is essentially telling us whether the data that we have got from the original research paper whether they reported it as a finite value or whether it is a greater than a particular value like for example in some papers they might report a value of greater than ten thousand because ten thousand is the maximum limit from which the equipment can measure and so they cannot obtain a finite value therefore they say greater than ten thousand so normally in a typical research project we would delete those values because we're not confident which exact value they are okay so if you would like to keep the rows you might want to keep only those that have a equal sign okay and so this is the bioactivity value the standard value okay so the standard value here the values that you see is the bioactivity value and the bioactivity value will be represented 
5. **Standard Type**: the type of bioactivity measurement, e.g. Kcat/Km, Inhibition, IC50, or Ki. Typically we would pick a single type to work with, normally IC50 or Ki; in most situations IC50 accounts for the largest number of compounds, so IC50 is the usual choice. However, to maximize the use of all of the data, a better idea may be to use the pChEMBL value, which combines the IC50 and Ki values into a single column by applying a negative logarithmic transformation, putting the values on a log scale.

6. **Standard Units**: the unit of the reported value, e.g. nanomolar (nM).

7. **pChEMBL Value**: the column to use as the *y* variable for machine learning. If you want to build a model as part of this research project, one scenario is to take the canonical SMILES, calculate molecular fingerprints (molecular descriptors) with PaDEL or PaDELPy, use all of the descriptors as the *X* variables, and use the pChEMBL value as *y*. A second scenario for comparison is to filter to rows whose standard type is IC50 and whose standard unit is nanomolar, and predict the raw standard value instead. That way we can compare a model that predicts the pChEMBL value against one that predicts IC50 values.
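The pChEMBL transformation is the negative log10 of the activity expressed in molar units; a small worked example for nanomolar inputs:

```python
import math

def pchembl_from_nM(value_nM: float) -> float:
    """pChEMBL value: -log10 of the activity converted from nM to molar units."""
    return -math.log10(value_nM * 1e-9)

print(round(pchembl_from_nM(9), 2))     # 8.05
print(round(pchembl_from_nM(70), 2))    # 7.15
print(round(pchembl_from_nM(1000), 2))  # 6.0
```

A 9 nM IC50 maps to about 8.05 and a 70 nM IC50 to about 7.15, close to the log-scale magnitudes stored in the dataset.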
8. **Target Pref Name**: the name of the target protein.

9. **BAO Label**: the BioAssay Ontology label, which describes the assay format of each record, for example whether the data point came from an "assay format" assay or a "single protein format" assay.

When building a model, you might want to separate the data by target protein: the ~71,000 rows are a mixture of different target proteins (shown in a moment), so it could be a good idea to filter each unique protein and build an individual model per target. There is also another approach, in which you theoretically use all of the data from all target proteins to build a single unified predictive model, called a proteochemometric model. I discussed this in my two-and-a-half-hour "Computational Drug Discovery 101" lecture; I will provide a link to that video in the description.

---

### Missing Values and Duplicates

Next, the notebook shows a bar plot of the missing versus non-missing data for the pChEMBL column, which, as mentioned, we will use as the *y* variable. Out of the ~71,000 rows, about 64,000 contain a pChEMBL value and are therefore usable; displaying only those rows, you will see the IC50 and Ki entries. Note, however, that some entries are redundant, i.e. duplicates.
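The usable-row tally can be computed directly; a minimal sketch, assuming the ChEMBL-style column name `pchembl_value`:

```python
import pandas as pd

def pchembl_availability(df: pd.DataFrame) -> tuple:
    """Return (rows that have a pChEMBL value, total rows)."""
    usable = int(df["pchembl_value"].notna().sum())
    return usable, len(df)

# Usage: usable, total = pchembl_availability(df)  ->  roughly (64000, 71973)
```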
For example, molecule CHEMBL777 appears more than once, with a discrepancy between the values: one entry has a standard (IC50) value of 9 and another has 70. Converted to the log scale, the pChEMBL values differ less dramatically, 8.05 versus 7.16, but there is still variance between the two entries. What we would normally do is aggregate by Molecule ChEMBL ID so that the IDs become unique, then remove any ID whose pChEMBL values have a standard deviation greater than 2. Concretely: compute the standard deviation of the pChEMBL values for CHEMBL777; if it is greater than 2, delete that molecule from the dataset entirely; if it is less than 2, keep the molecule but merge its duplicates into a single row by taking the mean (average) of the pChEMBL values.

To proceed, aggregate the data by applying the `groupby` function on the Molecule ChEMBL ID, which makes the IDs unique. Computing the per-group standard deviation also reveals which IDs hold redundant values: a numeric standard deviation means there are duplicates for that ChEMBL ID, whereas NaN means the entry is unique and has no duplicate.
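The aggregation rule just described might be sketched as follows, assuming the column names `molecule_chembl_id` and `pchembl_value`:

```python
import pandas as pd

def deduplicate_by_molecule(df: pd.DataFrame, std_cutoff: float = 2.0) -> pd.DataFrame:
    """Collapse duplicate molecules into one row; drop inconsistent molecules.

    A molecule whose pChEMBL standard deviation exceeds `std_cutoff` is
    removed entirely; otherwise its entries are merged via the mean.
    Singletons have a NaN standard deviation and are always kept.
    """
    stats = df.groupby("molecule_chembl_id")["pchembl_value"].agg(["mean", "std"])
    keep = stats["std"].isna() | (stats["std"] <= std_cutoff)
    out = stats.loc[keep, ["mean"]].rename(columns={"mean": "pchembl_value"})
    return out.reset_index()
```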
If there is no duplicate, we keep the row as-is; if there are duplicates, we apply the standard-deviation cutoff of 2 and, for molecules that pass, merge the redundant copies into a single row. The deduplicated rows are then merged back together with the already-unique rows (those whose standard deviation came out as NaN) into one dataset. Some of the notebook cells are just testing code that I created, so feel free to modify or delete them.

Another cell prints the number of unique ChEMBL IDs versus the total number: for the former, we select the Molecule ChEMBL ID column and apply the `unique` function; for the latter, we apply the `len` function to the data frame, which gives the total of ~71,000 rows. We also tested whether any ChEMBL IDs are missing, and there are none. One code cell, intended to plot a comparison of the unique versus the redundant entries, is unfinished; please feel free to help complete it and submit a pull request (PR) to the GitHub repo.

---

### Target Protein Distribution

Earlier I mentioned the different target proteins in this dataset. Plotting the top 50, the x-axis lists the target protein names and the y-axis shows the number of rows for each target protein, and the first target stands out dramatically.
The most abundant target, Beta-lactamase AmpC, contains more than 60,000 data samples. Recall that there are ~71,000 samples in total, so this single target accounts for roughly 80% of the entire dataset; it might be a good idea to compare this subset against the rest. There are other issues to consider as well: because the first target protein dominates while the others contribute only small portions, it would be interesting to investigate how such imbalance could be handled. Perhaps you have a unique approach for imbalanced datasets; I would love to see the creative solutions you come up with.

The notebook also prints the top standard types: roughly 60,000 rows are Potency measurements, while only a couple of thousand each are IC50, Ki, and Kd. Looking at the BioAssay Ontology (BAO Label) column, roughly 60,000 rows likewise carry the "assay format" label, presumably the same AmpC Potency data. Finally, a histogram of the pChEMBL column shows the distribution of its values.

---

### The Challenge

The challenge is to create a Jupyter notebook in which you perform an exploratory data analysis of this dataset. A useful step is to apply the PaDEL software, or the PaDELPy library in Python (I have made a video about this; the link is in the video description), to convert the SMILES notation into molecular fingerprints and run your exploratory analysis on those.
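The count plots described here all follow the same pattern; a minimal helper, assuming ChEMBL-style column names:

```python
import pandas as pd

def category_counts(df: pd.DataFrame, column: str, top: int = 50) -> pd.Series:
    """Rows per category, sorted descending, e.g. per target protein or standard type."""
    return df[column].value_counts().head(top)

# Usage sketches:
# category_counts(df, "target_pref_name")         # top target proteins
# category_counts(df, "standard_type")            # Potency, IC50, Ki, Kd, ...
# df["pchembl_value"].plot(kind="hist", bins=50)  # pChEMBL distribution
```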
One idea to explore is stratifying the dataset by bioactivity class: take the pChEMBL value and apply thresholds to convert it into a qualitative label. For example, a molecule with a pChEMBL value below 5 is labeled **inactive**; above 6, **active** (meaning it has good activity); and between 5 and 6, **intermediate**. You could then compare the three groups in your analysis: compare the molecular fingerprints computed for each group, the histograms of their pChEMBL values, or their molecular weights (you might discover whether active molecules tend to be large or small, and likewise for inactive ones). You could also explore other molecular properties, such as whether the solubility is polar or apolar for the active, inactive, and intermediate molecules. Please feel free to use your creativity in the exploratory data analysis.

When you are done, create a Jupyter notebook and upload it to GitHub: share the link in the comments below, or submit a pull request with your code, solution, or comments. I may also upload this dataset to Kaggle and provide the link in the video description; you are welcome to contribute notebooks there as well. Beyond exploratory analysis, you could also build machine learning models, for example a regression model that predicts the pChEMBL value.
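The thresholding above can be sketched as a small labeling function:

```python
def bioactivity_class(pchembl: float) -> str:
    """Map a pChEMBL value to a qualitative bioactivity label."""
    if pchembl < 5:
        return "inactive"
    if pchembl > 6:
        return "active"
    return "intermediate"

# Usage: df["bioactivity_class"] = df["pchembl_value"].apply(bioactivity_class)
```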
Alternatively, you could build a classification model that classifies molecules as inactive, active, or intermediate (multi-class classification), or a binary classifier comparing active versus inactive. Aside from PaDEL descriptors, you could also explore other representations of the molecules: convert each molecule into a molecular graph and apply deep learning, render it as an image and build a convolutional neural network, or represent it as a string of text and use a long short-term memory (LSTM) network. As you can see, there are many approaches you could take with this dataset, and I would love to see all the solutions you come up with; let's see how far we can take this open research project. I may also create future videos showcasing some of the solutions, which could serve as guidelines, ideas, or inspiration for other participants.

Finally, when we write this work up into manuscript form, the contributors who contribute the most to the project will be considered for co-authorship, on a merit basis. I hope this will be a fun initiative. If you like it and would like to help promote it, please give the video a like, share it, and subscribe to the channel for future updates. As always, the best way to learn data science is to do data science, and please enjoy the journey.