Becoming a Data Scientist (To PhD or not to PhD)

The Benefits and Challenges of Pursuing a PhD Degree in Biomedical Data Science

As a biomedical data scientist, I can attest that pursuing a PhD degree has been an invaluable experience in my career. During the PhD I was at times involved in two to three research projects simultaneously, either collecting my own data set or using one collected by a colleague who had done the experimental work. Either way, the work required domain knowledge to understand the biology behind the data set and to select appropriate features through feature engineering. Model building and the analysis that follows it matter just as much, because they show which features actually drive the predictions.

Domain Knowledge: A Crucial Aspect of Pursuing a PhD Degree

One of the key aspects of pursuing a PhD degree in biomedical data science is domain knowledge: a deep understanding of the biology behind the data set. Acquiring it is challenging, because it means reading a lot of literature and staying up to date with the field while also learning the computational concepts needed to analyze complex data sets. Domain knowledge is essential for feature engineering, that is, for deciding which variables to compute and feed to the model. In my GFP project, for example, representing the chromophore with two chemical structures instead of one, a change inspired by the literature, raised the correlation between experimental and predicted excitation and emission maxima from about 0.6 to above 0.9.
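
Feature engineering usually starts with turning raw biological data into numbers. As a minimal sketch, the sliding-window representation described in the transcript for DNA splice junction prediction (a two-base core with flanks of about three or five bases on each side) could look like the code below. The one-hot encoding, the function names, and the example sequence are assumptions made for illustration; the source does not say how the fragments were actually encoded.

```python
from typing import Iterator, List, Tuple

# One-hot codes for the four DNA bases.
# Assumption: the source does not state which encoding was used.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def sliding_windows(seq: str, flank: int = 3, core: int = 2) -> Iterator[Tuple[int, str]]:
    """Yield (start, fragment) for every window of length flank + core + flank."""
    width = 2 * flank + core
    # Move the window one base at a time along the sequence.
    for start in range(len(seq) - width + 1):
        yield start, seq[start:start + width]


def encode(fragment: str) -> List[int]:
    """Flatten a DNA fragment into a one-hot feature vector (4 values per base)."""
    return [bit for base in fragment for bit in ONE_HOT[base]]


if __name__ == "__main__":
    dna = "ACGTAGGTAAGTCC"  # made-up sequence, for illustration only
    for start, fragment in sliding_windows(dna):
        features = encode(fragment)
        print(start, fragment, len(features))
```

Each window becomes one row of features that a classifier can label as containing a splice junction site or not. Changing `flank` from 3 to 5 widens the sequence context the model sees.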

Post-Model Analysis: Understanding Feature Contributions

After building a model, it's essential to conduct post-model analysis to understand which features contributed to good or bad predictions, and why. In practice this means interpreting the importance of each feature in the fitted model rather than only reporting its accuracy. The analysis helps refine the model and, in my work, it also guided the experiments: once we knew which molecular features mattered, we could decide which ones to change when engineering GFP variants with different colors.
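
One way to make this kind of post-model analysis concrete is permutation importance: shuffle one feature at a time and measure how much the agreement between measured and predicted values drops. The sketch below is an illustration only. The source does not say which importance method was used, and the `model` function, the toy data, and the choice of the Pearson correlation coefficient as the score are assumptions made for the example.

```python
import random
from math import sqrt
from typing import Callable, List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def permutation_importance(
    model: Callable[[List[float]], float],
    X: List[List[float]],
    y: Sequence[float],
    n_repeats: int = 20,
    seed: int = 0,
) -> List[float]:
    """Mean drop in Pearson r when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = pearson_r([model(row) for row in X], y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j; all other columns stay untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - pearson_r([model(row) for row in shuffled], y))
        importances.append(sum(drops) / n_repeats)
    return importances


if __name__ == "__main__":
    # Hypothetical fitted model: it uses feature 0 and ignores feature 1.
    def model(row: List[float]) -> float:
        return 2.0 * row[0] + 1.0

    X = [[float(i), float((i * 7) % 5)] for i in range(8)]
    y = [2.0 * row[0] + 1.0 for row in X]
    print(permutation_importance(model, X, y))
```

A feature the model relies on shows a large drop in correlation when shuffled, while a feature the model ignores shows none. That ranking is the kind of information that can then be checked against the biology.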

The Value of Pursuing a PhD Degree

Pursuing a PhD degree in biomedical data science offers several benefits: dedicated time for research, independence in running projects from data collection through feature engineering to model building and interpretation, and a mentor who provides guidance and support throughout. During my PhD I worked on nine research projects, so the degree amounted to several years of hands-on, end-to-end project experience. The title matters less than that journey.

Flexibility and Time Management

One of the things I liked most about the PhD was its flexibility. Unlike undergraduate studies, where students progress as a batch and the schedule is set by semesters and exams, PhD students control their own schedule. The flip side is accountability: no one keeps tabs on you, so the challenge is to motivate yourself and plan wisely enough to finish in a reasonable time, which can range from three to eight years.

Accomplishments and Reflections

Throughout my 14 years in academia, I've been fortunate to supervise roughly 40 to 50 students: almost 10 PhD students, about 10 master's students, and 20 to 30 undergraduates. Each student pursued an independent research project, which gave me opportunities to learn from them and expand my knowledge of the field. I'm grateful for the experience and the relationships I've formed during this period.

The Role of Mentors

Mentors play a significant role in the success of PhD students. They provide guidance, support, and accountability throughout the project. Having a mentor can help navigate the challenges of pursuing a PhD degree, such as staying motivated and managing time effectively.

Personal Reflections on Pursuing a PhD Degree

If I hadn't pursued a PhD degree, I'm not sure I would have met the co-adviser who introduced me to the world of data mining, now called data science. I'm forever grateful to my PhD advisers and co-advisers. Pursuing a PhD degree has been an invaluable experience that has shaped my career as a biomedical data scientist.

Conclusion

In conclusion, pursuing a PhD degree in biomedical data science gave me dedicated time, independence across whole research projects, and practice in feature engineering, model building, and interpretation. It comes with its own challenges, chiefly staying motivated and managing time without an external schedule.

Ultimately, whether or not one benefits from pursuing a PhD degree depends on their unique circumstances and work environment. If you have the opportunity to pursue independent research projects with full control over the data mining or data science project, then the experience gained might be equivalent to completing a PhD degree. Additionally, having a mentor can provide guidance and support throughout the project.

As I look back on my 14 years in academia, I'm grateful for the experience and the relationships I've formed along the way, and for a PhD that shaped my career as a biomedical data scientist.

Welcome back to the Data Professor YouTube channel. If you're new here, my name is Chanin Nantasenamat and I'm an associate professor of bioinformatics. On this YouTube channel we cover data science concepts and practical tutorials, so if you're into this type of content, please consider subscribing. Are you thinking of pursuing a PhD degree, the Doctor of Philosophy program? Or you might be wondering whether the PhD degree is of any benefit to your data science journey or data science career. In this video I'm going to discuss that point, so without further ado, let's get started.

Before we begin with the topic of this video, let me bring you back to the time when I was in junior high school, in the eighth grade. I participated for the first time in the Science Olympiad, at the regional level meeting of the Los Angeles district in California, and I took part in two events. The first one was about building a bottle rocket. The goal of this competition was to use a plastic Coke bottle to build a rocket that is filled with water. You would pump water inside the bottle rocket, and at the head of the bottle you would hide a parachute. The release of the high pressure inside the bottle rocket propels it up in the air, and as it comes down the cap of the rocket comes off, revealing the parachute, which expands. The goal is to have the bottle rocket stay in the air for as long as possible, so the team with the longest air time would win. Our team came in sixth place. The second competition was to build a wooden tower bridge spanning two tables. Right at the middle of the tower we tie a string, the string is tied to a bucket, we put sand into the bucket, and the
tower that is able to hold the most would win. Our team came in third place, so we got the bronze medal for that event. It turned out that our team was selected to compete at the state level, which they call the state finals, so we were representing Los Angeles against other cities in the state of California. Unfortunately, at the state finals we did not win any competition, so we did not advance further. Regardless of the outcome, the ambiance of science and the fascination that science has to offer continued to spark my interest in science.

I returned to Thailand and pursued a Bachelor of Science degree with a major in biomedical science. As part of the requirements of the bachelor's degree we had to carry out a small thesis project, and mine was about studying the diurnal migration patterns of copepods, which are kind of like small shrimp. During the daytime they stay hidden in the depths of the water, and at night they rise up to the surface and thrive. This research project was funded by our university, and we were given the chance to present the results at an international conference on copepods. This was all made possible by my major advisor. For the first time I was given the opportunity to prepare a poster presentation, and it was my first exposure to the scientific community, where I could present my research findings to other scientists in the field. These scientists are the ones who published the papers, in books and in prestigious scientific journals, that I would read, and I would see them in person at the conference. Being able to talk to the original authors of the papers that I had read was an awesome experience. After graduating with the Bachelor of Science degree, I decided to pursue the PhD degree
or the Doctor of Philosophy degree program. I did not start out in data science. My PhD research thesis was about the protein engineering of the green fluorescent protein (GFP) from the jellyfish. The goal of the study was to engineer the GFP to be able to bind to metal ions and, in response, change its fluorescence properties. In particular, when the GFP binds to the metal ion, its fluorescence intensity declines, which is called the quenching effect. In order to create such an engineered GFP, I had to study the effects of mutations, particularly mutations of amino acids. In nature there are a total of 20 amino acids, and we used computer software such as PyMOL, which is a molecular visualization software, to visualize the crystal structure of the GFP and select residues that we believed to be responsible for the fluorescence property. In three dimensions the amino acids interact with other amino acids inside the protein, so the mutation of key amino acids allows us to modulate the fluorescence property. At this time I had not yet been involved with data science; that part comes soon.

At the end of the first year of my PhD study I had to present my findings at an international conference, and it was at this conference that I met another researcher who had just graduated with a PhD degree on data mining from the US. My PhD advisor was acquainted with this researcher, and we had a discussion about my research. Eventually we networked, and after the conference I met the researcher again; he later became the co-advisor for my research thesis. At the time he gave me a book about neural networks in chemistry, and it was the first time that I had been exposed to the concepts of neural networks, data mining, and making predictions and classifications. This was back
in the year 2004. In that year my co-advisor gave me a data set for DNA splice junction site prediction. Essentially, the goal of this project was to slide through a DNA sequence and, upon reaching the splice junction site, the site at which the DNA will be cut, look at the environment of the DNA sequence. We have a window at the middle of the sequence that we're looking at, with a length of two nucleotide bases, and flanking regions at the left and right of about three or five nucleotide bases. This imaginary window slides through the DNA sequence one position at a time in order to generate many possible DNA fragments to be predicted, and each slide of the window generates one feature representation. We then use machine learning to classify whether a particular DNA fragment contains a DNA splice junction site or not. The result was published as a paper in 2005, and it was my first publication on applying data mining to predict DNA splice junction sites. Notice that back in 2005 this field was called data mining; another popular term was knowledge discovery in data, and the go-to resource was KDnuggets, which is an excellent source for data science in this day and age as well.

During the course of the PhD degree I was involved with a total of nine research projects, each with its own challenges, issues, and problems. Each of the nine research projects was in the domain of bioinformatics, the field I worked in starting from the second year of my PhD. Essentially, it is the application of computational approaches, including data science and data mining, to solving biological problems. It was very fun and very fascinating, and the fun lies in the opportunity to apply computational approaches;
particularly, I was using data mining to try to understand and make sense of biological data. Recall that during the first year of my PhD study I was working in protein engineering, where I engineered a mutant GFP. For the second project of my PhD study I applied data mining to predict the colors of GFP variants, which no one had done before. What I did was collect research articles discussing the engineering of different GFP color variants, and record which amino acids were mutated and, upon mutation, what the excitation wavelength and the emission wavelength were. Excitation essentially means that you shoot light at the protein; the protein absorbs the light and becomes excited, and during the excited phase it emits energy and, as a result, color. This process of receiving a photon of light and then emitting energy and color made for a fascinating project. I described the GFP in terms of the mutated amino acids, and I described the amino acids using quantum mechanics, or computational chemistry, whereby the chemical functionality of the GFP chromophore can be represented in numerical form as a set of molecular descriptors, or molecular features. Essentially, we're transforming the GFP molecule into a set of numerical features that we can use as input to the data mining model to predict the excitation wavelength, and also the emission wavelength, given the quantum mechanical descriptors. This project resulted in about two to three papers. Another research project in my PhD thesis applied computational chemistry to compute the molecular features of molecules and then applied data mining to predict the antioxidant activity
of a chemical library whose members have the capability of becoming antioxidant molecules.

So during the course of my PhD degree I was involved with a total of nine research projects, and each one required understanding of the biological domain, because we had to figure out which molecular features to select and what to model or predict. It required reading a lot of literature and becoming acquainted with the domain knowledge and, on the other end, understanding computational concepts in order to apply appropriate machine learning algorithms and do appropriate data cleaning, data curation, and data pre-processing. Back at the time I was not using any programming language. It was purely text editors such as UltraEdit or Notepad++, Microsoft Excel and SPSS for the statistical analysis, and the Weka data mining software to build data mining models such as neural networks, decision trees, multiple linear regression, and support vector machines. All of these were built using the graphical user interface of Weka. After using this for quite some time, it became quite burdensome to optimize parameters by manually clicking through different step sizes. Imagine that for a support vector machine with the radial basis function kernel you want to optimize the C parameter and the gamma parameter. If you try ten possible values for C and another ten for gamma, the 10 by 10 grid gives you 100 pairs of parameters to evaluate, and each calculation took anywhere from about one hour up to 24 hours depending on the size of the data set. In order to finish the project we had to do several runs like this. Back at the time I was fortunate to have access to the computer lab over the weekend, so over the weekend I would be running calculations using different parameters for the
support vector machine or for a neural network calculation, or with different seed numbers. Imagine using 50 computers in the computer lab to run the calculations and, after they complete, having to manually copy and paste the results into a text file, copy that onto a thumb drive or USB drive, and then consolidate all of the information. I remember using the macro feature of the text editor to pre-process the text results. All of these data mining projects were done without the use of a programming language such as R or Python, because for a biology major back at the time, learning programming seemed very scary. It seemed like a formidable task, something that was out of reach. So at the time I selected the easy path, which, thinking about it retrospectively, might not have been so easy: using only GUI software. If I could turn back time, I would probably tell my 2005 self to start learning a programming language because of the amazing benefits it has in helping to analyze big data sets and to perform automated, programmatic pre-processing of data and model building in an optimal, time-efficient way.

So you can see that during the four years spent on the PhD degree, all nine research projects were about data mining, and all of them required understanding the domain knowledge of biology and translating it into meaningful molecular features that we could use to construct predictive models. After constructing a predictive model, we would try to extract knowledge from it and interpret the importance of the features that influenced the prediction. Once we understood which features were important, we would use that to guide the experimental process in order to engineer the GFP in a different way, depending on which molecular feature we wanted to control in order to bring about the different colors.

So, thinking back,
how did doing a PhD help in my data science career as an associate professor of bioinformatics? The PhD degree gave me dedicated time to do nine research projects. I guess if you're able to do data science projects without pursuing a PhD degree, and I mean not only reproducing toy data sets but actually and rigorously being involved in, or creating, a data science project from start to finish, with a firm understanding of all the features involved in the model building and its interpretation afterward, then perhaps it would be equivalent to a PhD degree. The thing about a PhD degree is not the title that we receive after completing it; the most important part of all is the journey. It is this journey that taught me about project management: how to split up a research project into fragments such as data collection, data pre-processing, data understanding, and exploratory data analysis, and how to do due diligence by digging into the literature, downloading thousands of research articles, scanning the results and discussion sections, looking at all the results tables from the papers, and compiling data sets manually. Thinking retrospectively, if I had had knowledge of either R or Python it would have been so much fun, but it was still fun nevertheless. I was manually copying data from the PDF files and recording the numbers in Excel. The whole process, from the beginning to the end of the research project, was done by myself: reading the scientific literature by downloading thousands of research articles, scanning through all of those, collecting the data set manually and, after collecting the data set, figuring out how to represent the features for what we want to predict and making the prediction. If the prediction was not good, then I would hit the books and read more of the literature to understand what went wrong and which features we were missing that made the prediction poor, and then figure out which
features to calculate. As you can see, it's about feature engineering, and once we had optimal features we saw a dramatic increase in the prediction performance. For example, in the GFP project the initial model gave a correlation coefficient of about 0.6 between the experimental and predicted excitation and emission maxima. So I hit the books, read the research articles, and discovered that the GFP chromophore has two anionic states, meaning that the structure is different. Initially the chromophore was represented by one structure, so what I did was represent it by two different structures. The GFP has two excitation peaks, a major peak at 395 nanometers and a minor peak at 408 nanometers. Depending on whether a GFP chromophore compiled from the papers had its excitation maximum at 395 or 408, I would draw the chemical structure differently: one chemical structure for those at 395 and a different one for those at 408. This tweak in the molecular features, feature engineering inspired by reading the literature, led to a dramatic increase in the prediction performance, with the correlation coefficient between the experimental and predicted values of the excitation and emission maxima in excess of 0.9. So as you can see, the performance increased from 0.6 to 0.9. This did not happen in a day; it happened over the course of about four months for this research project.

Each research project was done project by project. The completion of one project would result in preparing the manuscript, sending it off for publication, and then starting the next project. Sometimes I might be involved in two to three research projects at the same time, either collecting my own data set or using a data set collected by a fellow
colleague who had done the experimental work, and then we would use their data set to make the prediction. We have to have domain knowledge, try to understand the biology behind the data set, select appropriate features by performing appropriate feature engineering, and use them for model building. After the model has been built, we perform post-model analysis to try to make sense of the features and why they contributed to a good or bad prediction.

So, thinking retrospectively, did the PhD degree help me as a biomedical data scientist? I would definitely say yes, but this is only based on my own story, and everyone has their own unique circumstances and story. In conclusion, will you benefit from doing a PhD? It really depends on your unique circumstances. Suppose your work environment right now allows you to pursue independent research projects, something where you have full control over the entire data mining or data science project, from data collection to feature engineering, model building, model optimization, and deployment, and perhaps writing a technical report of some sort or presenting the findings at conferences or to clients. Then the experience that you gain during the course of these research or industrial projects would probably be equivalent to doing a PhD degree.

Another good part of doing a PhD degree is the mentor. The mentor, or PhD advisor, has an important and significant role in the success of your PhD degree. The mentor is kind of like the coach of a basketball team, the manager of a baseball team, or a trainer: your coach will keep you accountable for the progress of your data science or research project. Another thing that I liked about doing the PhD degree was the dedicated time that you have, and the flexibility is immense compared to an undergraduate degree. You don't have to wake up early like in the undergraduate days; you could wake up
anytime you want, as long as you take responsibility, right? No one is keeping tabs on you, and you can enter the lab at whatever time you want, but the important thing is that you have to make progress. The most challenging part is being accountable for the progress of your research project. When we're undergraduates we study as a batch, right? You and your friends take the midterm exam at the same time and the final exam at the same time, the semester begins and ends on defined dates, and you have vacation. Doing a PhD is much more flexible; every day is like the same. So the challenge is motivating yourself and planning wisely so that you can complete your PhD degree in a reasonable amount of time, because the time span varies from three years up to eight years. Luckily, I was able to complete my PhD degree in four years. I received my PhD degree when I was 24 and my bachelor's degree when I was 20, partly because I took the California High School Proficiency Examination, which is the equivalent of a high school diploma. I took that when I was 16, so I entered college at 16, and at 24 I continued my journey further in academia and in bioinformatics research.

Time really flies: it's now my 14th year in academia, and over the course of these 14 years I was fortunate to be given the chance to supervise very bright PhD students, master's students, and undergraduate students. Altogether I think it's more than 40 to 50 students: almost 10 PhD students, about 10 master's students, and more than 20 to 30 undergraduate students. I was also host to several international students coming from Sweden, Germany, the United States, and China. Aside from making good friends, I was also given the chance to learn from the students, because each student pursued an independent research project, and so I learned from
all of their projects and this has given me more chance and opportunity to further expand my knowledge of the field by learning from students as well as fly learning from the research findings that our project took us to and so personally if I did not pursue the PhD degree I'm not sure whether I would be acquainted with my coy adviser who introduced me to this wonderful world of data mining and now called data science and so I'm forever grateful to my PhD advisors and coy advisors and so I hope that this video brings Valle and if you find it useful please give it a thumbs up and if you haven't yet subscribed please subscribe to the channel for more awesome contents on data science and so it's always the best way to learn data science is to do data science and please enjoy the journey thank you for watching please like subscribe and share and I'll see you in the next one but in the meantime please check out these videoswelcome back to the data professor YouTube channel if you new here my name is Shannon nontox and Ahmad and I'm an associate professor of bioinformatics on this YouTube channel we cover about data science concepts and practical tutorials so if you're into this type of content please consider subscribing are you thinking of pursuing a PhD degree or the Doctor of Philosophy program or you might be wondering whether the PhD degree is of any benefit to your data science journey or data science career and so in this video I'm going to discuss about that point and so without further ado let's get started so before we begin to talk about the topic of this video so let me bring you back to the time when I was in junior high school when I was in the eighth grade so I participated for the first time in the Science Olympiad so in the Science Olympiad that we participated in we participated at the regional level meeting at the Los Angeles District in California and at the time I participated in two events so the first one was about building a bottle rocket and so 
the goal of this competition was to use a plastic coke bottle to build a rocket which is filled by water and so you would pump water inside the bottle rocket and at the head of the bottle you would hide a parachute inside and so the release of the high pressure inside the bottle rocket will make it propel up in the air and as it goes down the cap of the rocket will come off revealing the parachute which will expand and so the goal of this competition is to have the bottle rocket stay in the air for as long as possible and so the team with the longest air time would win and so our team came in sixth place and so the second competition was to build a wooden Tower Bridge whereby we will build this tower bridge to be between two table and right at the middle of the tower we're going to tie a string and the string will be tied to a bucket and so we will put in sand inside the bucket and the tower that is able to hold the most we'll win and so our team came in third place and so we got the bronze medal for that event and so it turns out that our team was selected to compete at the state level and so they call this the state finals and so we're representing Los Angeles to compete at the state level which will comprise of other cities in the state of California but unfortunately at the state final we we did not win any competition and so we did not advance further so irregardless of the outcome of this the ambiance of science and the fascination the science has to offer has continued to spark my interest in science and so I returned back to Thailand and I pursued a Bachelors of Science degree with a major in biomedical science and as part of the requirement of the bachelors degree we had to carry out a small thesis project and the thesis project was about studying the diurnal migration patterns of copepods which is kind of like a small shrimp and we're studying the diurnal patterns that these copepods has and so during the daytime they will be hidden in the depth of the 
water and at night they would rise up to the surface level and thrive and so the results of this research project was funded by our University and we were given the chance to present at an international conference on copy pot and this is all made possible by my major advisor and so for the first time I was given the opportunity to prepare a poster presentation and in doing so it was my first exposure to the scientific community whereby I was given the opportunity to present my research findings to other scientists in the field so these scientists are the ones who published papers in books in prestigious scientific journals that I would read about and I would see them in person at the conference and so being able to talk to the original authors of the paper that I have read was an awesome experience and so after graduating the Bachelors of Science degree I decided to pursue the PhD degree or the Doctor of Philosophy degree program and so did not first started out in data science so the PSD research thesis was about the protein engineering of the green fluorescent protein from the jellyfish and so the goal of the study was to engineer the GFP protein to be able to bind to metal ions and in response to that it will have changes to their fluorescence property so particularly the GFP protein will bind to the metal ion and in response to that the fluorescence intensity of the GFP will decline and so this is called the quenching effect and so in order to create such a engineered GFP protein I have to study about the effects of the mutation particularly the mutations of amino acids and so in nature there are a total of 20 amino acids and we use computer software such as pi mo which is a molecular visualization software to be able to visualize the crystal structure of the GFP protein and select residues that we believe to be responsible for the fluorescence property and so in three-dimension the amino acids are interacting inside with other amino acids and so the mutation 
of key amino acids allows us to modulate the fluorescence property. At this time I had not yet been involved with data science, but that part comes soon.

At the end of the first year of my PhD study, I had to present my findings at an international conference. It was at this conference that I met another researcher who had just completed his PhD degree in the US on data mining. My PhD advisor was acquainted with this researcher, and we had a discussion about my research. Eventually we networked, and after the conference I met with the researcher again, who later became my co-advisor for my research thesis. At the time he gave me a book about neural networks in chemistry, and it was the first time that I had been exposed to the concept of neural networks, to data mining, and to making predictions and classifications. This was back in 2004.

In that year my co-advisor gave me a data set on DNA splice junction site prediction. Essentially, the goal of this project was to slide through a DNA sequence and, upon reaching the splice junction site, the site at which the DNA will be cut, look at the environment of the DNA sequence. We have a window at the middle of the sequence we're looking at, with a length of two nucleotide bases, and flanking regions to the left and right of about three or five nucleotide bases. This imaginary window slides through the DNA sequence one base at a time in order to generate many possible DNA fragments to be predicted. Each slide of the window generates one set of feature representations. We then use machine learning to classify whether a particular DNA fragment contains a DNA splice junction site or not. The result was published in 2005, and it was my first publication on applying data mining to predicting DNA splice junction sites.
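To make the sliding window a bit more concrete, here is a minimal Python sketch of how such fragments could be generated. This is only an illustration: the original work did not use a programming language, and the function names, the one-hot encoding, and the example sequence are all assumptions made up for this sketch, not the method from the 2005 paper.

```python
from typing import List, Tuple

NUCLEOTIDES = "ACGT"  # assumed encoding order for the illustration


def sliding_windows(sequence: str, core: int = 2, flank: int = 3) -> List[Tuple[int, str]]:
    """Slide a window of length flank + core + flank along the sequence, one base at a time.

    Returns (start_index, fragment) pairs. The 'core' is the two-base site in the
    middle of the window; 'flank' is the number of bases on each side.
    """
    width = flank + core + flank
    sequence = sequence.upper()
    return [(i, sequence[i:i + width]) for i in range(len(sequence) - width + 1)]


def one_hot(fragment: str) -> List[int]:
    """Turn each base into four 0/1 values (A, C, G, T): one hypothetical feature representation."""
    features: List[int] = []
    for base in fragment:
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


if __name__ == "__main__":
    dna = "ATGGTAAGTCCA"  # made-up sequence, not real data
    for start, fragment in sliding_windows(dna, core=2, flank=3):
        core_bases = fragment[3:5]  # the two bases in the middle of the window
        print(start, fragment, core_bases, len(one_hot(fragment)))
```

Each fragment, once encoded, would be one row of input for a classifier that decides whether the window contains a splice junction site. Changing `flank` from 3 to 5 gives the wider window mentioned above.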
Notice that back in 2005 this field was called data mining, or another popular term was knowledge discovery in databases (KDD). The go-to resource was KDnuggets, which is also an excellent source for data science in this day and age.

During the course of the PhD degree I was involved with a total of nine research projects, and each project had its own challenges, issues, and problems. Each of the nine research projects was in the domain of bioinformatics. Starting from the second year of my PhD, I was working in the field of bioinformatics, which is essentially the application of computational approaches, including data science and data mining, to solving biological problems. It was very fun and very fascinating. The fun lies in the opportunity to apply computational approaches, in my case data mining, to try to understand and make sense of biological data.

Recall that during the first year of my PhD study I was working in protein engineering, where I engineered a mutant GFP protein. For the second project of my PhD study, I applied the concept of data mining to predict the colors of GFP proteins, which no one had done before. What I did was collect research articles discussing the engineering of different GFP color variants. I would collect the information about which amino acids were mutated and, upon mutation, what the excitation wavelength and the emission wavelength were. For the excitation wavelength, you essentially excite the GFP protein by shining light on it. The protein accepts the light and becomes excited, and during the excited phase it emits energy and, as a result, also emits color. The process of receiving a photon of light, emitting energy, and as a result emitting color made this a fascinating project to be involved
with, whereby I described the GFP protein in terms of its mutated amino acids. I described the amino acids using quantum mechanics, or computational chemistry, whereby the chemical functionality of the GFP chromophore can be represented in numerical form as a set of molecular descriptors, or molecular features. Essentially, we're transforming the GFP molecule into a set of numerical features that we can use as input to the data mining model, to predict the excitation wavelength given the quantum mechanical descriptors and also the emission wavelength given the quantum mechanical descriptors. This project resulted in about two to three papers.

Another research project in my PhD thesis was on applying computational chemistry to compute the molecular features of molecules and then applying data mining to predict the antioxidant activity of a chemical library of molecules with the potential to act as antioxidants.

Each of the nine research projects during my PhD required an understanding of the biological domain, because we had to figure out which molecular features to select for modeling and prediction. That requires reading a lot of literature and becoming acquainted with the domain knowledge. On the other end, it requires understanding computational concepts in order to apply the proper machine learning algorithms and to do appropriate data cleaning, data curation, and data pre-processing.

Back at the time I was not using any form of programming language. It was purely text editors such as UltraEdit or Notepad++, Microsoft Excel and SPSS for the statistical analysis, and the WEKA data mining software for building data mining models such as neural networks, decision trees, multiple linear regression, and support vector machines. All of this was built using the graphical user interface of WEKA. After using this for
quite some time, it became burdensome to optimize parameters by manually clicking through different step sizes. Imagine that for a support vector machine with the radial basis function kernel you want to optimize the C parameter and the gamma parameter. If you want to try ten possible values for the C parameter and another ten possible values for the gamma parameter, then a 10 by 10 grid gives you 100 pairs of parameters to evaluate. Now imagine that each calculation takes about one hour, or up to 24 hours, depending on the size of the data set. In order to finish the project we had to do several runs like this.

Back at the time I was fortunate to have access to the computer lab over the weekend. Over the weekend I would run calculations using different parameters for the support vector machine or for a neural network, or with different seed numbers. Imagine using 50 computers in the computer lab to run the simulations, and after the simulations were complete, having to manually copy and paste the results into a text file, copy that onto a thumb drive or USB drive, and then consolidate all of the information. I remember using the macro feature of the text editor in order to pre-process the text results.

All of these data mining projects were done without the use of a programming language such as R or Python, because for a biology major back at the time, learning programming seemed very scary. It seemed like a formidable task, something that was out of reach. So at the time I selected the easy path of using only GUI software, which might not have been so easy if we think of it retrospectively. If I could turn back time, I would probably tell my 2005 self to start learning a programming language because of the amazing benefits that it has in helping to analyze big data sets or perform automated and programmatic
pre-processing of data and model building in an optimal, time-efficient way.

So you can see that during the four years I spent on the PhD degree, all nine research projects were about data mining, and all of them required understanding the domain knowledge of biology and trying to translate that into meaningful molecular features that we could use to construct predictive models. After constructing a predictive model, we would try to extract knowledge out of the model and interpret the importance of the features that influenced the prediction. Once we understood which features were important, we would use that to guide the experimental process in order to engineer the GFP in a different way, depending on which molecular feature we wanted to control in order to bring about the different colors.

Thinking back, how did doing a PhD help my data science career as an associate professor of bioinformatics? The PhD degree provided me dedicated time to do nine research projects. I guess if you're able to do data science projects without pursuing a PhD degree, and I mean not only reproducing toy data sets but actually and rigorously being involved in or creating a data science project from start to finish, with a firm understanding of all the features involved in the model building and its interpretation afterward, then perhaps that would be equivalent to a PhD degree.

The thing about a PhD degree is not the title that we receive after completing the degree. The most important part of all is the journey. It is this journey that taught me about project management: how to split up a research project into fragments, for example data collection, data pre-processing, data understanding, exploratory data analysis, and doing due diligence by digging into the literature, downloading thousands of research articles, scanning the results and
discussion sections, looking at all the results tables in the papers, and compiling data sets manually. Thinking retrospectively, if I had had knowledge of either R or Python it would have been so much fun, but nevertheless it was still fun. I was manually copying data from PDF files and recording the numbers in Excel.

The whole process, from the beginning to the end of a research project, was done by myself: reading the scientific literature by downloading thousands of research articles, scanning through all of those, collecting the data set manually, and after collecting the data set, figuring out how to represent the features, deciding what we want to predict, and making the prediction. If the prediction was not good, I would hit the books and read more of the literature to understand what went wrong, which features we were missing that made the prediction poor, and which features to calculate. So as you can see, it's about feature engineering. Once we had optimal features, we saw the increase in prediction performance.

Take the GFP project, for example. The initial model provided a correlation coefficient between the experimental and predicted excitation maxima and emission maxima of about 0.6. So I hit the books, read the research articles, and discovered that the GFP chromophore has two anionic states, meaning that the structure is different. Initially the GFP chromophore was represented by one structure, so what I did was represent it by two different structures depending on the excitation maximum. The GFP has two peaks, one at 395 and one at 408, where the 395 nanometer peak was the major peak and 408 was a minor peak. Depending on whether the GFP chromophore compiled from the paper had its excitation maximum at 395 or 408, I would draw the chemical structure differently this time, whereby the ones at 395 would be given one chemical structure and the ones at 408 would be given
a different chemical structure. This tweak in the molecular features, by means of feature engineering inspired by reading the literature, led to a dramatic increase in prediction performance: the correlation coefficient between the experimental and predicted values of the excitation and emission maxima rose to more than 0.9. So as you can see, the performance increased from 0.6 to 0.9. This did not happen in a day. It happened over the course of about four months for this research project.

The research was done project by project. The completion of one project would result in preparing the manuscript, sending it off for publication, and then starting the next project. Sometimes I might be involved in two to three research projects at the same time, either collecting my own data set or using a data set collected by a fellow colleague who had done the experimental work, and we would then use their data set to make the prediction. We have to have domain knowledge, try to understand the biology behind the data set, and then try to select appropriate features by performing appropriate feature engineering and use those for model building. After the model has been built, we perform post-model analysis to try to make sense of the features and why they contributed to a good or bad prediction.

So thinking retrospectively, did the PhD degree help me as a biomedical data scientist? I would definitely say yes, but this is only based on my own story, and everyone has their own unique circumstances and story.

In conclusion, will you benefit from doing a PhD? It really depends on your unique circumstances. If your work environment right now allows you to pursue independent research projects, something where you have full control over the entire data mining or data science project, from data collection to feature engineering to model building, model optimization, deployment, and
perhaps writing a technical report of some sort and presenting the findings at conferences or to clients, then the experience you gain over the course of these research or industrial projects would probably be equivalent to doing a PhD degree.

Another good part of doing a PhD degree is the mentor. The mentor, or PhD advisor, has an important and significant role in the success of your PhD degree. The mentor is kind of like the coach of a basketball team, the manager of a baseball team, or a trainer. Your coach will keep you accountable for the progress of your data science or research project.

Another thing that I liked about doing the PhD degree was the dedicated time that you have. The flexibility is immense compared to an undergraduate degree. You don't have to wake up early like in your undergraduate days. You can wake up anytime you want, as long as you take responsibility. No one is keeping tabs on you, and you can enter the lab at whatever time you want, but the important thing is that you have to make progress. The most challenging part is being accountable for the progress of your research project. When we're undergraduates we study as a batch: you and your friends take the midterm exam at the same time and the final exam at the same time, the semester begins and ends at defined dates, and you have vacation. Doing a PhD is much more flexible, and every day is like the same. So the challenge is motivating yourself and planning wisely so that you can complete your PhD degree in a reasonable amount of time, because the time span varies from three years up to eight years.

Luckily I was able to complete my PhD degree in four years. I received my PhD degree when I was 24 and my bachelor's degree when I was 20. This was partly because I took the California High School Proficiency Examination, which is the equivalent
of a high school diploma. I took that when I was 16, and so I entered college at 16. At 24 I continued my journey in academia and went further into bioinformatics research.

Time really flies. It's now my 14th year in academia, and over the course of these 14 years I have been fortunate to be given the chance to supervise very bright PhD students, master's students, and undergraduate students. Together I think that is more than 40 to 50 students: almost 10 PhD students, about 10 master's students, and more than 20 to 30 undergraduate students. I was also host to several international students coming from Sweden, Germany, the United States, and China. Aside from making good friends, I was also given the chance to learn from the students, because each student pursued an independent research project and I learned from all of their projects. This has given me more opportunity to expand my knowledge of the field, by learning from the students as well as from the research findings that our projects led us to.

Personally, if I had not pursued the PhD degree, I'm not sure whether I would have become acquainted with my co-advisor, who introduced me to this wonderful world of data mining, now called data science. I'm forever grateful to my PhD advisors and co-advisors.

I hope that this video brings value. If you find it useful, please give it a thumbs up, and if you haven't yet subscribed, please subscribe to the channel for more awesome content on data science. As always, the best way to learn data science is to do data science, and please enjoy the journey. Thank you for watching. Please like, subscribe, and share, and I'll see you in the next one. In the meantime, please check out these videos.