The Machine Learning Model Building Process: A Comprehensive Overview
A typical tabular data set, such as a pandas DataFrame or an R data frame, looks much like a spreadsheet in Microsoft Excel or Google Sheets: each row is a data sample and each column is a variable.
The first column holds variable x1, the second holds variable x2, and so on, followed by variable y. The x variables are the input variables, also known as independent variables, while y is the output variable, also known as the dependent variable. In a typical data science project, pairs of input variables and their corresponding output variable are given to a learning method, which builds a model by finding the relationship between them.
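As a rough illustration of the layout described above, the sketch below builds a tiny data frame with pandas and separates the inputs from the output. The column names and values are made up for the example and do not come from any real data set.

```python
import pandas as pd

# Hypothetical toy data: three input variables (x1..x3) and one output variable (y).
data = pd.DataFrame(
    {
        "x1": [1.0, 2.0, 3.0, 4.0],
        "x2": [0.5, 0.1, 0.9, 0.3],
        "x3": [10, 20, 30, 40],
        "y": [10.0, 15.0, 20.0, 25.0],
    }
)

# Separate the input (independent) variables from the output (dependent) variable.
X = data.drop(columns="y")
y = data["y"]

print(data)
print(X.shape, y.shape)  # (4, 3) (4,)
```

Keeping X and y as separate objects is a common convention, since most learning methods expect the inputs and the output to be passed separately.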
For example, the relationship can be written as y = f(x), meaning that y is a function of x. The function can be a linear regression equation such as y = mx + b, where m is the slope and b is the intercept. Take y = 5x + 5: if x equals 1, then 5x is 5 multiplied by 1, which gives 5, and adding the intercept of 5 gives 10. Therefore, y equals 10 given x equals 1.
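The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names `linear_model` and `fit_line` are hypothetical, and the least-squares fit is included to suggest how a learning method might recover the slope and intercept from example pairs of x and y.

```python
def linear_model(x, m=5.0, b=5.0):
    """Return y = m * x + b for a single input value x."""
    return m * x + b


def fit_line(xs, ys):
    """Estimate slope m and intercept b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


print(linear_model(1))  # 10.0

# Generate a few (x, y) pairs from y = 5x + 5, then recover m and b from them.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [linear_model(x) for x in xs]
m, b = fit_line(xs, ys)
print(m, b)  # 5.0 5.0
```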
To build a machine learning model, we first need to split the data. Data splitting lets us use one portion of the data set, the training set, to build a prediction model, and another portion, the test set, to evaluate the model's performance. The data can be split into two or more portions, depending on the methodology. A simple example is an 80/20 split: from 100 data points, 80 go to the training set and 20 go to the test set. The training set is used to create the prediction model, and it can also be used to build a cross-validation model, for example in n-fold cross-validation.
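The 80/20 split can be sketched with the standard library alone. The helper name `split_indices` and the fixed seed are assumptions made for this example. In practice, libraries such as scikit-learn provide a ready-made `train_test_split` function for the same purpose.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle sample indices and split them into training and test portions."""
    indices = list(range(n_samples))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return train_idx, test_idx


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))  # 80 20
```

Shuffling before splitting avoids putting, say, only the last rows of an ordered data set into the test set.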
When building a prediction model, we can choose among several learning algorithms, including support vector machines, deep learning, gradient boosting, k-nearest neighbors, decision trees, and random forest. Each of these algorithms has hyperparameters that govern the prediction process and need to be optimized. For example, in random forest, hyperparameters that can be optimized include the number of trees and the maximum number of features to use. Additionally, we may decide to perform feature selection, which reduces the number of features from hundreds or thousands to a smaller, more manageable subset.
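One common way to carry out the hyperparameter optimization mentioned above is a grid search with cross-validation. The sketch below assumes scikit-learn is available and uses a synthetic data set. The candidate values in `param_grid` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic classification data standing in for a curated data set.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate values for the number of trees and the maximum features per split.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", 0.5],
}

# 5-fold cross-validation on the training set only; the test set stays untouched.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the best model on the test set
```

Tuning on the training set and scoring once on the held-out test set keeps the final performance estimate independent of the tuning process.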
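One way to reduce the number of features is Principal Component Analysis (PCA), which reduces the original dimension of the feature set. Strictly speaking, PCA is feature extraction rather than feature selection: instead of keeping a subset of the original features, it produces compressed versions of them, known as principal components, such as PC1, PC2, PC3, and so on. Other approaches, including genetic algorithms and particle swarm optimization, can be used to search for a good subset of the original features.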
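A minimal sketch of the PCA step, assuming scikit-learn and NumPy are available. The random matrix stands in for a real feature table, and the choice of five components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 100 samples described by 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Standardize first so that no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# Compress the 20 original features into 5 principal components.
pca = PCA(n_components=5)
scores = pca.fit_transform(X_scaled)  # columns correspond to PC1..PC5

print(scores.shape)  # (100, 5)
print(pca.explained_variance_ratio_)  # share of variance captured by each PC
```

The explained variance ratio is one guide to how many components to keep: components that capture very little variance add little information.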
Once the prediction model is built, we apply it to data samples: given the input variables x, the model predicts the output variable y. The predicted values are then summarized as performance metrics. For classification, these include accuracy, sensitivity, specificity, and the Matthews correlation coefficient. For regression, they include the mean squared error, the root mean squared error, and R-squared, which indicates the goodness of fit.
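The metrics listed above can be computed directly from their standard definitions. The sketch below is plain Python with hypothetical function names and made-up labels and predictions. It assumes binary labels coded as 0 and 1, with both classes present in the true labels.

```python
import math


def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and R-squared for numeric predictions."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Made-up example values, for illustration only.
clf = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
reg = regression_metrics([10, 15, 20, 25], [11, 14, 21, 24])
print(clf)
print(reg)
```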
The best way to learn data science is to do data science. This article provides a high-level overview of the machine learning model building process.
In this video, I'm going to give you a quick, high-level overview of the machine learning model building process.
The machine learning model building process starts with the selection of an initial data set that you would like to use for model building. This initial data set could be a subset of the full data set. For example, if you have a basketball data set, you might select data for a particular team, or for a particular position of the players, such as only shooting guards.
One of the first steps that I normally take when I get my hands on a new data set is to perform exploratory data analysis (EDA). EDA gives you the opportunity to get a high-level overview of the data set. You can do that using standard statistical approaches. For example, you could create a simple scatter plot of some of the variables. Let's say that you're interested in predicting drug activity and you have a hypothesis that molecular weight might be correlated with that activity: you could make a simple scatter plot and look at the correlation. Or let's say that you have a hunch that variables x1 and x5 might be correlated: you could plot those too. You can iteratively create scatter plots and play around with the data. At this stage you don't have to worry about model building; you just skim through the data set, looking at the general distribution. You might also want to compute the mean and the standard deviation of each variable to get a glimpse of its relative variance. Typically, what I would do in a data science project is remove variables that have low variation.
The reason is that low variation implies that the variable will not provide enough information for the analysis, so we remove these descriptors. They are called redundant features: variables that have near-zero variance are removed.
Aside from standard statistical analysis, in doing EDA I would also use principal component analysis (PCA) and the self-organizing map (SOM). Both are unsupervised learning approaches. The advantage of PCA is that it gives you a glimpse of the distribution of the data. For example, the PCA scores plot shows the relative distribution of the data samples. If our data samples are basketball players, each player appears as a single dot in the scores plot, while the PCA loadings plot shows the variables, for example points scored per game, number of assists, number of rebounds, and so on. The self-organizing map is another unsupervised learning approach, meaning that you don't use the dependent variable; you use only the independent variables, the x variables.
After some initial exploratory data analysis, you would also perform data cleaning and data curation. It is worth mentioning that some form of data cleaning and data curation, as well as the removal of redundant features, comes before PCA or SOM, and then you do the exploratory data analysis. Once you have cleaned the data, you could call it the pre-processed data set or the curated data set.
A typical tabular data set is a data frame, say a pandas or R data frame, and it looks pretty much like a spreadsheet from Microsoft Excel or Google Sheets. Each column represents a variable: variable x1 in the first column, variable x2 in the second column, and then variable y. The x variables are the input variables, which you could also call the independent variables, and the output variable is the dependent variable. In a typical data science project, you provide pairs of input variables and output variable to the learning method, and the learning method finds a relationship between the input variables and their corresponding output variable.
For example, take a typical equation, y = f(x), meaning y is a function of x. The function could be a linear regression equation, y = mx + b. In this example, you have y = 5x + 5. If x is equal to 1, you plug 1 into x, so 5x is 5 multiplied by 1, which gives 5; then you add 5 and get 10. Therefore, y equals 10 given x equals 1.
For the model building process, in order to build the machine learning model we have to perform data splitting. The purpose of data splitting is to let us use a portion of the data set, which we call the training set, to build a prediction model, and then evaluate the prediction model on the test set, which is another portion of the data set. So from 100 you would split the data into two or more portions, depending on the methodology that you're using. A very simple example is the 80/20 split ratio: 80 goes to the training set and 20 goes to the test set. We use the training set to create the prediction model; we could use it to create the training model and also the cross-validation model, for example in n-fold cross-validation.
In building a prediction model, you also decide whether to use a support vector machine, deep learning, a gradient boosting machine, k-nearest neighbors, decision trees, random forest, etc. With each of these learning algorithms, you also have to look at the hyperparameters that govern the prediction process, so take a moment to figure out which hyperparameters you need to optimize. For example, in random forest, parameters that you could optimize include the number of trees and the maximum number of features to use.
Afterwards, you could also decide whether to perform feature selection. Your curated data set might have hundreds of features, and you might like to reduce them to a smaller number. For example, you could use PCA to reduce the original dimension from hundreds or thousands of features to a smaller, more manageable set. The features from PCA are compressed versions, called principal component 1 or PC1, then PC2, PC3, PC4, PC5, etc. That is one form of feature selection, and there are others, like the genetic algorithm, particle swarm optimization, etc.
Once you have built the prediction model, you apply the data samples, that is, the input variables, to the prediction model, and the model will spit out the output variable. In other words, it makes the prediction: given x, it predicts y. Once you have the predicted values, you use them to evaluate model performance by converting all of the predicted y values into a performance metric: for example accuracy, sensitivity, specificity, and the Matthews correlation coefficient if you're doing classification, or, if you're doing regression, the mean squared error, the root mean squared error, and R-squared, which indicates the goodness of fit.
So there you have it: the complete, high-level overview of the machine learning model building process. As always, the best way to learn data science is to do data science, and please enjoy the journey. Thank you for watching.