Assessing Model Performance in Classification Problems
So we have an answer, but what does it mean? We need to assess the model performance by calculating some metrics. Since this is a classification model, there are four possible outcomes. The first outcome is that they didn't buy any organics in real life and we also predicted they didn't buy organics; that's a great outcome, we got the answer correct. The other good case is that they did buy organics and we predicted that they would; again, we got it correct. Unfortunately, there are two bad cases. The first is that they didn't buy organics but we predicted they did; that's called a false positive. The final case is that they did buy organics but we predicted they didn't; that's called a false negative. We can count how many cases we got of each of these four outcomes using the confusion_matrix function. It takes two inputs: the first is the actual responses, which we called responses_test, and the second is the predicted responses to compare against. Let me run this, and there we go: we get a two-by-two matrix. The good cases are on the diagonal and the bad cases are on the off-diagonal. In this case there were 2,940 customers who didn't buy organics where we predicted that correctly, and 482 cases where they did buy organics and we predicted that correctly. So the four values in the matrix correspond to these four cases.
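The confusion-matrix step described above can be sketched as follows. The variable names (`responses_test`, `predicted_responses`) follow the tutorial, but the tiny arrays here are made up so the example is self-contained:

```python
# Minimal sketch of the confusion-matrix step. The toy arrays below are
# hypothetical stand-ins for the tutorial's test-set responses.
from sklearn.metrics import confusion_matrix

responses_test      = [0, 0, 1, 1, 0, 1, 0, 1]  # actual: did they buy organics?
predicted_responses = [0, 1, 1, 0, 0, 1, 0, 1]  # the model's predictions

cm = confusion_matrix(responses_test, predicted_responses)
print(cm)
# Rows are actual classes, columns are predicted classes:
# cm[0][0] = true negatives and cm[1][1] = true positives (the diagonal);
# cm[0][1] = false positives and cm[1][0] = false negatives (the off-diagonal).
```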
Opening that little table, we have the bad cases on the off-diagonal: 670 false positives and 189 false negatives. Now, it's really difficult to tell just from the confusion matrix whether these numbers are good or bad, so there are actually lots of different metrics you can use to assess the performance of a model, and classification_report will print a lot of these. The output is pretty dense: four different columns and lots of numbers, but typically there are only five numbers that you care about. The easiest of them, and a good place to start, is model accuracy. This is the fraction of values that were correctly predicted: the number of true negatives plus the number of true positives, divided by all the cases. Going back to the matrix, that's the true negatives (2,940) plus the true positives (482), divided by all four counts added together, which is basically the number of cases in the testing set. A bit of arithmetic then gives us the model accuracy. The other four metrics are precision and recall for each of the two classes. I won't get into these in detail here; I've left them written down, so if you want to follow along in Workspace you can look at them at your leisure. Roughly, they divide up different ways of assessing how many false negatives and false positives you have.
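The accuracy arithmetic above can be checked directly with the four counts from the confusion matrix:

```python
# Accuracy from the confusion-matrix counts quoted in the tutorial:
# 2,940 true negatives, 482 true positives, 670 false positives,
# 189 false negatives.
tn, tp, fp, fn = 2940, 482, 670, 189

# (correct predictions) / (all cases in the testing set)
accuracy = (tn + tp) / (tn + tp + fp + fn)
print(round(accuracy, 2))  # about 0.8, matching the classification report
```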
So let's print out this classification report. It takes the same arguments as confusion_matrix: the actual responses and the predicted responses. I'm going to copy and paste these again so you can see what it looks like. It's actually going to print really badly, because classification_report returns a string, so you have to wrap it in a call to print; then it looks a bit nicer. You can see the accuracy score is 0.8, so the model is right about 80% of the time. Is that good or bad? I don't know the retail scenario well enough to say whether that's any good, and to be fair there are probably ways of improving the model, because we did something very simple. The other four useful numbers I mentioned, the precision and recall values, are in the top left if you want them. For now, we'll just stick with an accuracy of 0.8.
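As a sketch, the report step looks like this. Toy arrays again stand in for the tutorial's `responses_test` and `predicted_responses`:

```python
# classification_report returns a plain string, so wrap it in print()
# to render the line breaks properly. The arrays are hypothetical.
from sklearn.metrics import classification_report

responses_test      = [0, 0, 1, 1, 0, 1, 0, 1]
predicted_responses = [0, 1, 1, 0, 0, 1, 0, 1]

report = classification_report(responses_test, predicted_responses)
print(report)  # per-class precision/recall/f1-score plus overall accuracy
```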
If You Enjoyed This Tutorial
If you've enjoyed this tutorial, DataCamp has a lot more courses that cover logistic regression in Python. A good place to start is Machine Learning with scikit-learn, which provides a general introduction to modeling with scikit-learn and does cover logistic regression. There are also two courses that cover regression, both linear and logistic, in a lot more depth: Introduction to Regression with statsmodels in Python and Intermediate Regression with statsmodels in Python. While they use a different package, statsmodels, they get into a bit more detail about how the algorithms work and how you assess model performance. I hope you've enjoyed this tutorial. Look out for more tutorials coming from DataCamp, please do take these courses, and please do have a play around with Workspace. Thank you!