Python Machine Learning Tutorial: Handling Missing Data | Databytes

**Handling Missing Data with Scikit-Learn: A Tutorial**

When working with datasets, missing data can be a significant challenge. It's essential to handle it properly to ensure accurate results and reliable conclusions. In this tutorial, we'll explore how to handle missing data using scikit-learn, a popular machine learning library in Python.

**Numeric Columns**

For numeric columns, one common approach is to use the mean as the imputation strategy. This method assumes that a missing value can reasonably be replaced by the average of the known values in the same column. Scikit-learn implements this through `SimpleImputer`, and the same result can be reproduced with pandas's `fillna`, which is a useful way to check that the code is correct.
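
As a sketch of what this looks like in code, here is mean imputation with scikit-learn's `SimpleImputer`. The toy frame and its column names (`msleep_num`, `sleep_total`, `brainwt`) are assumptions for illustration, echoing the msleep dataset used in the transcript below:

```python
import pandas as pd
import sklearn.impute as si

# Toy numeric frame with gaps (names and values assumed for illustration)
msleep_num = pd.DataFrame({
    "sleep_total": [12.1, 17.0, None, 14.9],
    "brainwt": [None, 0.0155, 0.0035, None],
})

# strategy="mean" (the default): fit computes each column's mean,
# and transform drops that mean into the column's missing slots
imp = si.SimpleImputer(strategy="mean")
imputed = imp.fit_transform(msleep_num)  # scikit-learn returns a NumPy array

# Convert back to a DataFrame, borrowing the original column names
msleep_num_imputed = pd.DataFrame(imputed, columns=msleep_num.columns)
print(msleep_num_imputed)
```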

**Caveats for Numeric Columns**

When imputing numeric columns with the mean, two caveats apply. First, like dropping rows, mean imputation only works well when there is a small amount of missing data. Second, it reportedly performs better when a model uses many features (hundreds or thousands), because the effect of any individual imputed feature then becomes negligible; this is a heuristic rather than a quantified result.
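
As a sanity check, the same imputation can be reproduced in plain pandas: compute the column means (the "fit" step), then pass them to `fillna` (the "transform" step). A minimal sketch, reusing the assumed `msleep_num` frame from above:

```python
# "Fit" step: compute the mean of each numeric column
col_means = msleep_num.mean()

# "Transform" step: fill missing values with those means
msleep_num_filled = msleep_num.fillna(col_means)

# This should match SimpleImputer's output exactly
print(msleep_num_filled)
```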

**Why Values Are Missing**

Another aspect to consider is why the values are missing. When data is missing not at random, that is, when there's a pattern to the missingness driven by variables that aren't in the dataset, imputing values may not be valid at all. In those cases you need to think carefully about how the data was collected and whether unrecorded variables could be biasing which values are missing.

**Categorical Columns**

When working with categorical columns, the mean doesn't make sense, as categories aren't numerical values. In this case, we can use the mode, the most frequent value, as the imputation strategy. To do this, we create another `SimpleImputer` object called `mf` (for most frequent) and set the strategy argument to "most_frequent". This will fill in missing values with the most frequent value in each column.

**Example with Categorical Columns**

Let's take a look at how we can use this approach for categorical columns. We create the `mf` imputer object with strategy="most_frequent" and pass our categorical columns to its `fit_transform` method. The resulting NumPy array is then converted back to a pandas DataFrame, borrowing the column names from the original dataset.
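
A minimal sketch, again with a made-up categorical frame (`msleep_cat` and its values are assumptions echoing the msleep data):

```python
import numpy as np
import pandas as pd
import sklearn.impute as si

# Toy categorical frame with gaps (values assumed for illustration)
msleep_cat = pd.DataFrame({
    "vore": ["herbi", "carni", np.nan, "herbi"],
    "conservation": ["lc", np.nan, "domesticated", "lc"],
})

# "most_frequent" fills each column's gaps with that column's mode
mf = si.SimpleImputer(strategy="most_frequent")
imputed = mf.fit_transform(msleep_cat)  # NumPy array of objects

# Borrow the column names back from the original frame
msleep_cat_imputed = pd.DataFrame(imputed, columns=msleep_cat.columns)
print(msleep_cat_imputed)
```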

**Iterative Imputer**

A more sophisticated approach is to use the iterative imputer. This method fits a predictive model for each column, using the other columns as features, and repeats the process several times so that the predictions, and the filled-in values, improve with each pass. The default model used by the iterative imputer is Bayesian ridge regression, but you can swap in other models if needed.

**Enabling Experimental Code**

As of version 1.0.2, scikit-learn's `IterativeImputer` is considered experimental. To enable it, we import `enable_iterative_imputer` from `sklearn.experimental`; the import itself switches the feature on, so it must run before `IterativeImputer` is used. We can then create our `IterativeImputer` object and call its `fit_transform` method on our dataset.
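
The enabling import looks like this; note that it has to come before anything that touches `IterativeImputer`:

```python
# Importing this module is what enables the experimental feature;
# it must run before IterativeImputer is referenced anywhere
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
```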

**Example with Iterative Imputer**

Here's an example of how we can use the iterative imputer for numeric columns. After enabling the experimental code, we create our `IterativeImputer` object, call its `fit_transform` method, and get a NumPy array as output. We then convert this back to a pandas DataFrame with `pd.DataFrame`, passing the original column names via the `columns` argument.
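
A self-contained sketch, again with a made-up numeric frame (names and values are assumptions). Unlike mean imputation, each missing value gets its own prediction based on the other columns in that row:

```python
import numpy as np
import pandas as pd

# This import must run before IterativeImputer is used
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
import sklearn.impute as si

# Toy numeric frame with gaps (names and values assumed for illustration)
msleep_num = pd.DataFrame({
    "sleep_total": [12.1, 17.0, np.nan, 14.9, 4.0],
    "sleep_cycle": [np.nan, np.nan, 0.38, 0.33, 0.67],
    "brainwt": [np.nan, 0.0155, np.nan, 0.0045, 0.423],
})

# Default estimator is BayesianRidge; each column is predicted from
# the others, repeating until the imputed values stabilize
it = si.IterativeImputer(random_state=0)
imputed = it.fit_transform(msleep_num)  # NumPy array

msleep_num_imputed = pd.DataFrame(imputed, columns=msleep_num.columns)
print(msleep_num_imputed)
```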

**Conclusion**

Handling missing data is an essential part of working with datasets in machine learning. While there's no one-size-fits-all approach, we can impute the mean for numeric columns, impute the mode for categorical columns, or use more sophisticated methods like iterative imputation for numeric data. By following the steps outlined in this tutorial, you'll be able to handle missing data effectively and improve the accuracy of your machine learning models.

"WEBVTTKind: captionsLanguage: enin this tutorial we can look at how to handle missing data that means pre-processing your data set in order to get it ready for running a machine learning model or doing other analyses before we start coding let's cover some theory first of all you might wonder why you need to care about missing data and the shorter answer is that if you don't know some of the values in your data set you can't calculate summary statistics and you can't run most machine learning models now there are an almost infinite number of reasons why data could be missing for example if you have survey data then maybe someone declined to answer a question if you're looking at website click data then maybe some users installed privacy tools you couldn't track them if you have sensor data then maybe a sensor wasn't working or maybe a signal was too small to detect and the point is that you need to understand how your data was collected and what missing values might mean in order to decide how to best deal with it in terms of handling missing data you essentially only have two choices so you can either just ignore it that is anytime you have a missing value uh in a row of your data set you just drop that row or you can try and make up values wherever you have any kind of missing value and rather than just guessing or using a random number we're going to put a bit of science behind it and the art of sort of guessing reasonable values is called imputation now there are basically five steps for handling missing data and the first one is to standardize how the missing values are recorded and after that you can quantify how much missing data you have and this is going to help you determine what strategy you're going to use for dealing with the missing data and thirdly as mentioned before you can either delete the missing data or you can make values up that is impute the missing values fourthly you're ready to run your model or do whatever other analysis you want and finally you want to be able to check what effect those imputed values had on your model so in this tutorial we're going to cover the first three steps now because handling missing data is essentially a data manipulation technique uh you can use pandas we're going to do some of that here uh also any good machine learning framework is going to have tools for imputation so we're going to use psychic learning in this case you can also use pi carat and it's worth noting that there are a few other sort of niche python packages available for doing specific tasks with the missing data uh to manage your expectations um time series aren't going to be covered here so when you have time series data uh observations at nearby time points can be related that means that time series picked up their own set of special methods for dealing with missing data likewise if your values are missing because they exceed some kind of threshold you are doing survival analysis and that's a whole separate set of techniques also worth noting we're not going to get into multiple imputation here so that's a sort of more sophisticated technique and it allows you to estimate the uncertainty in your model due to the imputed values that you've used so we're not going to get into that here and finally it's worth noting there are different categories of missing data so there's one called missing not at random and that means that there's some sort of pattern to the missingness and that's caused by variables that aren't included in the data set so that means that imputation 

Let's try a case study. We're going to look at a dataset on mammalian sleep durations, the msleep dataset. It's been around a while and is fairly popular; I believe it was originally compiled from both a scientific paper and Wikipedia, so there are varying levels of trustworthiness. It was popularized by R's ggplot2 package, and in Python it's available via plotnine.

The data comes as a DataFrame, so we start by importing pandas as pd, and from plotnine.data we import msleep. I've added a little extra code to build a modified copy, msleep_dirty, in order to demonstrate how to standardize missing values; I won't explain that code, but let's run it and look at the dataset. We have 83 rows, each corresponding to a species of mammal: the name, what they eat (carnivore, herbivore, and so on), the IUCN conservation status, some metrics about how much they sleep each day, brain weight, and body weight. You can see missing values in some of these columns.

On the point about missing not at random: that would be something like the sleep cycle being affected by a variable that isn't in the dataset, for example the location of the species. If they live in a jungle, we might know less about their sleep cycle, so it's more likely to be missing. In that sort of case you have to think carefully about what question you're answering with your data and whether these unrecorded variables might bias how things are missing.

A couple of things to note. In the conservation column, missing values are recorded as "unknown", which is a non-standard way of recording a missing value. Likewise in the sleep_rem column, which measures how much the animal dreams per day, missing values are recorded with the numeric code -999, so if we try to take a mean of that column it will be complete nonsense. The first thing to do is deal with the non-standard missing values in these two columns.

Standardizing missing values means converting strings like "unknown", or numeric codes like -999, to a true NA, and by true NA I mean NumPy's not-a-number (nan) value. Pandas also has its own special value for recording missing data, which it calls NA, but unfortunately NA isn't widely supported across all of pandas's methods, let alone by other packages, so it's a little safer to use the NumPy value, at least for now. So from numpy we import nan, and to standardize the missing values we use the replace method: we create a new variable, msleep, starting from msleep_dirty, and call replace with a dictionary whose keys are the columns where we want to replace values, "unknown" in the conservation column and -999 in the sleep_rem column, and whose replacement value is nan.
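
A sketch of that standardization step, assuming plotnine is installed. The construction of msleep_dirty wasn't shown in full in the video, so the two fillna lines are a guess at it:

```python
import pandas as pd
from numpy import nan
from plotnine.data import msleep

# Hypothetical "dirtied" copy with non-standard missing codes, as in the video
msleep_dirty = msleep.copy()
msleep_dirty["conservation"] = msleep_dirty["conservation"].fillna("unknown")
msleep_dirty["sleep_rem"] = msleep_dirty["sleep_rem"].fillna(-999)

# Standardize: per-column mapping of the non-standard codes to a true NaN
msleep = msleep_dirty.replace(
    {"conservation": "unknown", "sleep_rem": -999},
    nan,
)
print(msleep)
```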

Let's run this and print it out. If we scroll over a little, the conservation column now says null where it said "unknown" before, and sleep_rem says null where it said -999. One important thing to note: depending on which Jupyter notebook editor you're using, this will sometimes be displayed as null and sometimes as NaN, usually with capital Ns; it means the same thing, just a different display.

Now that we have standardized values, we can quantify them. We'll use two pandas methods: isna, which returns True or False depending on whether a value is missing, and mean, because if you take the mean of True and False values you get the proportion of Trues. So we start with msleep, call isna, and chain it with mean. A value of zero means no missing data, so you can see all the names are present. In the vore column the number is pretty low, around 0.08, meaning eight percent of the values are missing. Sleep_cycle is much higher, with 61 percent of values missing. A value of one would mean all the data in that column was missing.

As mentioned before, the easiest thing you can do with missing data is get rid of it. If we call the dropna method on msleep, it removes any row with at least one missing value. Let's see what happens: same columns, but now we only have 12 rows instead of 83. Even with fairly moderate levels of missing data, that's pretty disastrous; we've lost most of our dataset, and that's really not ideal. The important takeaway is that dropping data is only a suitable solution when you have a very small amount of missing data; in general you're going to have to impute rather than just ignore it.
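
Continuing the sketch, the quantify-and-drop steps look roughly like this:

```python
# Proportion of missing values per column: isna() gives booleans,
# and the mean of booleans is the fraction that are True
print(msleep.isna().mean())

# Dropping every row with any missing value is drastic here:
# the 83-row dataset shrinks to about a dozen rows
print(len(msleep.dropna()))
```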

In scikit-learn, the way you deal with missing data in numeric columns is different from the way you deal with missing data in unordered categorical columns, so we have to split the dataset in two. For ordered categorical data, like the conservation column, which runs from something like least-concern all the way down to extinct, you can either treat it the same way as unordered categorical data, or do a trick where you use OrdinalEncoder to turn the ordered categories into integers and pretend it's a numeric column. For simplicity we'll take the former approach. There's a lot of subtlety in deciding between them; it really depends on what problem you're trying to solve and what you believe about these ordered categorical columns, but we'll do the simple thing and treat all the categorical data the same way.

So the next step is to split the dataset into two by column type, using the select_dtypes method. We create msleep_num, num for numeric, starting from msleep, calling select_dtypes, and selecting the float columns, float as in floating-point numbers; printing it shows six columns. Let's do the same for the categorical variables with msleep_cat. This time there are two options, since "object" means a text column and "category" means a categorical column, and we get the other five columns.
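
A sketch of the dtype split, following the video's msleep_num / msleep_cat naming:

```python
# Numeric columns: in this dataset, all numeric columns are floats
msleep_num = msleep.select_dtypes("float")
print(msleep_num.columns)

# Categorical columns: both plain text ("object") and pandas Categorical
msleep_cat = msleep.select_dtypes(["object", "category"])
print(msleep_cat.columns)
```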

The simplest form of imputation, for numeric columns, is a pretty crude approach, but it's OK in some cases: take the mean of the known values in a column and use that value for the missing ones. We've got two options here: scikit-learn has SimpleImputer, and in pandas it's called fillna. We'll focus on SimpleImputer because it makes it easier to transition to the other imputer types we'll use later, but I think fillna shows what's going on a bit more clearly, so we'll use that to check our working.

First we need the impute submodule of scikit-learn, so we do import sklearn.impute as si, then create a SimpleImputer object with si.SimpleImputer. The way we do the imputation is by calling its fit_transform method: the fitting part just means calculating the means of the columns, and transforming means replacing the missing values with those means, so this method does both steps in one go. We call fit_transform and pass it msleep_num. Slightly annoyingly, scikit-learn really only cares about NumPy arrays, so it gives us a NumPy array back, which is awkward to work with; we convert it back to a DataFrame with pd.DataFrame, stealing the column names from the msleep_num DataFrame via its columns attribute.

Now we have a DataFrame again, and where the first three values of sleep_cycle were missing in the previous dataset, you can now see a value of about 0.44: that's the mean of the known values in that column, and it has just dumped that mean into all the missing slots. Same with brainwt, where the mean of the column, about 0.28, has been put in all the missing cases.

Let's try this again with pandas, where you can really see how it works. First we calculate the mean of each column into a variable, col_means, by calling mean on msleep_num: that's the fit step. Then the transform step is replacing the missing values with those means: we call fillna on msleep_num, filling with col_means. Here you can see exactly the same situation: 0.44 in sleep_cycle, 0.28 in brainwt. We've got the same answer twice, so I'm hoping our code is correct.

There is a caveat: because just dropping the mean in is a crude approach, it's only really appropriate when you have a small amount of missing data, just like dropping rows. I have heard anecdotally that this method performs better when you have lots of features, hundreds or thousands or maybe even more, because the effect of any individual feature is then pretty small. I haven't tried this myself, and I haven't seen quantitative research on how much better it gets, but that's the heuristic I've heard: it only makes sense when you have lots and lots of features.

That was the case for numeric columns, so now let's look at categorical columns. If you've got categories, the mean doesn't make sense, so one thing we can do is use the mode instead, meaning the most frequent value in each column. We create another SimpleImputer, this one called mf for most frequent, and set the strategy argument to "most_frequent" so it puts the mode in. Same situation as before: we call fit_transform, passing msleep_cat this time, and again it returns a NumPy array, so we convert with pd.DataFrame and borrow the columns from msleep_cat. This time, for example in the vore column, the most common value was herbivore, so wherever there was a missing value it has filled in "herbi". This isn't a perfect approach, but it's about the best you can get with scikit-learn's SimpleImputer, and it's the standard approach that gets used.

A more sophisticated option is a different imputer called IterativeImputer. Again, this is just for numeric columns. What it does is fit a predictive model, by default a BayesianRidge regression, though you can change the model type, one column at a time, using the other columns to try to predict that column. By repeating this process several times, more and more missing data gets filled in and the predictions get better and better, until eventually all the missing values are filled. Because it's more sophisticated, it should probably be your starting point for most imputation with scikit-learn. One thing to note: as of version 1.0.2, which is what I'm using here, it's considered experimental, so you need to enable it before using it. To enable it, we start from sklearn.experimental and import a thing called enable_iterative_imputer. Then we can create our imputer object with si.IterativeImputer, and just as before we call the fit_transform method, passing it msleep_num. Again it gives back a NumPy array, so we do the conversion to a DataFrame, getting the columns from msleep_num.columns.

Looking at the results this time: whereas before the first three missing sleep_cycle values were all replaced with exactly the same 0.44, the iterative imputer gives a slightly different value in each case, predicted from that row's other columns, and this often gives you better performance.

So now we've filled in all the missing values and we're ready to run a model, which brings us to the end of this tutorial. Imputation is such a big, complex topic that there's much more to learn; DataCamp in fact has at least three courses on it. I'd start with Dealing with Missing Data in Python, and there's also material on handling missing values in Cleaning Data in Python and Machine Learning with scikit-learn. Pandas has its own tutorial on working with missing data, and scikit-learn has a tutorial on imputation. So there's lots more to learn here, but I hope this is enough to help you get started handling your missing data. Thanks.