How to handle missing data with Pandas

**Dropping Missing Data in Python**

One of the easiest ways to handle missing data is by using the `dropna()` function in pandas. This function allows you to drop rows or columns that contain missing values.

To use it, call `dropna()` on the DataFrame itself. By default it drops rows, and passing `axis=1` drops columns instead. For example, if you have a DataFrame called `df` and you want to drop all rows that contain missing values, call `df.dropna()`. This returns a new DataFrame with all rows containing missing values removed.

For instance, let's say we have a DataFrame called `df` with 333 rows. We can use `dropna()` to drop all rows that contain at least one missing value. Doing so leaves only 325 rows, because the eight missing values sit in eight different rows.

A smaller, self-contained example shows the same operation:

```python
import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({
    'role': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'species': ['A', None, 'C', 'D', 'E', 'F', 'G', 'H']
})

# drop rows with missing values
df = df.dropna()

print(df)
```

In this example, we first create a sample DataFrame called `df`. We then use `dropna()` to remove all rows that contain at least one missing value. Finally, we print the resulting DataFrame, which now has seven rows because only the row with the missing `species` value was removed.
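It can help to count the missing values before dropping anything, and to see how dropping columns differs from dropping rows. The following is a minimal sketch on a small made-up DataFrame (the column names and values are illustrative assumptions, not the penguins data). It counts missing values with `isna()` and then uses the `axis` and `subset` parameters of `dropna()`.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'species': ['A', None, 'C', 'D'],
    'island': ['X', 'Y', None, 'Z'],
    'mass': [3.5, 4.0, 4.2, 3.9],
})

# Count the missing values in each column
missing_per_column = df.isna().sum()

# Count the rows that contain at least one missing value
rows_with_missing = int(df.isna().any(axis=1).sum())

# Default behavior: drop every row with at least one missing value
rows_dropped = df.dropna()

# axis=1: drop every column with at least one missing value
cols_dropped = df.dropna(axis=1)

# subset: only consider the listed columns when deciding which rows to drop
subset_dropped = df.dropna(subset=['species'])

print(missing_per_column)
print(len(df) - rows_with_missing, len(rows_dropped))
print(list(cols_dropped.columns))
print(len(subset_dropped))
```

The number of rows left after `dropna()` equals the original row count minus the number of rows with at least one missing value, which is the same arithmetic as 333 minus 8 giving 325.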

**Replacing Missing Data**

Another way to handle missing data is by replacing it with a new value. This can be useful if you want to impute missing values in a specific way.

For example, let's say you have a numerical column called `salary` and you want to replace all missing values with the mean of the column. You can do this using the following code:

```python
import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({
    'role': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'salary': [50000, None, 60000, 70000, 80000, 90000, 100000, 110000]
})

# replace missing values with the mean of the column
df['salary'] = df['salary'].fillna(df['salary'].mean())

print(df)
```

In this example, we first create a sample DataFrame called `df`. We then use `fillna()` to replace all missing values in the `salary` column with the mean of the column, which is 80,000 here because `mean()` skips missing values by default. Finally, we print the resulting DataFrame to see the updated values.
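When a DataFrame mixes numerical and text columns, one fill value rarely suits every column. The sketch below, on made-up data, fills the numerical columns with their own means by passing `df.mean(numeric_only=True)` to `fillna()`, then fills a text column with a placeholder string. The column names and the `'-'` placeholder are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'species': ['A', None, 'C', 'D'],
    'salary': [50000.0, None, 60000.0, 70000.0],
    'age': [30.0, 40.0, None, 50.0],
})

# Compute the mean of each numerical column (text columns are skipped)
column_means = df.mean(numeric_only=True)

# Passing a Series to fillna fills each column with the value under its label;
# columns without a matching label (here 'species') are left as they are
filled = df.fillna(column_means)

# Fill the text column separately with a placeholder
filled['species'] = filled['species'].fillna('-')

print(filled)
```

Filling text columns one at a time keeps the choice of replacement explicit for each column.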

**Saving the Updated DataFrame**

By default, methods like `dropna()` and `fillna()` do not modify the DataFrame they are called on; they return a new DataFrame instead. To keep the result, assign it back to the original variable, assign it to a new variable, or pass `inplace=True`.

If you pass `inplace=True` when calling `dropna()`, the changes are made directly to the original DataFrame, so no reassignment is needed. If you omit this parameter, the method returns a new DataFrame with the updated values, which you need to assign to a variable in order to keep it.

To illustrate this, here is an example that passes `inplace=True` when calling `dropna()`:

```python
import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({
    'role': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'species': ['A', None, 'C', 'D', 'E', 'F', 'G', 'H']
})

# drop rows with missing values, modifying df directly
df.dropna(inplace=True)

print(df)
```

In this example, we use the `inplace=True` parameter when calling `dropna()`, which means that the changes will be made directly to the original DataFrame. As a result, when we print the resulting DataFrame, we can see that it has been updated correctly.

On the other hand, if we don't specify the `inplace=True` parameter, the function will return a new DataFrame with the updated values, which we need to assign back to the original variable:

```python
import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({
    'role': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'species': ['A', None, 'C', 'D', 'E', 'F', 'G', 'H']
})

# drop rows with missing values and keep the result in a new variable
new_df = df.dropna()

print(new_df)
```

In this example, we assign the result of `dropna()` to a new variable called `new_df`. When we print `new_df`, the row with the missing value is gone, while the original `df` still contains all eight rows.
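The difference between these approaches can be checked directly by comparing row counts. The sketch below uses a small made-up DataFrame; the variable names `rows_after_discarded_call` and `rows_after_inplace` are introduced here purely for illustration.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'role': ['1', '2', '3'],
    'species': ['A', None, 'C'],
})

# The result is not assigned, so df itself is unchanged
df.dropna()
rows_after_discarded_call = len(df)

# Assigning the result keeps the cleaned data in a new variable
new_df = df.dropna()

# inplace=True modifies df directly
df.dropna(inplace=True)
rows_after_inplace = len(df)

print(rows_after_discarded_call, len(new_df), rows_after_inplace)
```

Assigning to a new variable keeps the original data available for comparison, which can be useful while exploring a dataset.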

**Conclusion**

Dropping missing data is one of the simplest ways to handle missing values in pandas DataFrames. With `dropna()`, you can remove rows or columns that contain missing values. Alternatively, `fillna()` replaces missing values with a value of your choice, such as the column mean, when you want to impute them in a specific way. To keep the updated DataFrame, assign the result back to the original variable, assign it to a new variable, or pass `inplace=True`.

"WEBVTTKind: captionsLanguage: enIn this video, we're going toexplore how to handle missingdata in pandas and without furtherado, we're starting right now.So the first thing that you wantto do is to import pandas as PD.And the data set that we're goingto be using in this tutorialwill be the penguins data set.And so it is available hereat the Data Professor GitHub.And so let's have a lookinside the code cell here.So we're going to use the PD dot, readCSV function in order to read the contentsof the CSV file, which is located here.And so here we could just copy andpaste the, GitHub hosted penguinsdataset CSV and then we could run it.And then the resulting contentwill be assigned to the DF.And now let's have a look inside theDF variable, and then you're goingto see the DataFrame listed here.So let's scroll down a bit and you'regoing to see that the number of rolesis 333 and that there are seven columns.And so it should be noted here that thisparticular data set has been cleanedby myself a while back, and I've hostedthis on the GitHub of the Data Professor.And therefore, in order to allow usto handle missing data, we're nowgoing to add missing data into this.So this is for another tutorial.If you're going to readit in using Excel file.So we're just going to delete this.Alright, here you go.So in pandas you couldselect missing values.Or you could also select the inverse ofthat, which is the non missing values.And so in order to select themissing values, we're goingto use the isnull function.And then we're going to combinethat with the sum function.And so what this will do isit will take the DF DataFrame.And then it will apply the functionisnull, which will compute theBoolean values for each cell inthe data frame to determine whetherthey have a missing value or not.If there is a missing value, then thebullying value will return as true.However, if there is no missing value,it will return as False and then based onthe resulting DataFrame, you could thenuse the sum 
function in order to do acount of all columns and count how manymissing values are in each of the column.And so the results here show us thatthere is no missing value because Ihave already mentioned that I havealready cleaned this a while back.And therefore we're going to addmissing values to the datasets.And so in order to do that,we're going to use the iloc orthe index location function.Like for example, if we specify DF,which is the data frame that we wantto apply it to dot iloc, and then wespecify the numbers, 0 and 0, whichmeans the first number is the row.Which has the index numberof 0, which is the first row.And then the second numberhere indicates the column.And so it indicates the firstcolumn and therefore it is thefirst row and the first column,which returns the value of Adele.So let me show you thecontents of the DF data frame.And so you could see that the first rowand the first column has a value of Adele.And so in the following blockof code here, we're going tomanually add missing values to it.And so we're going to have it forseven particular locations, andthen we're going to use PD dot NAand so NA is the missing value.And therefore the first one is atposition 0, 0, which is right here.And then at position 5 and 1, whichmeans row number 6, because theindex is 5 and then one is islandcolumn here, and then we're goingto have for index row number 10.So it will be 0, 1, 2,3, 4, 5, 6, 7, 8, 9, 10.And then column number 2.Meaning the index locationtwo, we said 0 1, 2.So it is this particular column and thenit's going iterate through, for the rest.So we're going to add a totalof 7 missing values manually.So let's have a look,let's now do the check.So in order to do the check for themissing value or using the isnulland therefore it will return avalue of true, as I have alreadymentioned, if there is a missing value.And so, because we have alreadyused the position, there waszero to add the missing value.Therefore it returns to be True.However, it is not so 
intuitive tolook through the entire DataFrame hereto see which one is True or False.Right.So there must be a better way, but I'mgoing to show you that in just a moment.So we can see here that we usedisnull function and there is anotherfunction called isna and so itwill give you the exact same results.So either isnull or isna will giveyou the results of the missing value.And so in order to display only the rowsthat contains at least one missing value,we could use the following line of code.So, what we're doing here is we'regoing to use the DF and then we'regoing to the bracket here, which isthe selection or the slicing operator.And then inside here, we'regoing to specify DF dot S now.So it will determine the missingvalue as mentioned here, and thenwe're going to use the any function.And then ask input argument.We're going to use axis equals toone axis when it is equal to one,it means that we're specifying thecolumns and access equal to zero.It means that we're specifying the roles.And so what this would do is foreach column, it will determinewhether there is a missing value.And if there is, then it will displaythe particular role that's shown here.NA NA.And so we're going to see that for eachcolumn it has one missing value as towe have already performed up at thetop here, right here for each of thecolumns, we have a missing value.added.And so this one will display onlythe roles with the missing value.And then we're going to do the functionagain, where we combine the isnullfunction together with the sum function.And so we're going to do thesummation for each column.How many missing values are there?And because each of them has one missingvalue, then it shows all of them.So what if I add another one?Let's see.What if I add four column withthe index location of six?Let me add another one.We're going to add here.Row number 40 index ID of 40, andthen we're going to have it forcolumn with index number of six.And then let's see it again here.And so, number six here will be 
the sexcolumn and therefore you're going to seethat it will have now two missing values.So let's run it.There you go.Now there is two missing values becauseup at the top here we had explicitlymanually added two missing valuesfor the sex column.right here Right.And so let's move further.And what about the opposite function?The non-missing value.So instead of using isna, we could usenotna, or there is another opposite,which isnull, could be notnull.And so they're the exact opposite.So at the top you could see that.If there is a missingvalue, it will be True.And then the rest will be False.But then for this particular function,notna, you're going to have the missingvalue to be False because this isdoing a check for non-missing value.And so if there is non missing value,it will return the value of True whileif there is a missing value, it willreturn a value.of False All right.And so when we have missingvalues or missing data, howare we going to handle that?So you have two options here.You can either drop the missingdata, or you could fill itin with a replacement value.So let me show you how youcould drop the missing data.So this might be the easiest way,because if there is any missing value,you will drop it, as simple as that.Alright, and so it's very simple.You just use DF dot drop N a and thenall of the missing data will be dropped.And so what it will do is thatit will drop roles that haveat least one missing value.And so at the beginningyou have 333 roles.And therefore we're droppingall of the missing value.Therefore we have 325 roles nowbecause there were let's count here.We have 1, 2, 3, 4, 5, 6, 7, 8.So there's a total of eight missing value.And the original data here has 333.And if you take 333 minus 8, you get 325,which is the number shown below here.Let me show you right here, 325 rows.Right?So the point of note here is thatwhenever we use the functions,DF dot dropna and then we takea look at the DF variable again.We could see that it's still has 333 rows.So 
what happened here?Didn't we just dropped it, butthen when we return it again, ithas the original number of 333.And the thing lies in the fact thatalthough we dropped the missing data,but then we haven't saved it yet.And so in order to save this, youwould have to do something like this.Okay.But instead of doing that, or you couldeither give it a new name like this,and then you could continue with df2and another way is to use the inplaceequals to True so what does this do?Is that when you run it and then youtake a look at the original DataFrame,you will see now that it has implementedthe dropping of the missing data.Okay.So whenever you have in place equalsto True as the input argument to thedropna function, it will perform thedropping on the original data frame.However, in the above here, whenwe did not use any input argument,and then we just use it by default,it will just display the DataFramethat it has implemented the droppingfunction, but then it did not save it.Okay.So it just shows you here, butthen it did not save anything.And so in order to save the droppedDataFrame you will need to useinplace equals to True, or you wouldneed to assign it to a new variable.All right.And so let's continue.And so aside from dropping themissing data, we could now alsodo replace the missing data.And so what you will essentially do iswhenever there is missing data, you couldfill it in with a replacement value.And so let's take a look here.You could replace the missingvalue with an arbitrary number.So this could be anything.It could be 0.It could be -1.Or it could be any othervalues of your interest.So here let's do 0Okay.But we have already dropped it.So let's, let's add it again.Let's add the missing value again,because we had already cleaned the data.So let's.Have missing data again, andnow let's implement the fillna.Okay, let's do that.And now you're going to seethat all of the missing datawill now have a value of 0.It might be okay for the numericalcolumns, but then it did 
notwork for the species column.Okay.So you might need to do a separately.You might need to fillna forthe particular column, like,for example, you could do this.You could have, let meadd another code cell.You could do DF dot speciesdot fillna and then you couldhave like, something like that.Okay.So you have it alreadyadded here the dash.Okay.So in order to assign this, you wouldneed to say inplace equals T rue.And then when we take a look at the DFDataFrame again, you should have righthere, the data that you had filledin, is it, which is the dash here.So what if you want to fill in themissing value with the mean of the column?So it will work for allof the numerical columns.So whenever there is a missing value,it will replace a missing valuewith the columns' mean if you use itlike this, if you specify the DF dotmean to be the input argument field.And then you could go back andmanually impute the missing valuefor the categorical columns.And so if this video was helpful,please support the channel bysmashing the light button subscribing.If you haven't already.And also hit on the notificationbell so that you will benotified of the next video.And as always the best way to learndata science is to do data scienceand please enjoy the journey.\n"