Python Tutorial - Stanford Open Policing Project dataset

Welcome to Data Analysis with Pandas: Preparing for Your Career

My name is Kevin Markham and I'll be your instructor for this course. I'm a data scientist and the founder of Data School. In this course, you'll practice what you've already learned about pandas, a powerful Python library for data manipulation and analysis, by answering interesting questions about a real dataset. You'll gain valuable experience analyzing a dataset from start to finish, which will help prepare you for your career in data science.

The Dataset: Traffic Stops by Police Officers

You'll be working with a dataset of traffic stops by police officers that was collected by the Stanford Open Policing Project. The project has collected data from 31 US states, but in this course we'll focus on data from the state of Rhode Island to keep the dataset to a manageable size. Some columns and rows have been removed for simplicity, but you can download the full dataset for any of the 31 states from the project's website.

Preparation is Key

Before beginning an analysis, it's critical that you first examine the data to make sure that you understand it, and then clean the data to make working with it more efficient. We'll start by importing pandas as pd. We then use the read_csv function to read in the dataset from a file and store it in a DataFrame called ri, which stands for Rhode Island.
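As a minimal sketch of this loading step, the snippet below reads a tiny in-memory CSV instead of the course file. The file contents and the column names (stop_date, stop_time, county_name, driver_gender) are hypothetical stand-ins, since the text does not give the real file name or its full column list.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the course's CSV file.
csv_text = """stop_date,stop_time,county_name,driver_gender
2005-01-04,12:55,,M
2005-01-23,23:15,,M
2005-02-17,04:15,,F
"""

# read_csv accepts a file path or any file-like object.
ri = pd.read_csv(io.StringIO(csv_text))

print(ri.shape)
```

With a real file on disk, the same call would take the path to that file in place of the StringIO object.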

Taking a Quick Glance

We'll use the head method to take a quick glance at the DataFrame. Although there are many more columns than can fit on the screen, each row represents a single traffic stop. You'll notice that the county_name column contains NaN values, which indicate missing values. These are often values that were not collected during the data gathering process or that are irrelevant for that particular row.
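A short sketch of the head method follows, using a small hand-built DataFrame. The rows and column names are made up for illustration and are not the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; county_name is entirely missing, as described in the text.
ri = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23", "2005-02-17",
                  "2005-02-20", "2005-02-24", "2005-03-14"],
    "county_name": [np.nan] * 6,
    "driver_gender": ["M", "M", "M", "M", "F", "F"],
})

# head returns the first 5 rows by default; pass a number to change that.
first_rows = ri.head()
print(first_rows)
```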

Locating Missing Values

It's essential to locate missing values so that you can proactively decide how to handle them. You may recall that the isnull method generates a DataFrame of True and False values, where True indicates that an element is missing and False indicates that it is not. One useful trick is to take the sum of this DataFrame, which outputs a count of the number of missing values in each column.
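The counting trick can be sketched as follows, assuming a small hypothetical DataFrame in place of the Rhode Island data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one fully missing column, one partially missing column.
ri = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23", "2005-02-17", "2005-02-20"],
    "county_name": [np.nan, np.nan, np.nan, np.nan],
    "driver_gender": ["M", np.nan, "F", "M"],
})

# isnull gives a DataFrame of booleans with the same shape as ri.
is_missing = ri.isnull()

# Summing the boolean DataFrame counts the missing values per column.
missing_counts = is_missing.sum()
print(missing_counts)
```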

How Does it Work?

The sum method calculates the sum of each column by default, treating True values as ones and False values as zeros. Let's compare these missing value counts to the DataFrame's shape. You'll notice that the county_name column contains as many missing values as there are rows, meaning that it contains only missing values and therefore no useful information.
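To make the comparison with the shape concrete, here is a small sketch on hypothetical data. It first shows that booleans sum as integers, then finds the columns whose missing count equals the number of rows.

```python
import numpy as np
import pandas as pd

# Booleans sum as integers: True counts as 1 and False as 0.
flags = pd.Series([True, False, True])
print(flags.sum())  # 2

# Hypothetical data with one column that is entirely missing.
ri = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23", "2005-02-17"],
    "county_name": [np.nan, np.nan, np.nan],
    "driver_gender": ["M", np.nan, "F"],
})

missing_counts = ri.isnull().sum()

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = ri.shape

# Columns whose missing count equals the row count contain no data at all.
all_missing = missing_counts[missing_counts == n_rows].index.tolist()
print(all_missing)  # ['county_name']
```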

Dropping Columns

The county_name column can be dropped using the drop method. Besides specifying the column name, you need to specify that you're dropping from the columns axis and that you want the operation to occur in place, which avoids an assignment statement. Finally, let's take a look at one more method related to missing values.
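A minimal sketch of dropping the empty column, again on a hypothetical DataFrame with an assumed county_name column:

```python
import numpy as np
import pandas as pd

# Hypothetical data; county_name holds only missing values.
ri = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23"],
    "stop_time": ["12:55", "23:15"],
    "county_name": [np.nan, np.nan],
})

# axis="columns" says the label refers to a column, not a row.
# inplace=True modifies ri directly, so no assignment is needed.
ri.drop("county_name", axis="columns", inplace=True)

print(ri.columns.tolist())  # ['stop_date', 'stop_time']
```

Without inplace=True, drop returns a new DataFrame, so the equivalent form would be ri = ri.drop("county_name", axis="columns").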

Dropping Rows

The dropna method is a great way to drop rows based on the presence of missing values in that row. For example, let's pretend that the stop_date and stop_time columns are critical to our analysis, and thus a row is useless to us without that data. By specifying a subset, we can tell pandas to drop all rows that have a missing value in either the stop_date or stop_time column. The dropna method then takes only these two columns into account when deciding which rows to drop.
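The subset behavior can be sketched like this. The data and column names are hypothetical; note that the last row has a missing driver_gender but is kept, because that column is not part of the subset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values scattered across columns.
ri = pd.DataFrame({
    "stop_date": ["2005-01-04", np.nan, "2005-02-17", "2005-02-20"],
    "stop_time": ["12:55", "23:15", np.nan, "17:15"],
    "driver_gender": ["M", "F", "F", np.nan],
})

# Drop rows that are missing stop_date or stop_time; other columns are ignored.
ri.dropna(subset=["stop_date", "stop_time"], inplace=True)

print(ri)
```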

Your Turn to Practice

Now it's your turn to practice using these functions to examine and clean this dataset. Remember, preparation is key to successful data analysis, and with pandas, you'll be well-equipped to handle even the most complex datasets. Stay tuned for further instructions and guidance as we continue through this course together.

Transcript

Hi, my name is Kevin Markham and I'll be your instructor for this course. I'm a data scientist and the founder of Data School. In this course, you'll be practicing a lot of what you've already learned about pandas to answer interesting questions about a real dataset. You'll gain valuable experience analyzing a dataset from start to finish, which will help to prepare you for your data science career.

Let's start by introducing the data. You'll be working with a dataset of traffic stops by police officers that was collected by the Stanford Open Policing Project. They've collected data from 31 US states, but in this course you'll be focusing on data from the state of Rhode Island for size reasons. Some of the columns and rows have been removed, but you can download the full dataset for any of the 31 states from the project's website.

This first chapter is about preparing the data for analysis. Before beginning an analysis, it's critical that you first examine the data to make sure that you understand it, and then clean the data to make working with it a more efficient process. As always, we'll start by importing pandas as pd. We use the read_csv function to read in the dataset from a file and then store it in a DataFrame called ri, which stands for Rhode Island.

We'll use the head method in order to take a quick glance at the DataFrame. Though there are many more columns than can fit on this screen, each row represents a single traffic stop. You'll notice that the county name column contains NaN values, which indicate missing values. These are often values that were not collected during the data gathering process or are irrelevant for that particular row.

It's important that you locate missing values so that you can proactively decide how to handle them. You may recall that the isnull method generates a DataFrame of True and False values: True if the element is missing and False if it's not. One useful trick is to take the sum of this DataFrame, which outputs a count of the number of missing values in each column. How does that calculation work? Well, the sum method calculates the sum of each column by default, and True values are treated as ones while False values are treated as zeros.

Let's compare these missing value counts to the DataFrame's shape. You'll notice that the county name column contains as many missing values as there are rows, meaning that it only contains missing values. Since it contains no useful information, this column can be dropped using the drop method. Besides specifying the column name, you need to specify that you're dropping from the columns axis and that you want the operation to occur in place, which avoids an assignment statement.

Finally, let's take a look at one more method related to missing values. The dropna method is a great way to drop rows based on the presence of missing values in that row. For example, let's pretend that the stop date and stop time columns are critical to our analysis, and thus a row is useless to us without that data. We can tell pandas to drop all rows that have a missing value in either the stop date or stop time column. Because we specified a subset, the dropna method only takes these two columns into account when deciding which rows to drop.

Now it's your turn to practice using these functions to examine and clean this dataset.