Python Tutorial - School Budgeting with Machine Learning in Python

Hello Data Campers: An Introduction to Our Course

Hello data campers, we're really glad you've decided to join us for this course. We have an exciting journey ahead of us through some real data and some incredibly useful tips and tricks from expert data scientists. I'm Peter Bull, a data scientist and co-founder of DrivenData. Our mission is to bring the power of data science to social impact organizations. One of the ways we do that is by running online data science challenges for nonprofits, NGOs, and social enterprises. In our challenges, a global community of data scientists like you competes to solve a particular problem.

We'll work through one of these competitions as a case study, and we'll show you how the winner achieved the best score. In the course, we'll do some natural language processing, some feature engineering, and boost our computational efficiency. In addition to these pro tips, we'll look at one of the ways in which we can use data to have a social impact.

School Budgets in the United States: A Complex Problem

School budgets in the United States are incredibly complex, and there are no standards for reporting how money is spent. Schools want to be able to measure their performance. For example, are we spending more on our textbooks than neighboring schools, and is that investment worthwhile? However, making this comparison takes hundreds of hours each year, in which analysts hand-categorize each line item.

Our goal is to build a machine learning algorithm that can automate that process. For each line item, we have some text fields that tell us about the expense. For example, a line might say something like "algebra books for eighth grade students." We also have the amount of the expense in dollars. This line item then has a set of labels attached to it. For example, this one might have labels like textbooks, math, and middle school.
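To make the shape of one record concrete, here is a minimal sketch of a line item as a small Python data structure. The class name, the field names, the label column names (`object`, `subject`, `grade_level`), and the dollar amounts are all assumptions made up for illustration. They are not the actual column names or values in the competition data.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LineItem:
    """One budget line: free text, a dollar amount, and (optionally) labels."""
    description: str   # text field describing the expense
    amount: float      # amount of the expense in dollars
    # Maps a label column to its category; empty for unlabeled lines.
    labels: Dict[str, str] = field(default_factory=dict)


# A labeled line, using the example from the text (the amount is invented).
labeled = LineItem(
    description="algebra books for eighth grade students",
    amount=1250.00,
    labels={
        "object": "textbooks",          # hypothetical column name
        "subject": "math",              # hypothetical column name
        "grade_level": "middle school"  # hypothetical column name
    },
)

# An unlabeled line that the algorithm would need to suggest labels for.
unlabeled = LineItem(description="copy paper for front office", amount=89.50)

print(labeled.labels["subject"])  # math
print(unlabeled.labels)           # {}
```

The labeled lines are what the algorithm learns from, and the unlabeled lines are what it is later asked to categorize.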

These labels are our target variable. This is a supervised learning problem where we want to use correctly labeled data to build an algorithm that can suggest labels for unlabeled lines. This is in contrast to an unsupervised learning problem where we don't have labels and we're using an algorithm to automatically understand which line items might go together.

For this problem, we have over 100 unique target variables that could be attached to a single line item. Because we want to predict a category for each line item, this is a classification problem. This is as opposed to a regression problem, where we want to predict a numeric value for a line item (for example, predicting house prices).
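One way to picture how a handful of label columns can add up to many target variables is to give every (column, category) pair its own 0/1 indicator. The sketch below assumes that interpretation. The three columns and their categories are invented for illustration and are much smaller than the real label set.

```python
# Hypothetical label columns and categories (illustrative only).
label_options = {
    "object": ["textbooks", "office supplies", "other"],
    "subject": ["math", "reading", "none"],
    "grade_level": ["elementary", "middle school", "high school"],
}


def target_names(options):
    """List one target name per (column, category) pair."""
    return [f"{col}__{cat}" for col, cats in options.items() for cat in cats]


def encode(labels, options):
    """Turn a line item's labels into a 0/1 vector aligned with target_names."""
    return [
        1 if labels.get(col) == cat else 0
        for col, cats in options.items()
        for cat in cats
    ]


targets = target_names(label_options)
row = encode(
    {"object": "textbooks", "subject": "math", "grade_level": "middle school"},
    label_options,
)

print(len(targets))  # 9 indicator targets from 3 columns x 3 categories
print(row)           # [1, 0, 0, 1, 0, 0, 0, 1, 0]
```

Each indicator is a yes/no category question rather than a numeric quantity, which is what makes this a classification problem rather than a regression problem.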

Determining Categories: A Challenge

There are several actual categories that we need to determine. Is this expense for pre-kindergarten education? That is important because pre-kindergarten has different funding sources. Or is there a particular student type that this expense supports? Overall, there are nine columns with many different possible categories in each column. If you talk to the people who actually do this work, they will tell you it's impossible for a human to label these line items with 100% accuracy.

To take this into account, we don't want our algorithm to just say this line item is for textbooks; we want it to say it's most likely this line is for textbooks and I'm 60% sure that it is. If it's not textbooks, I'm 30% sure it's office supplies. With these suggestions, analysts can prioritize their time. This is called a human-in-the-loop machine learning system.
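The idea of ranked suggestions can be sketched in a few lines of Python. The probabilities below reuse the 60% and 30% figures from the example above, with the remaining 10% assigned to a made-up "other" category. The function names and the 0.8 review threshold are assumptions chosen for illustration, not part of the competition setup.

```python
def top_suggestions(probs, k=2):
    """Return the k most likely labels, highest probability first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


def needs_review(probs, threshold=0.8):
    """Flag a line for an analyst when even the best guess is uncertain."""
    return max(probs.values()) < threshold


# Predicted probabilities for one line item (illustrative values).
probs = {"office supplies": 0.30, "textbooks": 0.60, "other": 0.10}

print(top_suggestions(probs))  # [('textbooks', 0.6), ('office supplies', 0.3)]
print(needs_review(probs))     # True: 0.6 is below the 0.8 threshold
```

An analyst could then spend time on the flagged lines first and accept the confident suggestions with a lighter check.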

Loading the Data: The Next Step

We will predict a probability between zero and one for each label: zero means the algorithm thinks this label is very unlikely for the line item, and one means the algorithm thinks this label is very likely. We'll take a quick break to review and then come back and talk about how to load the data.