PySpark Tutorial - Intro to data cleaning with Apache Spark

Welcome to Data Cleaning in Apache Spark with Python

My name is Mike Metzger, and I'm a data engineering consultant. I'll be your instructor for this course, where we will cover what data cleaning is, why it's important, and how to implement it with Spark and Python. Let's get started.

Data Cleaning: A Necessary Part of Any Production Data System

In this course, we will define data cleaning as preparing raw data for use in processing pipelines. We'll discuss what a pipeline is later on, but for now, it's sufficient to say that data cleaning is a necessary part of any production data system. If your data isn't clean, it's not trustworthy and could cause problems later on.

There are many tasks that could fall under the data cleaning umbrella. A few of these include reformatting or replacing text, performing calculations based on the data, and removing garbage or incomplete data. Most data cleaning systems have two big problems: optimizing performance and organizing the flow of data. A typical programming language such as Perl, C++, or even standard SQL may be able to clean data when you have small quantities of it.

However, consider what happens when you have millions or even billions of pieces of data. Those languages wouldn't be able to process that amount of information in a timely manner. Spark lets you scale your data processing capacity as your requirements evolve. Beyond the performance issues, dealing with large quantities of data requires a process, or pipeline, of steps.

Spark allows management of many complex tasks within a single framework. Here's an example of cleaning a small data set. We're given a table of names, ages in years, and a city. Our requirements are for a DataFrame with first and last name in separate columns, the age in months, and which state the city is in. We also want to remove any rows where the data is out of the ordinary.

Using Spark transformations, we can create a DataFrame with these properties and continue processing afterwards.

A primary function of data cleaning is to verify that all data is in the expected format. Spark provides a built-in ability to validate data sets with schemas. You may have used schemas before with databases or XML; Spark's are similar. A schema defines and validates the number and types of columns for a given DataFrame.

A schema can contain many different types of data: integers, floats, dates, strings, and even arrays or mapping structures. A defined schema allows Spark to filter out data that doesn't conform during read, ensuring expected correctness. In addition, schemas also have performance benefits. Normally, a data import will try to infer a schema on read, which requires reading the data twice. Defining a schema limits this to a single read operation.

Here is an example schema to import the data from our previous example. First, we'll import the pyspark.sql.types library. Next, we'll define the actual StructType: a list of StructFields containing an entry for each field in the data. Each StructField consists of a field name, a data type, and whether the data can be null.

Once our schema is defined, we can add it to our spark.read.format().load() call and process it against our data. The load method takes two arguments: the file name and a schema. This is where we apply our schema to the data being loaded.

We've gone over a lot of information regarding data cleaning and the importance of DataFrame schemas.