PySpark Tutorial - Immutability and Lazy Processing

Welcome Back: Exploring Immutability and Lazy Processing in Spark

In our previous discussion, we covered the fundamentals of data cleaning, data types, and schemas. Now, let's dive deeper into some advanced concepts that are essential for working with Spark. Today, we're going to explore two crucial aspects of Spark: immutability and lazy processing.

Immutability in Spark
---------------------

In most programming languages, variables are fully mutable. This means that their values can be changed at any time, as long as the variable is in scope. While this flexibility is beneficial, it presents problems whenever multiple concurrent components try to modify the same data. To address these issues, many languages employ constructs like mutexes and semaphores to ensure thread safety, but these add complexity, especially in non-trivial programs.

However, Spark takes a different approach. Unlike typical Python variables, Spark DataFrames are immutable. While not strictly required, immutability is often a component of functional programming, and Spark is designed to use immutable objects. In practice, this means a DataFrame is defined once and cannot be modified after initialization. If the variable name is reused, the original data is removed (assuming it isn't used elsewhere) and the variable name is reassigned to the new data.

While this might seem inefficient at first glance, immutability actually allows Spark to share data between all cluster components without worrying about concurrent modifications. This design enables more efficient processing and reduces the risk of errors caused by components changing data out from under one another.

Practical Example: Creating a DataFrame
---------------------------------------

To illustrate the concept of immutability in Spark, let's consider an example. We create a DataFrame from a CSV file called "voter_data.csv" and assign it to the variable named `voter_df`. Once created, we want to perform two further operations on this data.
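Here's a minimal sketch of that step (assuming a standard `SparkSession` named `spark` and that the CSV has a header row; neither detail is specified above):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession as the entry point to Spark.
spark = SparkSession.builder.appName("immutability_demo").getOrCreate()

# Define a DataFrame from the CSV file. This only records the
# definition; thanks to lazy processing, no data is read yet.
voter_df = spark.read.csv("voter_data.csv", header=True)
```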

The first operation is to create a full-year column by taking the two-digit year present in the dataset and adding 2000 to each entry. This transformation does not actually change the DataFrame; instead, it copies the original definition, adds the transformation, and assigns the result to the `voter_df` variable name. Our second operation is similar: dropping the original year column from the DataFrame.

In both cases, the original objects are destroyed, and new DataFrames are returned under the same variable name. The original year column is now permanently gone from this instance, though not from the underlying data; we could simply reload it into a new DataFrame if desired.
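Put together, the two transformations might look like the sketch below. The column names `year` and `full_year` are assumptions made for illustration; the actual names depend on the CSV's schema. Note that each call returns a brand-new DataFrame, which we reassign to the same variable name:

```python
from pyspark.sql import functions as F

# Add a full-year column by adding 2000 to the two-digit year.
# CSV columns default to strings, so cast to int first (an
# assumption about the data's format).
voter_df = voter_df.withColumn("full_year", F.col("year").cast("int") + 2000)

# Drop the original two-digit year column. Again, a new DataFrame is
# returned and bound to the same name; the old object is discarded.
voter_df = voter_df.drop("year")
```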

The Role of Lazy Processing
---------------------------

Lazy processing is another crucial concept in Spark that enables efficient performance on large datasets. In essence, lazy processing is the idea that very little actually happens until an action is performed.

In our previous example, we read a CSV file, added a new column, and deleted another, yet no data was actually read, added, or modified. We only updated the instructions (known as transformations) for what we wanted Spark to do. This functionality allows Spark to perform the most efficient set of operations to get the desired result.
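You can verify that only instructions exist at this point: calling `explain()` on the DataFrame prints the plan Spark has built up so far without executing any of it (a small aside, not part of the original example):

```python
# Print the execution plan Spark has recorded so far.
# This inspects the queued transformations without running them.
voter_df.explain()
```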

The Added count() Call: An Action in Spark
------------------------------------------

In our example, when we add a call to the `count()` method, it classifies as an action in Spark. This means that Spark will now process all of the queued transformation operations. The added `count()` call is a perfect illustration of how lazy processing works in Spark.
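Continuing the sketch, appending the `count()` call is what finally forces Spark to do the work:

```python
# count() is an action: Spark now reads the CSV, applies the
# withColumn and drop transformations, and returns the row count.
num_rows = voter_df.count()
print(num_rows)
```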

By understanding and practicing these concepts, you'll be better equipped to work with Spark efficiently and effectively. Remember, immutability and lazy processing are two fundamental aspects of Spark's design that enable it to perform well on large datasets.

"WEBVTTKind: captionsLanguage: enwelcome back we've had a quick discussion about data cleaning data types and schemas let's move on to some further support concepts immutability and lazy processing normally in Python and most other languages variables are fully mutable the values can be changed at any given time assuming the scope with the variable is valid all very flexible this does present problems anytime there are multiple concurrent components trying to modify the same data most languages work around these issues using constructs like mutexes and semaphores etc this can add complexity especially with non-trivial programs unlike typical Python variables SPARC data frames are immutable while not strictly required immutability is often a component of functional programming we won't go into everything that implies here but understand that spark is designed to use immutable objects practically this means spark data frames are defined ones and are not modifiable after initialization if the variable name is reused the original data is removed assuming it's non used elsewhere and the variable name is reassigned to the new data while this seems inefficient and actually allows spark to share data between all cluster components it can do so without worry about concurrent data objects this is a quick example of the immutability of data frames and spark it's okay if you don't understand the actual code this example is more about the concepts of what happens first we create a data frame from a CSV file called voter data CSV this creates a new data frame definition and assigns it to the variable named voter underscore DF once created we want to do to further their operations the first is to create a full-year column by using a two-digit year present in the dataset and adding mm to each entry this does not actually change the data frame at all it copies the original definition that's the transformation and assigns it to the voter underscore DF variable name our second operation is similar now we want to drop the original year column from the data frame again this copies the definition add the transformation and reassigns the variable name to this new object the original objects are destroyed please note that the original year column is now permanently gone from this instance they're not from the underlying data for example you could simply reload it into a new data frame if desired you may be wondering how spark does this so quickly especially on large data sets spark can do this because of something called lazy processing lazy processing and spark is the idea that very little actually happens until an action is performed in our previous example we read a CSV file added in a new column and deleted another the trick is that no data was actually read added or modified they only updated the instructions aka transformations for what we wanted spark to do this functionality allows spark to perform the most efficient set of operations to get the desired result the current example is the same as the previous slide but was the added count method call this classifies as an action in spark and will process all these transformation operations these concepts can be a little tricky to grasp without some examples let's practice these ideaswelcome back we've had a quick discussion about data cleaning data types and schemas let's move on to some further support concepts immutability and lazy processing normally in Python and most other languages variables are fully mutable the values can be changed at any given 
time assuming the scope with the variable is valid all very flexible this does present problems anytime there are multiple concurrent components trying to modify the same data most languages work around these issues using constructs like mutexes and semaphores etc this can add complexity especially with non-trivial programs unlike typical Python variables SPARC data frames are immutable while not strictly required immutability is often a component of functional programming we won't go into everything that implies here but understand that spark is designed to use immutable objects practically this means spark data frames are defined ones and are not modifiable after initialization if the variable name is reused the original data is removed assuming it's non used elsewhere and the variable name is reassigned to the new data while this seems inefficient and actually allows spark to share data between all cluster components it can do so without worry about concurrent data objects this is a quick example of the immutability of data frames and spark it's okay if you don't understand the actual code this example is more about the concepts of what happens first we create a data frame from a CSV file called voter data CSV this creates a new data frame definition and assigns it to the variable named voter underscore DF once created we want to do to further their operations the first is to create a full-year column by using a two-digit year present in the dataset and adding mm to each entry this does not actually change the data frame at all it copies the original definition that's the transformation and assigns it to the voter underscore DF variable name our second operation is similar now we want to drop the original year column from the data frame again this copies the definition add the transformation and reassigns the variable name to this new object the original objects are destroyed please note that the original year column is now permanently gone from this instance they're not from the underlying data for example you could simply reload it into a new data frame if desired you may be wondering how spark does this so quickly especially on large data sets spark can do this because of something called lazy processing lazy processing and spark is the idea that very little actually happens until an action is performed in our previous example we read a CSV file added in a new column and deleted another the trick is that no data was actually read added or modified they only updated the instructions aka transformations for what we wanted spark to do this functionality allows spark to perform the most efficient set of operations to get the desired result the current example is the same as the previous slide but was the added count method call this classifies as an action in spark and will process all these transformation operations these concepts can be a little tricky to grasp without some examples let's practice these ideas\n"