R Tutorial - What is Scalable Data Processing

**Article: Scalable Data Processing with R**

---

### Welcome to Scalable Data Processing

Welcome to "Scalable Data Processing," where we will explore tools and techniques for handling large datasets that may exceed the capacity of your computer's memory. I'm Michael Caine, an assistant professor at Yale University, and I'm Simon R Banach, a core member and lead inventive scientist at AT&T Labs Research.

In today’s data-driven world, big data has become integral to virtually every field. As a result, you may encounter the need to analyze increasingly large datasets. In this course, we will guide you through methods for cleaning, processing, and analyzing data that might be too large for your computer's memory.

---

### The Need for Scalability

The approaches we will teach are scalable because they can work on discrete blocks of data and are easily parallelized. Scalable code allows you to make better use of available computing resources and enables you to add more resources as needed. This is particularly useful when dealing with datasets that are too large to fit into your computer's memory.
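As a toy illustration of block-wise, parallel work (a minimal sketch: the vector here still fits in memory, and the block size, cluster size, and variable names are arbitrary), the `parallel` package that ships with R can apply a function to each block on separate worker processes:

```r
library(parallel)

# Split one million values into ten blocks of 100,000 indices each.
x <- rnorm(1e6)
blocks <- split(seq_along(x), ceiling(seq_along(x) / 1e5))

cl <- makeCluster(2)                # a small local cluster of 2 workers
clusterExport(cl, "x")              # make x visible to the workers
block_means <- parLapply(cl, blocks, function(idx) mean(x[idx]))
stopCluster(cl)

# Because the blocks are equal-sized, the mean of the block means
# equals the overall mean.
overall_mean <- mean(unlist(block_means))
```

The same split-apply-combine structure carries over when each block is read from disk instead of sliced from an in-memory vector.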

When you create a vector, dataframe, list, or environment in R, the data is stored in RAM (Random Access Memory). Modern personal computers typically have between 8 and 16 gigabytes of RAM. However, datasets can often be much larger than this. At AT&T Labs, for example, we regularly process hundreds of terabytes of data.
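To get a feel for how quickly memory adds up, the built-in `object.size()` function reports the approximate RAM footprint of an R object (a rough illustration; exact figures vary slightly by platform and R version):

```r
# A numeric vector of one million doubles occupies roughly 8 MB of RAM
# (8 bytes per double, plus a small amount of object overhead).
x <- rnorm(1e6)
print(object.size(x), units = "auto")

# Scaling up: a billion doubles would need roughly 8 GB,
# which already approaches the total RAM of many personal computers.
```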

While this course won't involve datasets as large as those at AT&T Labs, we will introduce you to tools and techniques that can handle very large datasets—ones that may exceed your computer's memory but are still within the limits of your hard drive space.

---

### Limitations of Working with Large Datasets in R

According to the R Installation and Administration manual, R is not well suited to working with data larger than 10–20% of your computer's RAM. When your computer runs out of RAM, data that is not being processed is temporarily moved to disk, a process called swapping. If there isn't enough space to swap, the computer may simply crash.

Swapping can significantly increase execution time because disks are much slower than RAM. This bottleneck can make even simple operations take much longer than expected.

---

### Scalable Solutions for Processing Large Data

To address these challenges, scalable solutions involve loading subsets of data into RAM, processing them, and then discarding the raw data while keeping only the results or writing them to disk. This approach is often orders of magnitude faster than relying on swapping and can be used in conjunction with parallel or distributed processing for even faster execution times.
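Here is a minimal sketch of this load-process-discard pattern, assuming a large CSV file named `big_data.csv` with a numeric column `value` (both names are hypothetical); it computes a mean without ever holding more than one chunk in RAM:

```r
chunk_size <- 100000
con <- file("big_data.csv", open = "r")
header <- readLines(con, n = 1)              # read the header line once

total <- 0
n <- 0
repeat {
  lines <- readLines(con, n = chunk_size)    # load a subset into RAM
  if (length(lines) == 0) break
  chunk <- read.csv(text = c(header, lines))
  total <- total + sum(chunk$value)          # keep only the running result
  n <- n + nrow(chunk)
  rm(chunk)                                  # discard the raw chunk
}
close(con)

overall_mean <- total / n
```

This line-by-line chunking is deliberately simple; it assumes, for example, that no quoted fields contain embedded newlines.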

The complexity of each operation also matters. Summaries, tables, and other descriptive statistics are relatively cheap to compute, while tasks like fitting a random forest involve many expensive operations and can dominate execution time. By carefully considering both the read/write operations and the complexity of the tasks you want to perform, you can reduce these bottlenecks and make better use of available resources.

---

### Benchmarking Performance

In our first set of exercises, we will benchmark the performance of read and write operations as well as the complexity of various data processing tasks. For example, using the `microbenchmark` package, we can compare the runtime of two expressions:

1. Creating a vector of 100 random normals.

2. Creating a vector of 10,000 random normals.

The `microbenchmark` function returns a summary of the distribution of run times for these two expressions. In this example, the mean runtime for the second expression (creating a vector of 10,000 random normals) is about 20 times that of the first.
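A minimal version of that benchmark might look like the following (the exact timings will of course depend on your hardware):

```r
library(microbenchmark)

# Compare the run-time distributions of the two expressions;
# each is executed 100 times by default.
microbenchmark(
  rnorm(100),
  rnorm(10000)
)

# The printed summary reports min, lq, mean, median, uq, and max
# for each expression; the second expression's mean is markedly larger.
```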

---

### Your Turn to Practice

Now it’s your turn to apply what you’ve learned in practice. Try benchmarking different operations and exploring how scalability can be implemented in R for handling large datasets. Remember to consider both the size and complexity of your data when choosing the appropriate techniques.

By mastering these scalable approaches, you’ll be better equipped to handle big data challenges and optimize your computing resources effectively.

---

This concludes our introduction to scalable data processing with R. Stay tuned for more exercises and insights into working with large datasets!

"WEBVTTKind: captionsLanguage: enwelcome to scalable data processing in our I'm Michael Caine an assistant professor at Yale University and I'm Simon R Banach and our core member and lead inventive scientist at 18 T labs research with the advent of big data in every field you may need to analyze increasingly large datasets in this class we'll teach you some of the tools and techniques for cleaning processing and analyzing data that may be too large for your computer the approaches we'll show you are scalable in the sense that they are easily paralyzed and work on discrete blocks of data scalable code lets you make better use of available computing resources and allows you to use more resources as they become available when you create a vector data frame list or environment in our it is stored in the computer memory RAM Ram stands for random access memory it is where our keeps variables it creates modern personal computers usually have between 8 and 16 gigabytes of RAM data sets can be much bigger than this though at AT&T Labs we routinely process hundreds of terabytes of data in our in this course we won't process datasets that big but we will show you some of the tools we use to create analyses that can be run on datasets that may be too big for your computer's memory but less than the amount of hard drive space you have available according to our installation and administration manual our is not well-suited for working with data larger than 10 to 20% of computers Ram when your computer runs out of RAM data that is not being processed is moved to the disk until it is needed again this process is called swapping if there is not enough space to swap the computer made simply crash since the disk is much smaller than Ram this can cause the execution time which is the time needed to perform an operation to be much longer than expected the scalable solutions will show here move subsets of data into RAM process them and then discard them keeping the result or writing it to the disk this is often orders of magnitude faster than letting the computer do it and it can be used in conjunction parallel processing or even distributed processing for faster execution times for large data if the computation you want to perform is complex meaning involves many operations each of which takes a long time to perform this can also contribute to execution time summaries tables and other descriptive statistics are much easier to compute compared to tasks like fitting a random forest by carefully considering both the read and write operations and the complexity of the operations you want to perform on the data you have you'll be able to reduce the effect of these bottlenecks and make better use of the resources you have in the first set of exercises we are going to benchmark read and write performance as well as the complexity of a few different operations in our using the micro benchmark package a simple example is shown here we use the micro benchmark function to benchmark the runtime of two different expressions the first one creates a vector of hundred random normals the second expression creates a vector of 10,000 random normals the function returns a summary of the distribution of the run times for the two different expressions the mean runtime for the second expression is about 20 times that of the first your turn to practicewelcome to scalable data processing in our I'm Michael Caine an assistant professor at Yale University and I'm Simon R Banach and our core member and lead inventive scientist at 18 T 
labs research with the advent of big data in every field you may need to analyze increasingly large datasets in this class we'll teach you some of the tools and techniques for cleaning processing and analyzing data that may be too large for your computer the approaches we'll show you are scalable in the sense that they are easily paralyzed and work on discrete blocks of data scalable code lets you make better use of available computing resources and allows you to use more resources as they become available when you create a vector data frame list or environment in our it is stored in the computer memory RAM Ram stands for random access memory it is where our keeps variables it creates modern personal computers usually have between 8 and 16 gigabytes of RAM data sets can be much bigger than this though at AT&T Labs we routinely process hundreds of terabytes of data in our in this course we won't process datasets that big but we will show you some of the tools we use to create analyses that can be run on datasets that may be too big for your computer's memory but less than the amount of hard drive space you have available according to our installation and administration manual our is not well-suited for working with data larger than 10 to 20% of computers Ram when your computer runs out of RAM data that is not being processed is moved to the disk until it is needed again this process is called swapping if there is not enough space to swap the computer made simply crash since the disk is much smaller than Ram this can cause the execution time which is the time needed to perform an operation to be much longer than expected the scalable solutions will show here move subsets of data into RAM process them and then discard them keeping the result or writing it to the disk this is often orders of magnitude faster than letting the computer do it and it can be used in conjunction parallel processing or even distributed processing for faster execution times for large data if the computation you want to perform is complex meaning involves many operations each of which takes a long time to perform this can also contribute to execution time summaries tables and other descriptive statistics are much easier to compute compared to tasks like fitting a random forest by carefully considering both the read and write operations and the complexity of the operations you want to perform on the data you have you'll be able to reduce the effect of these bottlenecks and make better use of the resources you have in the first set of exercises we are going to benchmark read and write performance as well as the complexity of a few different operations in our using the micro benchmark package a simple example is shown here we use the micro benchmark function to benchmark the runtime of two different expressions the first one creates a vector of hundred random normals the second expression creates a vector of 10,000 random normals the function returns a summary of the distribution of the run times for the two different expressions the mean runtime for the second expression is about 20 times that of the first your turn to practice\n"