Hello and Welcome to the Course on Parallel Computing
Hello and welcome to the course on parallel computing in R. My name is Hana Ševčíková, and I'm a senior research scientist at the University of Washington. In this course, we assume that you are familiar with the concepts covered in the Writing Efficient R Code course here on DataCamp. That is, you should already have optimized your sequential code, know how to benchmark it, and be ready to break it into multiple pieces that can run in parallel, in a load-balanced and reproducible manner.
We start by showing you different ways of splitting problems into pieces and what methods and R packages are available for running your code in parallel. In Chapter 2, we'll go into more detail about the core package called parallel. Chapter 3 covers two user-contributed packages that make parallel programming even easier: foreach and future.apply. The course concludes with an important and non-trivial subject: using random numbers in a parallel environment, and reproducibility. In the last chapter, we will also put all the concepts together in the form of an example.
Splitting Problems into Pieces
When it comes to splitting problems into pieces, there are two main approaches: partitioning by task and partitioning by data. Partitioning by task means that different tasks can be performed independently, like building a house where many tasks can be done in parallel, such as plumbing, electrical work, and installing windows.
For example, in the modeling world, demographic models of births, deaths, and migration can be computed independently, thus providing inputs to a population model in parallel. Partitioning by data means that the same task is performed on different chunks of data. In the house example, all windows could be installed in parallel; in a computing task, you could sum each row of a matrix in parallel, where the same task (the sum) is applied to different data (the rows).
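The row-sum example can be sketched with the core parallel package, which Chapter 2 covers in detail. This is a minimal illustration, not the course's own code; `parApply` distributes the rows across worker processes:

```r
library(parallel)

m <- matrix(1:20, nrow = 4)        # 4 rows, 5 columns

cl <- makeCluster(2)               # start 2 worker processes
# same task (sum) applied to different data (each row), in parallel
row_sums <- parApply(cl, m, 1, sum)
stopCluster(cl)

row_sums                           # 45 50 55 60
```

Each worker receives a subset of the rows, so no communication between workers is needed while the sums are computed.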
In summary: if you partition a problem by task, different tasks are applied to the same or different data; if you partition by data, the same task is performed on different data. The latter approach is much more common and will therefore be the focus of this course.
A Simple Example: Partitioning by Data
Let's take a simple example of partitioning by data. If you have a sequence of operations like 1 + 2 + 3, and so on until 100, you can split it into multiple sums where each partial sum operates on a subset of the whole sequence and is independent of the others.
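As a small sketch of this idea in R, the sequence 1 to 100 can be split into four chunks whose partial sums are computed independently (here sequentially with `sapply`, but each sum could run on a different worker):

```r
x <- 1:100
# split the sequence into four independent chunks of 25 numbers each
chunks <- split(x, rep(1:4, each = 25))

partial <- sapply(chunks, sum)   # each partial sum is independent of the others
partial                          # 325 950 1575 2200

total <- sum(partial)            # combine the partial results
total                            # 5050
```

Because no chunk needs information from any other chunk, the partial sums can be evaluated in any order, or all at once.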
Applications with a large number of such independent tasks that have low or no communication needs are often called embarrassingly parallel, meaning it's embarrassing how easy they are to parallelize. Many statistical simulations belong to this category. They usually have the following structure: first, a random number generator is initialized; then a loop is constructed in which the same function is called, usually on different data.
Often in such applications, the data is generated as draws from some probability distribution, or it may simply be a subset of a bigger data set. After collecting the results from each iteration, the results are processed, for example written to disk, or a visualization is created.
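That structure can be written as a short sketch. Here `myfunc` is a placeholder for whatever task the simulation performs (we use `mean` purely for illustration):

```r
set.seed(42)                       # initialize the random number generator

myfunc <- function(x) mean(x)      # placeholder for the task of interest

n_iter <- 10
results <- vector("list", n_iter)
for (i in 1:n_iter) {
  data <- rnorm(100)               # draws from a probability distribution
  results[[i]] <- myfunc(data)     # same function, different data
}

# process the collected results, e.g. write to disk or visualize
summary(unlist(results))
```

Because each iteration depends only on its own data, the loop body is exactly the kind of unit that can be farmed out to parallel workers.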