Blobs to Clips - Efficient End-to-End Video Data Loading - Andrew Ho & Ahmad Sharif, Meta

**Optimizing Data Loading in PyTorch: Techniques and Best Practices**

In this article, based on the PyTorch Conference talk by Andrew Ho and Ahmad Sharif (Meta), we discuss techniques and best practices for optimizing video data loading in PyTorch. We cover switching the video decoder to TorchCodec, decoding offline onto storage colocated with the trainers, multi-threaded data loading, and GPU preprocessing, which together improved end-to-end training throughput on a Kinetics-400 benchmark by roughly 20x.

**Switching from the TorchVision Decoder to TorchCodec**

A common baseline for video training pipelines is the TorchVision video decoder running inside multiprocessing DataLoader workers. Swapping that decoder for TorchCodec gave roughly a 7x increase in throughput on the Kinetics-400 benchmark while remaining bit-accurate with respect to FFmpeg. TorchCodec is faster because it seeks efficiently into the middle of a video, performs the YUV-to-RGB color conversion and tensor conversion quickly, and allocates only the memory it needs.

TorchCodec also seeks accurately. Some decoders perform inaccurate seeks and return a frame from a different timestamp than the one requested, which can silently hurt model quality. TorchCodec is open source, installable with pip on Linux (FFmpeg must be installed separately because of its license), and provides APIs for video metadata extraction, index- and timestamp-based frame access, and efficient batch decoding; GPU decoding, audio decoding, and better samplers are on its roadmap.
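As a reference point, here is a minimal sketch of the decoding API described in the talk. It assumes a local file `my_video.mp4`, and the exact class and method names have shifted between TorchCodec releases (the talk mentions a get-frame-displayed-at accessor, while recent releases expose `VideoDecoder.get_frame_played_at`), so treat it as illustrative and check the documentation for the version you install.

```python
# Illustrative sketch; class/method names may differ by torchcodec version.
# Requires FFmpeg to be installed first, then: pip install torchcodec
from torchcodec.decoders import VideoDecoder

decoder = VideoDecoder("my_video.mp4")   # hypothetical local video path
print(len(decoder))                      # number of frames in the video stream
first_frame = decoder[0]                 # index-based access -> uint8 RGB tensor
# Time-based access: the frame being shown at t = 2.0 seconds.
frame_at_2s = decoder.get_frame_played_at(2.0)
```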

**Offline Decoding and Colocated Storage**

If the training recipe allows it, the most effective way to remove the CPU decode bottleneck is to decode the videos offline, before training starts. In the talk's example, a fleet of hosts decoded the videos ahead of time, sampled eight frames per clip, and stored them as PNGs on disks colocated with the trainers, written in batches so they can be read back sequentially.

This moves the compute-intensive decoding off the training hosts, eliminates cross-region reads, and lowers the variance of data fetches; on the Kinetics-400 benchmark it produced a 17x speedup over the baseline. The trade-offs are extra storage cost and the loss of flexibility in temporal sampling, so it fits best when the sampling is fixed, for example in large-scale offline inference.
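A minimal sketch of what such an offline decode job could look like, assuming TorchCodec for decoding and `torchvision.io.write_png` for writing. The function name, frame-sampling scheme, and output layout are illustrative choices, not the talk's actual pipeline.

```python
import os
import torch
from torchcodec.decoders import VideoDecoder  # decoder API may vary by version
from torchvision.io import write_png

def decode_clip_to_pngs(video_path: str, out_dir: str, num_frames: int = 8) -> None:
    """Sample `num_frames` evenly spaced frames from one clip and write PNGs."""
    os.makedirs(out_dir, exist_ok=True)
    decoder = VideoDecoder(video_path)
    # Evenly spaced frame indices across the clip; the talk samples 8 frames/video.
    indices = torch.linspace(0, len(decoder) - 1, num_frames).round().long()
    for i, idx in enumerate(indices.tolist()):
        frame = decoder[idx]  # uint8 RGB tensor in (C, H, W) layout
        write_png(frame, os.path.join(out_dir, f"frame_{i:02d}.png"))
```

In the talk, the resulting images are written in batches onto disks colocated with the trainers so they can be read back sequentially; how you group and place the files will depend on your storage layer.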

**GPU Preprocessing**

The remaining CPU work after offline decoding is PNG decoding and light TorchVision transforms (cropping to 224x224 and random flips). In the experimental multi-threaded torchdata prototype, batches stay in the trainer process rather than crossing an inter-process queue, so the transforms can be moved onto the otherwise idle GPU: a single background thread copies each batch to the device and runs the TorchVision transforms there.

Preprocessing on the GPU takes advantage of its parallelism for the crops and flips, and it removes a separate host-to-device copy from the training step because the batch is already on the device when the model sees it. In the talk's experiments this configuration gave the highest end-to-end throughput of all, and the measured model time dropped as well.
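Here is a minimal sketch of the idea, assuming the decoded batch arrives as a uint8 tensor on the CPU; the talk runs this in a single background thread inside the trainer process, and the transform list matches the light preprocessing described for the Kinetics-400 example.

```python
import torch
from torchvision.transforms import v2

# The same light transforms as the example, applied after the batch has been
# moved to the GPU so the crops and flips run there.
gpu_transforms = v2.Compose([
    v2.RandomCrop(224),
    v2.RandomHorizontalFlip(p=0.5),
])

def preprocess_on_gpu(batch_uint8: torch.Tensor) -> torch.Tensor:
    # One host-to-device copy per batch; after this the trainer never pays a
    # separate transfer, because the preprocessed batch is already on device.
    batch = batch_uint8.to("cuda", non_blocking=True)
    return gpu_transforms(batch)
```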

**Benchmarks and Experiments**

To validate these techniques, the talk combines micro-benchmarks with end-to-end training runs on 128 A100 GPUs, tracking four metrics: throughput (successfully decoded and transformed frames per second), coverage (the percentage of videos decoded successfully), data time (time spent waiting on the data loader), and model time (everything else in the step, including host-to-device transfers and the forward and backward passes).
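Restating the talk's definitions: for a run with steps $s$ and ranks $r$, where $t^{\text{fetch}}_{r,s}$ is the time rank $r$ spends fetching its batch for step $s$,

$$
\text{throughput} = \frac{\#\,\text{frames decoded and transformed}}{\text{total training time}},
\qquad
\text{data time} = \sum_{s}\,\max_{r}\, t^{\text{fetch}}_{r,s},
\qquad
\text{model time} = \text{total step time} - \text{data time}.
$$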

A key takeaway from the benchmarks is that micro-benchmark wins do not always translate into end-to-end gains: decode performance depends heavily on the encoding format and parameters, sequential decoding is much faster than random access, and improvements only matter if data loading is actually the bottleneck. When it is, even modest data-loading improvements translate into large gains in overall training throughput.

**Best Practices**

There are several best practices to keep in mind when optimizing data loading in PyTorch. First, identify where the bottleneck actually lies before attempting to optimize: measure how much of each training step is spent waiting on data versus running the model, and focus your effort there.

Second, make sure data loading is actually the bottleneck. In many generative workloads the forward and backward passes take long enough to hide the data time entirely, and in that case optimizing the input pipeline is premature. A simple way to check is to time the data fetch and the rest of the step separately, as sketched below.
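A rough way to measure this split in an existing training loop, assuming `dataloader`, `model`, and `optimizer` already exist (all names here are placeholders); the `cuda.synchronize()` call is only there so asynchronous GPU work is attributed to model time rather than hidden.

```python
import time
import torch

data_time = 0.0
model_time = 0.0
fetch_start = time.perf_counter()
for batch in dataloader:                      # placeholder DataLoader
    data_time += time.perf_counter() - fetch_start
    step_start = time.perf_counter()
    loss = model(batch.to("cuda")).mean()     # placeholder forward pass
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.synchronize()                  # count pending GPU work as model time
    model_time += time.perf_counter() - step_start
    fetch_start = time.perf_counter()

print(f"data time: {data_time:.1f}s, model time: {model_time:.1f}s")
```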

Finally, check out TorchCodec, which offers accurate seeking and faster decoding essentially for free: swapping it in for the TorchVision decoder required no other changes to the pipeline and gave a 7x throughput improvement in the talk's benchmark.

**Conclusion**

In conclusion, optimizing video data loading in PyTorch is a combination of techniques: switching the decoder to TorchCodec, decoding offline onto colocated storage where the workload allows it, moving to multi-threaded data loading to avoid inter-process transfers, and preprocessing on the GPU. Together these took the talk's Kinetics-400 example from a heavily data-bound baseline to roughly a 20x improvement in end-to-end training throughput. End-to-end benchmarks and careful measurement of data time versus model time remain essential for validating each step and identifying where to optimize next.

"WEBVTTKind: captionsLanguage: enhi y'all uh we're just goingon to get started with uh our presentation today uh so Ahmad and I are here we're going to talk about um video decoding uh and video data loading so alad and I are part of Pi torch at meta uh we are part of these uh part of the torch data team um as well as the torch codec team so those are both uh those are both repos that you can go and check out and we'll be talking a little bit about today uh we're obviously part of a much larger team there's a lot of folks who went into the making of of what we're going to discuss today so the example we're going to go through uh we were able to improve the end-to-end training throughput for video data loading on a bit of a toy example but still pretty relevant um and uh we were able to improve it by about 20x uh so we work internally with teams at meta as well as with uh folks in open source and help to improve end to end training uh video data loading decoding and prepro um we're going to be discussing a bunch of different techniques today hopefully some of these are useful to you some you've probably already heard of or most likely tried yourself but uh hopefully you'll take away something new from here we're going to discuss things like how we improve the video decoder so there's a new decoder you're going to be able to try out uh in in open source and available through pip we're also going to be talking about some of the EXP experimental work we're doing in torch data to enable some things like um multi-threading uh and uh GPU prepr uh which the current data loader doesn't really support very well so video data loading is pretty diverse uh you could be doing something like a a generative AI where you're trying to generate you know very high resolution video uh you need very high spatial resolutions 4K um and very high temporal resolutions as well uh or you might be doing something like content understanding where you don't need maybe the resolution doesn't need to be very high the temporal resolution doesn't need to be very high uh but things like Randomness uh are extremely important for say an Integrity use case so what is it about video that makes it so difficult uh well the first is that video data sets are often really really large uh they could be multi- pedabytes um they're not going to fit on your host disk and these are going to be in distributed storage uh the next is that there can be a lot of variant in multiple Dimensions with video data sets so they can vary in length uh they can vary in resolution and in coding formats um you could do things like offline pre-pro in order to help with this but that can also often be very expensive for large production scale data sets uh and lastly decoding videos is extremely compute intensive so if you're using the host to decode you will often hit CPU bottlenecks uh in your training um or also very commonly CPU out of memories uh I'm going to be talking about um an example data set here uh which is the kinetics 400 and um this is uh an open source data set with about 240,000 training videos so it's relatively small um and uh the uh the video lengths are usually around 10 seconds so also pretty short uh but it's a good example for us today because um it's open source and it's something youall can try out and also pretty relevant uh we're going to be training a very simple model just a linear classifier uh we'll be doing distributed data parallel uh and we're training on uh 16 hosts each Having Eight Nvidia a100 gpus so 128 gpus total um 
and each host has 96 cores uh to share among those hpus uh and 1.5 1.5 terab B of memory this is and these are all configured in a training cluster with high-s speeed interconnects and the metrics we're going to be going through we're going to be tracking today uh so number one is throughput which we'll measure as frames per second we Define that as uh the number of totally uh total successfully decoded and transformed frames divided by the total training time uh and for today's example for each video we sample eight frames per video uh at uh 224x 224 resolution uh we'll also be tracking a metric that we call coverage so this is the percentage of successfully decoded videos uh and to get slightly more granular we're also going to be looking at uh what we'll call data time today um and if you can see my cursor on the screen yeah uh perfect um so imagine we have these three ranks training uh for one step and rank one takes the longest time to fetch that next batch so that's the maximum over all the different ranks that's what we call the data time for this step um so we'll be reporting the data time which is the sum of all of these maximums across the entire training run and then the last metric is uh what we're calling today the model time so this is uh basically the full step time minus that data time so whatever is left over and uh this is going to include things like uh transfers from host to Dev device uh the model forward and the model backward pass so for our Baseline um was we have this little cartoon over here uh our videos are stored in distributed blob storage um and we use the standard data loader to read a b a local batch size of 32 which gives us 4,000 videos per batch um and because we're sampling 8 frames from there that's uh 32,000 frames per batch we have 16 multiprocess workers so we're using the standard torch utils data loader in order to do multiprocessing to parallelize the data loading um and that's what these little boxes are each process is going to do its own work of reading blobs uh from the cloud storage we use the torch Vision decoder to trans uh to extract the frames um and then we do some light torch Vision transforms so things like cropping to 224 224 maybe some random flips uh finally we send that over the ipcq to the main trainer process and then the trainer does forward and backward uh and we run for about 10 epics but you know we kind of had to change this depending on the the speed uh the the total training time um right and we're using the torch Vision decoder for our Baseline so how did we do here uh we got about 6,000 frames per second uh and the coverage so we actually weren't able to successfully decode all of those it's probably because of uh some encoding format um that wasn't compiled into the the version of FFM pig that we had but uh if you look at the data time and the model time you can see that the training time was extremely dominated by loading data so our gpus were very very hungry uh in order to figure out the Headroom for this particular model so it's a very simple linear model um but uh we just throughout all of the uh online reading and in the trainer process we generate a random batch trans transer that to the device send that over to the trainer and you can see we get about a 40x Improvement in the throughput um and from the data metrics and the model metrics you can see that this is what we want we don't want to be spending any time waiting for data uh and now I'll let Oba talk about how we improve the decoder thank you Andrew um 
**Ahmad Sharif:** Thank you, Andrew. Good afternoon everyone, my name is Ahmad and I work on the TorchCodec video decoding library. As Andrew mentioned, video decoding is quite compute-intensive, so let's dive in to understand what a video decoder actually does. A video is a sequence of images that is highly compressed, and a decoder's job is to take a video and a timestamp from the user, extract the RGB image displayed in that video at that timestamp, and return it as a PyTorch tensor. This involves first seeking into the video at that timestamp, decompressing it (because a video is highly compressed), doing a color conversion (because videos typically come in YUV format and we want RGB out), and then converting that into a tensor. If you do any of these steps wrong, you can get correctness issues or performance issues.

When we saw that TorchVision was slow, we tried a different video decoder, Decord. Decord is more efficient than TorchVision at the seeking part, so it can seek more efficiently into the middle of videos; if you're sampling frames from the middle of videos, which is what we were doing here, Decord is going to be much faster. Decord improves performance over TorchVision by about 4.8x, as shown on the slide, and the coverage also goes up because it uses a different FFmpeg version.

Speaking of seeking into the middle of videos, there's an important gotcha here. Some video decoders out there don't implement seeking correctly; they do what are called inaccurate seeks. If you give them a video and a timestamp, they may return a frame that is at a different timestamp, which is not valuable for your inference or training purposes. To demonstrate this, we generated a synthetic video, just a fractal video in which every frame is different, and we asked two different decoders to decode four frames at four different timestamps. Decoder A on the left returned four different frames, which is what you would expect, but decoder B on the right returned four copies of the same exact frame. This is not what users expect, and it can lead to poor model performance.

When we saw this unintuitive seeking behavior and the performance issues with existing decoders, we decided to create a new custom PyTorch video decoding library that we call TorchCodec. TorchCodec aims to be both accurate and fast: accurate both in terms of seeking and in terms of the actual bits you get out of the color conversion, and fast because it seeks efficiently, does color conversion and tensor conversion quickly, and allocates the minimal amount of memory needed for the process.

Here's some more information about TorchCodec. It's an open-source library available on GitHub, and the release is available on PyPI, so on a Linux machine you can just do pip install torchcodec and start working with it. It has APIs for video metadata extraction, video-to-tensor conversion based on time or frame index, and efficient batch decoding. TorchCodec is still a young library, and we have planned work for GPU decoding, audio decoding, and better samplers.

Here's an example of how to use TorchCodec. First you must install FFmpeg, because it has a separate license, then you can install TorchCodec using pip. Then, in just a few lines of code, you can create a decoder object from a video file on disk; the length of the decoder object tells you how many frames it has, you can decode frames using the square-bracket notation, or you can decode the frame displayed at a particular timestamp using the get_frame_displayed_at function.

Here are some micro-benchmarks that show TorchVision versus TorchCodec decode performance on a single machine. The chart on the left shows decode performance for sequential frames, that is, decoding frames that are adjacent to each other; the chart in the middle shows random-access decoding, that is, seeking plus decoding; and the chart on the right shows decoding a 4K video. (By the way, the video in the first two charts is from the Hugging Face LeRobot repository, which I highly recommend you check out.) There are three takeaways from this slide. One, decode performance can depend highly on the encoding format and the parameters used, for example resolution or H.264 versus H.265. Two, sequential decoding is much faster than random-access decoding, as you can see from the chart on the left having higher numbers than the chart in the middle. And three, TorchCodec is faster than the TorchVision decoder in almost all scenarios.

Lastly, to test TorchCodec on a real-world example, we come back to the workload Andrew mentioned. We added TorchCodec to the K400 benchmark; it has full bit accuracy with respect to FFmpeg, and it is seven times faster than TorchVision. Now I'm going to hand it back to Andrew for the rest of the talk.

**Andrew Ho:** Thanks, Ahmad. At this point we're going to pivot slightly. What you'll notice from the previous work is that we were pretty heavily CPU-bottlenecked, and one of the things you could think about is: hey, if we're bottlenecked by the CPU, can we move that compute off of the host so that we're no longer bottlenecked? That's what we did here. There are a number of ways you could do this, but what we're going to describe today is doing the video decoding offline, before you even start the training, and this turns out to help quite a bit. We've modified our little cartoon on the bottom: we have a bunch of hosts that decode those videos, and for this example we decided to store the frames as PNGs. This isn't always possible; it depends on your particular use case. For training, maybe you really want that randomness, maybe you're training for multiple epochs, or you want the ability to experiment with how you're changing the temporal sampling. But there are a lot of use cases out there; we often see this in large-scale offline inference, where you really just want to crunch through these frames as fast as possible and you don't want the sampling to change.

There's a lot packed into this slide. Something else we did along with that decoding is that we moved the storage away from the distributed cloud store and put it onto disks that are colocated with the trainers. This eliminates any possibility of cross-region reads and lowers the variance in pulling the data to the hosts. The last thing we did is that instead of writing the images pointwise, we write them in batches so that they're stored sequentially; this lets us take advantage of sequential batch reading on the training side as well.
With the results from this we're able to see a 17x speedup over our baseline, and I think this is the first time we see that we're spending less time loading data than we are in the model, which is promising, but the GPUs are still a little bit hungry. Something else interesting you might notice is that the model time has actually gone up, which is really unintuitive and a little bit perplexing: we haven't changed anything about the simple linear model, we're still copying host to device, so why is the model time slower? One of the things we suspected was that with multiprocessing you have to send the data from the worker processes to the trainer process over an IPC queue, and there's some CPU work that happens in the background, kind of invisibly, to make that transfer more efficient so you're not doing a full copy every time. To test this, we hacked into the code a little bit, and just before the batch hits the IPC queue we send an empty dictionary over the queue instead, so we're basically not sending any data, and on the trainer side we use the same dummy batch we used for the headroom analysis. You can see that this brings the model time back down, which is what we wanted to see, but still a little surprising. This is obviously not a useful approach in production; you can't just feed dummy batches, that wouldn't tell you anything interesting.

So the question is: how can we get rid of this IPC? Well, with multi-threading you don't need to send things from one process to another; the batches can stay in the same process, so you eliminate that transfer entirely. But the current torch.utils.data DataLoader does not yet support multi-threading, so this is something we've been thinking heavily about, especially with the no-GIL work coming up in Python 3.13. We've been developing a prototype, and we're planning to land some of these pieces in torchdata; we currently have an RFC out for this in the pytorch/data repo, so I encourage you to check that out. What this looks like is a more granular way to do parallelism: instead of having one dataset that you send to each process, or one dataset that gets parallelized through threads, we allow you to do parallelism at the individual node level of your data loading DAG. In this case we use multi-threaded parallelism to pull data from the cloud, or here the colocated storage, pass that down to a queue, and then have another set of thread workers to parallelize the PNG decoding. This is still happening on the CPU; we're not doing video decoding anymore, this is PNG. We do the online preprocessing, also on the CPU, so the cropping and the flipping, and then we send that to the trainer process. The big question is that we are running this experiment with the global interpreter lock still in place, this is not done with no-GIL yet, so is this actually going to work in terms of improving our throughput?

The answer is no, unfortunately. So we moved away from that table and we're now looking at some plots, because we found that by increasing the batch size we did start to see some wins. Just to walk through this slowly: the yellow line is the multi-threaded experimental code, the blue line is the traditional multiprocessing setup, and on the x-axis we're increasing the local batch size from 32 and 64 up to 512. What you'll notice with the throughput plot on the left is that by increasing the batch size you'd expect to see batching efficiencies from the model forward and backward and from the reading, and you're not able to capture that with multiprocessing; however, with multi-threading you start to see the throughput increase as you increase the batch size. Again, this is going to be very use-case dependent; your model might not be able to run with batch sizes that large. However, where we do see a lot of wins is offline inference, where you maybe don't have a backward pass and you're able to handle these larger batch sizes. The other thing that is probably apparent is that with multiprocessing the memory requirements are a lot higher, so you can't get to those large batch sizes, you hit CPU OOMs, whereas multi-threading is a little more efficient, so you're able to scale up. The plot on the right tracks the model time, and same thing there: on the yellow line you can see the model time dropping with those batching efficiencies, as we'd expect.

The last experiment I'm going to describe is where we thought: hey, why don't we use this batching and this underutilized GPU to do some of the preprocessing, the cropping and the flips, on the GPU instead? What we did is make another cut in our graph, and in a single background thread (so we're not using parallelism for this GPU preprocessing) we send the batch to the GPU and then run our TorchVision transforms as normal, so all of these preprocessing operations now happen on the GPU. In the results you'll see a new line on the plot, marked with a little X: we get the highest end-to-end throughput we've seen so far for this particular model, and you'll notice that the model time has dropped as well, since we no longer need to copy the batch over to the device; it's already there, and that copy is counted as part of the preprocessing time.

Right, we have a few minutes left, so I'll just summarize the discussed techniques. We started with the baseline with the TorchVision decoder. By switching to TorchCodec, which I think you can do today, you should get a 7x throughput increase. If you're able to do offline decoding and can afford the storage, you can move everything to before you start the training, do a lot of that CPU-heavy work up front, and get even bigger increases. By moving to multi-threading we see improved throughput at larger batch sizes, so that may or may not work for your particular scenario. And finally, with GPU preprocessing we're able to see even more.

There's a bunch of stuff we haven't tried. If you've seen NVIDIA's poster, hopefully you got to talk to them last night. The work with no-GIL, getting rid of the global interpreter lock to really unlock multi-threading in Python, is something we're extremely excited about and one of the reasons we're prioritizing
this effort on torchdata. That's something we're planning to run benchmarks and experiments on and hopefully share with you later this year. Some of the other things, from chatting with NVIDIA: there are a couple of things we could explore, such as using the GPU to decode the images. In this presentation we used PNGs, which turned out to be not a great example; if we had stored them as JPEGs, it turns out you can use the A100s to decode those JPEGs as well, so that's a lesson learned. Similarly for video decoding: a lot of GPUs have dedicated video decoding chips, and this is something we're also exploring on the torchcodec side. So if you want to stay in an online world, without paying for that storage or giving up that flexibility of sampling, but your GPUs are just sitting there idle, this is possibly an option for you as well. We're planning to do more experiments on this, so look out for some blog posts from us coming up.

Just a couple of takeaways before we wrap up. Number one, I was blown away when I saw Ahmad's screenshots of "hey, I think I'm sampling four frames and they're different," when actually they're all identical; that's probably not what you want to be doing. The next one is: run end-to-end benchmarks. I know you can't always do that, it's not that affordable, but from some of the micro-benchmarks we've seen today you can tell that straight-line decoding benchmarks don't always translate into online distributed training; you can see some weird things happen there, so this is important to do. The other thing to call out is to make sure you're actually bottlenecked before you start trying to optimize, because a lot of times, especially in some of the generative work, the forward and backward pass can take so long that it eats up any of that data time; if you're not bottlenecked, don't try to optimize too early. The third point is: if you can, move that compute off of the box, because you can see a ton of gains that way by removing your CPU bottlenecks. And lastly, check out torchcodec; you can get improved accuracy and performance, hopefully for free.

So this is where you can find us: github.com/pytorch/torchcodec for the torchcodec project and pytorch/data for the torchdata project. If you're interested in contributing to the future of data loading in torchdata and PyTorch, come check out the RFC on what we're building. You can also find us on Twitter and GitHub, and we're available on the PyTorch Slack as well.