Lightning Talk - Enhancements Made to MPS Backend in PyTorch for Applications Running... - Kulin Seth

**Sample Network with PyTorch**

A simple sample network was implemented using PyTorch: a sequential model composed of linear layers and soft shrink activation functions. This is a strawman example, since soft shrink can be implemented directly as an op in PyTorch, but it serves to show how we could profile the operation if it were not available as a backend op.
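As a rough illustration, a sequential model of this kind might look like the sketch below; the layer sizes and batch size are assumptions for illustration, not taken from the talk:

```python
import torch
import torch.nn as nn

# Sequential model of linear layers and soft shrink activations,
# as described in the talk; the sizes below are illustrative.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.Softshrink(),
    nn.Linear(256, 256),
    nn.Softshrink(),
    nn.Linear(256, 128),
).to("mps")

x = torch.randn(64, 128, device="mps")
y = model(x)
```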

**Profiling with MPS Profiler**

To profile the network, we can use the start and stop APIs enabled in the MPS profiler. There are different modes for capturing the OS signpost information, and the resulting system trace lets us visualize the OS signpost information alongside the PyTorch signpost information.
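A minimal sketch of wrapping the forward pass with the profiler's start and stop APIs, reusing the model from the sketch above; the `mode` string assumes "interval" is one of the supported capture modes:

```python
import torch

# Capture OS signpost information around the region of interest.
torch.mps.profiler.start(mode="interval", wait_until_completed=False)

x = torch.randn(64, 128, device="mps")
y = model(x)
torch.mps.synchronize()  # ensure GPU work has finished before stopping

torch.mps.profiler.stop()
```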

By enabling these APIs, we get a system trace containing all the OS signpost information, which can be visualized using a tool called Metal System Trace. The timeline shows the blit calls, the fallback CPU kernels, and the operations that were actually executed on MPS, which lets us start introspecting the network and see where the application is spending most of its time. In this capture, the soft shrink operation was falling back to the CPU, which is very slow and leaves a large gap in the GPU timeline. For a quick per-layer look at information such as data types, shapes, and GPU time, there is also a command-line facility enabled through an environment variable.

**Custom Kernel Support**

To improve performance, we can add custom kernel support. This involves three steps: implementing the operation in Objective-C++ and Metal, creating the Python bindings and building the extension, and finally importing and using the operation in our application.

First, we import the torch extension header, which includes all the PyTorch bits needed to write the C++ extension. Then we use APIs such as the get command buffer MPS backend API to get a reference to the MPS stream command buffer. This is the same command buffer the backend itself encodes work onto, so the custom work is a first-class citizen and can take advantage of optimizations like commit-and-continue to reduce CPU-side overhead. There is also a get dispatch queue API to get a reference to the serial dispatch queue.

Next, we create a compute encoder from that command buffer and define our custom GPU kernel on it. The kernel is encoded inside the dispatch queue so that submissions from multiple threads are serialized. After all work is encoded, we use the synchronize API to wait until the command buffer is done, or, if serialization is not needed, the commit API, which internally allows commit-and-continue.

**Python Bindings with pybind11**

In the second step, we use pybind11 to bind the Objective-C functions into Python in a similar manner. Then, using the cpp_extension package, we build the custom soft shrink library so it can be loaded from our application.
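As a sketch of the build step from the Python side, assuming the Objective-C++/Metal kernel lives in a file named `CustomSoftshrink.mm` (the file and module names are illustrative):

```python
from torch.utils import cpp_extension

# Build the (hypothetical) Objective-C++/Metal source into a loadable
# extension module that exposes the bound soft shrink function.
compiled_lib = cpp_extension.load(
    name="custom_softshrink",
    sources=["CustomSoftshrink.mm"],  # Objective-C++ file containing the kernel
    extra_cflags=["-std=c++17"],
)
```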

After completing these steps, the custom-built library is ready to be used in our application, and the soft shrink that was previously slow and falling back to the CPU has been replaced with our customized MPS soft shrink library.
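A hedged sketch of how the compiled library might be dropped into the model, assuming the extension exposes a bound function named `mps_softshrink` (a hypothetical name):

```python
import torch.nn as nn

class MPSSoftshrink(nn.Module):
    """Thin wrapper so the compiled kernel can slot into nn.Sequential."""
    def __init__(self, lambd: float = 0.5):
        super().__init__()
        self.lambd = lambd

    def forward(self, x):
        # 'mps_softshrink' is a hypothetical name for the function bound
        # by the extension built above.
        return compiled_lib.mps_softshrink(x, self.lambd)

model = nn.Sequential(
    nn.Linear(128, 256),
    MPSSoftshrink(),
    nn.Linear(256, 128),
).to("mps")
```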

**Performance Results**

With the newly added custom kernel support, the model runs much more efficiently. All copies and intermediate tensors created by the fallback to the CPU are gone, and the sequential model runs much faster.

The performance results show that the MPS backend is up to 5x faster compared to our initial release across various benchmarks from TorchBench.

**Additional APIs**

There are also additional APIs on events, such as record, wait, and elapsed time, for event management and custom timing operations, as well as APIs on the MPS allocator, such as set per process memory fraction, which give developers much finer-grained control over the backend's memory operations.
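A rough sketch of how these APIs might be used from Python; the exact location of the event class (shown here as `torch.mps.event.Event`) and its `enable_timing` argument are assumptions that may vary across PyTorch versions:

```python
import torch

# Limit the MPS allocator to a fraction of the recommended working-set size
# (0.5 is just an illustrative value).
torch.mps.set_per_process_memory_fraction(0.5)

# Time a region of GPU work with MPS events.
start = torch.mps.event.Event(enable_timing=True)
end = torch.mps.event.Event(enable_timing=True)

start.record()
x = torch.randn(1024, 1024, device="mps")
y = x @ x
end.record()

torch.mps.synchronize()
print("elapsed (ms):", start.elapsed_time(end))
```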

Finally, to wrap up this talk, the MPS backend has been significantly optimized with PyTorch 2.0 and macOS 14.

"WEBVTTKind: captionsLanguage: enhi um my name is kinet uh I work at Apple in MPS team and today I'll discuss enhancements made to the MPS uh backend in pych so let's get started um so I'll go over the qualification of MPS backin uh to the beta stage um new features we have added such as the profiler support custom kernel and developer apis to the MPS backend and we will close out with some of the performance improvements made to the MPS backend so since are released last year now starting with the beta stage just to recap um MPS back and started its Journey with pyo 1.12 last year when we introduced the GPU accelerated pyo on Mac platforms since then multiple improvements have been made for optimizing the memory usage and VI tensors this year with py 2.0 the MPS back in has been qualified for the beta stage now what that means is we have support for top 60 most used operators and more um the testing coverage has greatly improved and the network coverage has expanded as multiple popular models adopted the backend as a default backend on Mac OS but these were not all the improvements we made there is support for new features available in the latest pyd builds and we are constantly making progress on it such just profiling support custom kernel and some of the developer apis which I'll go in more detail later and furthermore the developers not only adopted the pyo MPS backend in their um external networks but also have contributed code for many new operators um into our code base such as histogram group Norm sign bit pixel Shuffle and many more now let's discuss some of the new features added to the MPS backend starting with the profile support which was enabled using OS signpost which is a feature in R it highlights operations which are executed on MPS backend Blitz between CPU and GPU and some of the operations which fall back to CPU to use profiler your application um we have a simple example which I'll go over and some of the API needed to enable that support um it has been integrated to the metal system trace and there's an uh command line tool which is also available for developers to take a look at now let's look at a simple U sample Network which uses a sequential model composed of linear layers and softwaring activation functions now this is a stman example um you can implement it directly as an op in py but I'm going to use this as a way how we can profile it if it was is not available as a backend op what we can do is that we can use the start and stop apis which we enabled in MPS profiler and have different modes to capture the signpost information as shown below so what happens as a result of this you you get a system Trace which uses all the OS signpost information after enabling those API and it can be visualized using a tool called metal system trace it has all a lot of other information along with the fact that whatever the sign post which we had enabled as part of the pytorch is also represented along with the other things in the timeline so here it highlights like the blit calls shown in the top you have fallback CPU kernels so the things which didn't actually got accelerated on MPS and operations which were actually executed on MPS and it allows you to actually start introspecting our Network now as you can see here the soft shrink at this point when we captured it was falling back to CPU which which is very slow just aside for developers who would like to quickly look at which operations the application is spending most time in we also have a facility of command line 
tool as shown using an environment variable you can dump out information about each layer such as the data types shapes GPU time and it allows you to introspect their application quickly now continuing with our example from before we saw that the soft shrink was falling back to CPU which leaves a big gap in the GPU timeline now to improve the performance we'll go over some of the custom kernel support we added which is one of the ways to do this now there are three steps to write a custom operation you first implement the operation in object to see in metal you create the python bindings and build your extension and finally after your extension is built you can import import that operation into your application and begin using it so let's start with the operation implementation there's a lot of code but I'll go over from start from the top so you start with the importing the torch extension header which includes all the pyos bits to write the C++ extension now after that there are apis which we have exposed to enable the custom functionality so you have this get command buffer MPS backend API to get a reference to the MPS stream command buffer now this is the same command buffer we use in the backend to encode work onto so the work you are doing is as same as and it's a first class citizen as what we are doing work on and this enables to use optimizations like commit and continue to reduce the CPU side overhead which was discussed in last year's talk also uh we have this get dispatch Q API to get a reference to the C Q next create an encoder using that command buffer which we got a reference to and it allows you to Define your custom GPU kernel on that you encode the kernel using the dispatch queue to ensure the submissions from multiple threads are serialized after all the work is encoded you use the synchronize API to wait until the command buffer is done or if you don't need calization you can use the API and this allows you to internally use commission continue now moving forward in the second step of the custom colel support you can use the Pyon 11 to bind the objectiv C functions into python in a similar manner so in a simple manner next using the CP CPP extension you can build the custom soft sharing Library which can be included in your application so as a final step the custom build library is ready to be used in your application and we have replaced the soft shrink which was earlier slow and falling back to CPU with the your customized MPS shrink library now with the newly added custom kernel support the model runs much more efficiently all the copies and the intermediate tensors created by the fallback to the CPU are gone and the sequential model runs much faster here are some of the additional apis on events such as record weight and elapse time to do event management and create custom timing operations moreover apis on the MPS allocators such as set per process memory fraction exposes developers much more fine grain control over the backend for memory operations finally to wrap up this talk let's go some of the performance results as you can see the MPS backend has been significantly optimized with pyos 2.0 and Mac os14 the MPS backend is up to 5x faster compared to our initial release across various benchmarks from touch bench that's all for uh that's it for me uh for me today um thanks for listening and have a great conference ahead and if there are more questions I'm available and please uh reach out um that's it thank youhi um my name is kinet uh I work at Apple in MPS 
team and today I'll discuss enhancements made to the MPS uh backend in pych so let's get started um so I'll go over the qualification of MPS backin uh to the beta stage um new features we have added such as the profiler support custom kernel and developer apis to the MPS backend and we will close out with some of the performance improvements made to the MPS backend so since are released last year now starting with the beta stage just to recap um MPS back and started its Journey with pyo 1.12 last year when we introduced the GPU accelerated pyo on Mac platforms since then multiple improvements have been made for optimizing the memory usage and VI tensors this year with py 2.0 the MPS back in has been qualified for the beta stage now what that means is we have support for top 60 most used operators and more um the testing coverage has greatly improved and the network coverage has expanded as multiple popular models adopted the backend as a default backend on Mac OS but these were not all the improvements we made there is support for new features available in the latest pyd builds and we are constantly making progress on it such just profiling support custom kernel and some of the developer apis which I'll go in more detail later and furthermore the developers not only adopted the pyo MPS backend in their um external networks but also have contributed code for many new operators um into our code base such as histogram group Norm sign bit pixel Shuffle and many more now let's discuss some of the new features added to the MPS backend starting with the profile support which was enabled using OS signpost which is a feature in R it highlights operations which are executed on MPS backend Blitz between CPU and GPU and some of the operations which fall back to CPU to use profiler your application um we have a simple example which I'll go over and some of the API needed to enable that support um it has been integrated to the metal system trace and there's an uh command line tool which is also available for developers to take a look at now let's look at a simple U sample Network which uses a sequential model composed of linear layers and softwaring activation functions now this is a stman example um you can implement it directly as an op in py but I'm going to use this as a way how we can profile it if it was is not available as a backend op what we can do is that we can use the start and stop apis which we enabled in MPS profiler and have different modes to capture the signpost information as shown below so what happens as a result of this you you get a system Trace which uses all the OS signpost information after enabling those API and it can be visualized using a tool called metal system trace it has all a lot of other information along with the fact that whatever the sign post which we had enabled as part of the pytorch is also represented along with the other things in the timeline so here it highlights like the blit calls shown in the top you have fallback CPU kernels so the things which didn't actually got accelerated on MPS and operations which were actually executed on MPS and it allows you to actually start introspecting our Network now as you can see here the soft shrink at this point when we captured it was falling back to CPU which which is very slow just aside for developers who would like to quickly look at which operations the application is spending most time in we also have a facility of command line tool as shown using an environment variable you can dump out information about each 
layer such as the data types shapes GPU time and it allows you to introspect their application quickly now continuing with our example from before we saw that the soft shrink was falling back to CPU which leaves a big gap in the GPU timeline now to improve the performance we'll go over some of the custom kernel support we added which is one of the ways to do this now there are three steps to write a custom operation you first implement the operation in object to see in metal you create the python bindings and build your extension and finally after your extension is built you can import import that operation into your application and begin using it so let's start with the operation implementation there's a lot of code but I'll go over from start from the top so you start with the importing the torch extension header which includes all the pyos bits to write the C++ extension now after that there are apis which we have exposed to enable the custom functionality so you have this get command buffer MPS backend API to get a reference to the MPS stream command buffer now this is the same command buffer we use in the backend to encode work onto so the work you are doing is as same as and it's a first class citizen as what we are doing work on and this enables to use optimizations like commit and continue to reduce the CPU side overhead which was discussed in last year's talk also uh we have this get dispatch Q API to get a reference to the C Q next create an encoder using that command buffer which we got a reference to and it allows you to Define your custom GPU kernel on that you encode the kernel using the dispatch queue to ensure the submissions from multiple threads are serialized after all the work is encoded you use the synchronize API to wait until the command buffer is done or if you don't need calization you can use the API and this allows you to internally use commission continue now moving forward in the second step of the custom colel support you can use the Pyon 11 to bind the objectiv C functions into python in a similar manner so in a simple manner next using the CP CPP extension you can build the custom soft sharing Library which can be included in your application so as a final step the custom build library is ready to be used in your application and we have replaced the soft shrink which was earlier slow and falling back to CPU with the your customized MPS shrink library now with the newly added custom kernel support the model runs much more efficiently all the copies and the intermediate tensors created by the fallback to the CPU are gone and the sequential model runs much faster here are some of the additional apis on events such as record weight and elapse time to do event management and create custom timing operations moreover apis on the MPS allocators such as set per process memory fraction exposes developers much more fine grain control over the backend for memory operations finally to wrap up this talk let's go some of the performance results as you can see the MPS backend has been significantly optimized with pyos 2.0 and Mac os14 the MPS backend is up to 5x faster compared to our initial release across various benchmarks from touch bench that's all for uh that's it for me uh for me today um thanks for listening and have a great conference ahead and if there are more questions I'm available and please uh reach out um that's it thank you\n"