[MLIR] Enabling Composition of Kernels and Compilers - Jacques Pienaar, Google

Compilers Versus Kernels: A False Dichotomy

In recent years, there has been an ongoing debate in high-performance computing over compilers versus kernels. One side argues that compilers are the key to achieving optimal performance, while the other holds that hand-written kernels are the way forward. However, this dichotomy is not as clear-cut as it seems. In reality, both compilers and kernels are essential components of a high-performance system, and combining their strengths can lead to significant improvements in performance.

One area where compilers excel is in producing optimized kernels or operations right from the start, building on high-level intrinsics: operations that are specific to the problem at hand and can be tailored to take advantage of particular hardware features. By incorporating these intrinsics into the compiler pipeline, developers can obtain high-quality, optimized code for a particular workload or deployment without manual intervention.
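As an illustration of this style of development (not an example from the talk), here is a minimal JAX sketch: the author writes only high-level operations and leaves tiling, fusion, and hardware-specific code generation to the compiler (XLA in this case). The function name `mlp_layer` and the shapes are hypothetical.

```python
import jax
import jax.numpy as jnp

def mlp_layer(x, w, b):
    # High-level ops only; the compiler decides how to tile, fuse,
    # and map this onto the target hardware (CPU, GPU, or TPU).
    return jax.nn.relu(jnp.dot(x, w) + b)

# jit-compiling lets XLA fuse the matmul, bias add, and relu into
# optimized device code without any hand-written kernels.
fast_layer = jax.jit(mlp_layer)

x = jnp.ones((8, 128))
w = jnp.ones((128, 64))
b = jnp.zeros((64,))
y = fast_layer(x, w, b)  # first call triggers compilation
```

The trade-off, discussed next, is that this only works as well as the compiler's knowledge of the target.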

However, there is a catch. Finding the optimal lowering for an intrinsic is not always possible, and in some cases it is simply too hard. Optimizing intrinsics is also time-consuming and requires significant expertise in low-level programming and compiler internals. Furthermore, as new instruction sets and architectures emerge, compilers must adapt quickly to take advantage of these changes. This is where kernels come into play.

Kernels are a lower-level way of expressing computation: hand-written routines that target specific hardware features directly. Unlike intrinsics, which the compiler treats as high-level building blocks, kernels are written against the underlying hardware, often in assembly or with vendor intrinsics. By using kernels, developers gain fine-grained control over the optimization process and can exploit hardware features that the compiler may not yet support.

One approach to combining compilers and kernels is a hybrid system in which the compiler generates high-quality code and delegates selected inner blocks to custom kernels. This lets developers leverage the strengths of both without being limited by either. By building micro-kernels into the design, the compiler pipeline can target them in the same way it would target hardware units, while the micro-kernels themselves are tuned independently, as sketched below.
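As a purely illustrative sketch (the function names, tile size, and micro-kernel here are hypothetical, not an actual MLIR or IREE API), the dispatch decision in such a hybrid pipeline might look like this: the compiler keeps ownership of tiling and scheduling, and only the innermost block is delegated to a hand-tuned micro-kernel, with generated code as the fallback.

```python
import numpy as np

TILE = 64  # tile size the pipeline targets; hypothetical value

def microkernel_matmul_64x64(a_blk, b_blk, c_blk):
    # Stand-in for a hand-tuned micro-kernel (in practice: intrinsics
    # or assembly tuned for one specific block shape on one target).
    c_blk += a_blk @ b_blk

def generic_codegen_matmul(a_blk, b_blk, c_blk):
    # Stand-in for compiler-generated code covering arbitrary shapes.
    c_blk += a_blk @ b_blk

def lowered_matmul(a, b, use_microkernel=True):
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            for p in range(0, k, TILE):
                a_blk = a[i:i+TILE, p:p+TILE]
                b_blk = b[p:p+TILE, j:j+TILE]
                c_blk = c[i:i+TILE, j:j+TILE]
                # The compiler owns tiling, fusion, and scheduling;
                # only the full-sized inner block is delegated.
                if use_microkernel and a_blk.shape == (TILE, TILE) == b_blk.shape:
                    microkernel_matmul_64x64(a_blk, b_blk, c_blk)
                else:
                    generic_codegen_matmul(a_blk, b_blk, c_blk)
    return c
```

The point the talk makes is that this call site is a clean seam: the micro-kernel can later be replaced by a hardware unit (or by improved codegen) without restructuring the surrounding tiling and scheduling decisions.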

This approach has several advantages over traditional compiler-based optimization techniques. Firstly, it allows developers to take advantage of specific hardware features that may not be supported by compilers alone. Secondly, it provides fine-grained control over the optimization process, allowing developers to tailor the optimization strategy to the specific requirements of their workload or deployment. Finally, it reduces the complexity and cost associated with traditional compiler-based optimization techniques.

In addition to reducing complexity and improving performance, hybrid systems also provide a more flexible and maintainable solution. By using kernels, developers can adapt to new instruction sets and architectures without rewriting large portions of the compiler. Moreover, because micro-kernels can stand in for hardware units, a new target can first be exercised through its micro-kernel and later switched over to the real hardware unit with a simple one-to-one swap.

The benefits of this approach are not limited to performance improvements alone. By providing fine-grained control over the optimization process and reducing complexity, hybrid systems also improve maintainability and support for new workloads and deployments. Moreover, by leveraging the strengths of both compilers and kernels, developers can create optimized systems that are better suited to specific use cases.

In the IREE project, this approach has been applied successfully. By utilizing high-level intrinsics and custom micro-kernels, developers have achieved significant performance improvements across a range of workloads and deployments. Moreover, by combining the strengths of both compilers and kernels, it is possible to create optimized systems that are more maintainable and easier to adapt to changing requirements.

One notable example of this approach is the development of custom TPU kernels. Historically, XLA constrained TPU programs to a fixed set of high-level operations (HLOs), and developers had to rely on workarounds when a model needed something outside that set. However, as machine learning models have evolved, so too have the requirements placed on TPU code generation.

To address this challenge, custom TPU kernels were introduced in the OpenXLA and JAX ecosystem. The constrained op set was meant to preserve portability and keep the developer and support surface small, and custom kernels were initially positioned only as a fallback; in practice, however, they have proven highly effective at reaching performance the generic path could not.

In addition to their performance benefits, custom TPU kernels provide fine-grained control over the optimization process. By writing low-level code, roughly at the level of MLIR's memref, vector, and arith dialects, developers can tailor the optimization strategy to specific requirements and exploit hardware features that the generic compilation path does not yet expose.
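The talk transcript below names Pallas (lowering through Mosaic for TPU) as the concrete path for writing such kernels in JAX. As a minimal, hedged sketch, here is an elementwise kernel written against the public `jax.experimental.pallas` API; the kernel and function names are chosen for illustration, and real TPU kernels would typically also specify grids and block specs to control tiling and memory placement.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_relu_kernel(x_ref, y_ref, o_ref):
    # The kernel body operates on blocks via refs, giving the author
    # direct control over what the generated TPU code does.
    o_ref[...] = jnp.maximum(x_ref[...] + y_ref[...], 0.0)

@jax.jit
def add_relu(x, y):
    # pallas_call embeds the custom kernel into the surrounding
    # XLA-compiled program instead of replacing the whole pipeline.
    return pl.pallas_call(
        add_relu_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.ones((256, 256), dtype=jnp.float32)
y = jnp.full((256, 256), -0.5, dtype=jnp.float32)
z = add_relu(x, y)
```

Because `pallas_call` produces an ordinary JAX operation, the custom kernel composes with `jax.jit` and the rest of the compiled program rather than fighting it, which mirrors the "does not fight against the compiler" point made in the talk.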

The success of this approach has been recognized within the industry, with researchers and developers adopting similar techniques for optimizing TPU-based systems. Moreover, as machine learning models continue to evolve and demand more computational resources, the need for such hybrid optimization techniques will only grow.

In conclusion, compilers and kernels are not mutually exclusive concepts, but two essential components of a high-performance system. By combining their strengths and leveraging hybrid approaches, developers can achieve significant improvements in performance, maintainability, and support for new workloads and deployments. As machine learning models continue to evolve and require more computational resources, the importance of combining these techniques will only grow.

"WEBVTTKind: captionsLanguage: enhi everyone uh I I'm jacqu uh I work at Google I work on uh ML compilers and different deployment tooling um and today you know I'll be talking a little bit about like enabling composition of kernels and compilers you know some of the work we've been doing with with like different compiler projects at Google so a lot of the talk is actually going to be you know talking a little bit about ml ml is a collection of modular modular and reusable software components that enable the progressive lowering of high level operations to efficiently Target Hardware uh in a common way so this is uh a toolkit that we use in many of our different projects and deployment scenarios uh it is a compiler to kit in general not ml specific you know used in many different domains but today I I'll focus on like the the the ml section of the world um because well that that's a a big section to begin with um so I sort of when preparing for this talk I I look back at actually like the very first slide we made about ml so this is like back in 2018 you know we had this goal slide where we said like what do we want to do so at this point we were building some optimization infra for for tensorflow we were redoing the graph infrastructure part and and the one part that sort of like stood out to me is like we had well Wei had a section about like customization you know the ability to be able to intermix different components the ability to be able to to customize the compilation process as well as enable more efficient targeting of these um of these system systems now one of the funny things that has happened since then ml was been released it's been in the industry it's adopted at many different companies and you know I've actually had the same conversation with multiple different Folks at different companies where you know they would come to me and say like you know uh you know like M has been useful but it it's it's been a little bit problematic because we used to have a compiler team and we used to have a library team now both folks are working on on ML and and it's getting a little bit fuzzy who supposed to be doing what um you know because we have Library team writing generators for their kernels so they using ml the python bindings and whatnot the high level optimizations as as a toolbox to write some of these kernels but we have we have also have the compiler team you know using uh you know incorporating kernels as well as micr kernels you know so like tongue and cheek you know is it the success if the project causes reorgs in other organizations maybe um but you know it could also just be like I think in in a composable uh compiled Pipeline with Mod components this line just naturally gets blurry you know like it there is the uh like a common utility as well as like benefit of using these Technologies now there there's no Silver Bullet to solving this tension between like should we used Library should be you know what should we use where to what extent how much human how much you know how much automatic how much do we search and whatnot so I mean ml provides all the infrastructure to build ir and Transformations uh but and it makes it easy to iterate on IR design and try different things out but composability is still the key right so there's a big human design component on it uh composability is never perfect you know assembling entire tool chains is still work you know and there's a lot of design that actually goes into these deep learning compilers to get good performance and to 
ensure you know you have a predictable uh lowering path that gives you good performance so I'm going to give a couple of examples of you know where we we've sort of like tried to well we and others have Incorporated like some of these ideas and and sort of like mixed you know like kernels as well as as like compilers to try and get like better performance and and some of the things that you know worked and didn't work you know and one example we we were we had a focus looking at an ear compiled module so you have a model they were compiling it down and they were looking and they were trying to improve performance right so they did the performance analysis they they did their histograms they identified the most computationally intensive Ops they they found a high performance library that has these exact same Ops you know and folks got quite excited because they found the the kernels were actually 4X faster than the generated one so it's like this is awesome right we have 4X higher kernels we can just slap it in it's the most expensive op so this this should go awesome uh except of course reality is after a couple of weeks of integration and tuning the the end to- endend results was actually 29% worse right so like incorporating this high performance colel for the most computationally intensive part ended up in degrading the performance of the model quite significantly now why is this well for one it blocked compiler's ability to optimize you know it by by using this kernel for for these operations it it disabl the ability to do fusions it broke up tiling so now suddenly you have additional synchronization you have additional transfers uh additional serialization in the process a lot of these colel liers are actually written assuming they are the entire system you know so for example in this case this one is made internal copies of all the data before it operated on it uh and this meant like additional copies that was not required also they're written assuming they control everything so the the way they've been tuned and finally optimized you know uh it it's it's it's I think a very delicate balance to to get like the exact utilization by considering the the fretting the read write ports of the instruction dispatch etc etc Now by incorporating this and being part of a system you actually destroyed some the assumptions that has been made about locality and then also very importantly the execution model there was a mismatch between what the system was doing and the compiler was generating and what these libraries were doing right so the these libraries were being inserted where single threed functions were being generated normally because the execution model is sort of was sort of similar to how you would do with GPU multiple frights everyone SD style execution wise um and anybody who has used like Ned parallelism sort of kind of a test that it gets very tricky very quickly now what's one of the solutions I mean there's of course many you know and one of them is like actually microc kernels something that incorporates better with the compiler you know and as a disclaimer everybody claims to have microc kernels and nobody can agree what they are or the granularity that is you know we had one discussion where somebody said oh well no we have a micral you guys actually have nanoc kernels you know so like it's it's what it is but uh and I mean on the right you know the go-to pl like classical example represented like in the Bliss formalism where I go down to like the little micr as well uh in in 
in short you know the way I think about it is like highly optimized building blocks use in multiple different kernels Cen to produce optimized kernels or operations right so these are things that could actually have been intrinsics high level intrinsics or even Hardware units and importantly they can also stand in for Hardware units so this is a thing where like a system where you have microc kernels in your design like when you're trying to Target new hardware you know you can use the this micr kernel initially make your compiler pipeline Target it and then if it's able to Target it well then hopefully by the time it you have your Hardware unit back it's a simple switch because it's just like a oneto one flip there um like I said there's a couple of reasons why it avoids solving the hard problem generally you know perfectly scheduling all these instructions at a low level is quite difficult having a user right a couple by hand and tuning you know very aggressively is quite easy uh new isas come out compilers can necessarily evolve as quickly as you know as some of the models or adaptability you know it's not really a thing to go and hack the Isel instruction selection of a compiler right there's huge implications but changing a microc kernel or specializing for one is quite easy uh also there's many folks that can write C fewer folks that can write compilers so like being able to ex enable folks to do both we have the C and intrinsic skills to do this I think is quite useful and then it's a stepping stone to Cen to figuring out what you need in general now at Eerie in the level it's sort of like at the register level you know this actually enables the like the fusion together both fusion and granularity sorry an execution granularity sort of match it makes it easy to co-optimize and schedule and run and you know you can think of this as just high level intrinsics and compiler has to support it anyway so it's sort of like a natural fit in the system and the performance results or improvements from utilizing them is actually quite good now I mentioned that you know for a lot of the the work you know you actually need to be able to write intrinsics write the assembly you know and and one of the things that has been happening like in in in uh in tpus is the ability to write some custom TPU kernels now historically xay was very clear like hos are the set of Ops goal is to enable portability as well as a decrease compile developer space and support right make it more maintainable make it easier by constraining the upset uh you know fullback custom cels were initially introduced as a fallback you know Hal to World kind of thing right if you use them your performance degrades it's your you accept it but I think like the requirements have changed as the ml models have changed and you know this means support for custom kernels uh you know has been felt more and more and you know I I think it's it's not just in in open xay but also in P A lot of these same discussions have been happening as seen from some of the posts uh now in in Jacks itself you know the way these have been introduced is sort of like you have Palace as a high level language that lowers it it down and it's able to Target these things mosa TPU is this lower level format it's not Jackson specific it's a python interface uh effectively it's it's like syntactic sugar on the generate mlr python bindings you know folks that know think of like sort of like mem vector and aift level it's lowlevel assembly uh which is good it means you have 
full customer well it's good or and or bad depending on how you use it right but this gives you the o opportunity to actually do some very low-level detail things um and you know I think it's one where you know it's been uh adopted quite widely internally and you know it's actually succeeded where multiple previous approaches have actually failed and one of the reasons why is it it just it does not fight against the compiler it was designed to fit the constraints of the existing pipeline it it it had considerations made as well as ongoing improvements to increase the blurring between what is custom what is not um and also it doesn't actually reimplant parts of compiler it utilizes the the compiler passes and infrastructure so it it's it's actually beautiful feeding back into one another uh in terms of being able to increase compiler while working into it so sort of as summary and as the lightning talk comes to a lightning end you know uh you know compilers versus kernels is sort of a false dichotomy you know it it's it's uh folks often will say it is but you know I I think once one the a little bit deeper it these are two essential components to being able to produce optimized performance for the m workloads and deployments you know and we actually get better results by utilizing the strengths of both and combining them together thank youhi everyone uh I I'm jacqu uh I work at Google I work on uh ML compilers and different deployment tooling um and today you know I'll be talking a little bit about like enabling composition of kernels and compilers you know some of the work we've been doing with with like different compiler projects at Google so a lot of the talk is actually going to be you know talking a little bit about ml ml is a collection of modular modular and reusable software components that enable the progressive lowering of high level operations to efficiently Target Hardware uh in a common way so this is uh a toolkit that we use in many of our different projects and deployment scenarios uh it is a compiler to kit in general not ml specific you know used in many different domains but today I I'll focus on like the the the ml section of the world um because well that that's a a big section to begin with um so I sort of when preparing for this talk I I look back at actually like the very first slide we made about ml so this is like back in 2018 you know we had this goal slide where we said like what do we want to do so at this point we were building some optimization infra for for tensorflow we were redoing the graph infrastructure part and and the one part that sort of like stood out to me is like we had well Wei had a section about like customization you know the ability to be able to intermix different components the ability to be able to to customize the compilation process as well as enable more efficient targeting of these um of these system systems now one of the funny things that has happened since then ml was been released it's been in the industry it's adopted at many different companies and you know I've actually had the same conversation with multiple different Folks at different companies where you know they would come to me and say like you know uh you know like M has been useful but it it's it's been a little bit problematic because we used to have a compiler team and we used to have a library team now both folks are working on on ML and and it's getting a little bit fuzzy who supposed to be doing what um you know because we have Library team writing generators for 