Flash Attention Machine Learning

The Power of SRAM: Unlocking Efficient Memory for AI Applications

Tokenization, compression, and dictionary creation are essential components of AI applications, including natural language processing (NLP) tasks such as text classification and sentiment analysis. By compressing input data into a dictionary, we can significantly reduce the amount of memory required to store and process large datasets. This is particularly important for AI models that require vast amounts of memory to operate effectively.

The Role of SRAM in AI Applications

SRAM (Static Random Access Memory) is an essential component of modern AI applications. With its ability to provide high-speed access to data, SRAM enables efficient processing of complex algorithms and models. In the context of AI, SRAM can be used to store and retrieve large amounts of data, reducing the need for slower memory types such as hard drives.

SRAM's Performance Advantages

One of the key benefits of using SRAM in AI applications is its high-speed access capabilities. Unlike hard drives, which rely on mechanical heads to read and write data, SRAM allows data to be accessed rapidly, reducing the time it takes to process large datasets. This is particularly important for AI models that require fast processing speeds to operate effectively.

The Impact of SRAM on AI Model Performance

By providing high-speed access to memory, SRAM enables AI models to process large amounts of data more quickly and efficiently. This can lead to significant improvements in model performance, including increased accuracy and speed. Additionally, SRAM's ability to reduce latency reduces the time it takes for AI models to make predictions, allowing for faster and more accurate decision-making.

SoftMax Scaling: A Key Component of Efficient AI Models

SoftMax scaling is a key component of efficient AI models, particularly those that use attention mechanisms. By scaling the output of the SoftMax function, we can reduce the computational requirements of these models, making them faster and more energy-efficient. This technique is essential for achieving significant performance improvements in AI applications.

The Importance of Shared Memory

Shared memory is a critical component of efficient AI models. By allocating large amounts of memory to specific parts of the model, we can reduce the need for data to be copied and pasted between different components of the model. This reduces both computational overhead and memory usage, leading to significant performance improvements.

The Impact of Shared Memory on Model Performance

Shared memory enables AI models to process large amounts of data more quickly and efficiently. By reducing the amount of data that needs to be copied and pasted between different parts of the model, we can reduce both computational overhead and memory usage. This leads to significant improvements in model performance, including increased accuracy and speed.

The Mega Kernel: A Key Component of Efficient AI Models

A mega kernel is a critical component of efficient AI models. By merging multiple operations into a single kernel, we can reduce both computational overhead and memory usage. This approach enables AI models to process large amounts of data more quickly and efficiently, leading to significant performance improvements.

The Impact of the Mega Kernel on Model Performance

The mega kernel approach enables AI models to achieve significant performance improvements. By reducing both computational overhead and memory usage, we can lead to increased accuracy and speed. Additionally, the ability to merge multiple operations into a single kernel reduces both training time and deployment time, making it easier to deploy efficient AI models.

PyTorch Extensions: Enabling Efficient AI Model Deployment

PyTorch extensions are essential for deploying efficient AI models. By providing a interface for working with tensors and other data structures, we can simplify the development and deployment of AI models. This includes enabling developers to create custom kernels and modules that can be used to build more complex AI models.

The Power of PyTorch Extensions

PyTorch extensions provide a range of benefits, including simplified development and deployment of AI models. By providing a interface for working with tensors and other data structures, we can reduce both computational overhead and memory usage. This enables developers to create more efficient AI models that can be deployed more quickly.

The Importance of the Flat Function

The flat function is an essential component of PyTorch extensions. By providing a way to access custom kernels and modules, we can simplify the development and deployment of AI models. This includes enabling developers to create custom kernels that can be used to build more complex AI models.

PyTorch Extensions: A Key Component of Efficient AI Model Deployment

PyTorch extensions are essential for deploying efficient AI models. By providing a interface for working with tensors and other data structures, we can simplify the development and deployment of AI models. This includes enabling developers to create custom kernels and modules that can be used to build more complex AI models.

The Role of Cuda Code in Efficient AI Model Deployment

Cuda code is essential for deploying efficient AI models on GPUs. By providing a interface for working with tensors and other data structures, we can simplify the development and deployment of AI models. This includes enabling developers to create custom kernels that can be used to build more complex AI models.

The Impact of Cuda Code on Model Performance

Cuda code enables AI models to achieve significant performance improvements. By providing a interface for working with tensors and other data structures, we can reduce both computational overhead and memory usage. This leads to increased accuracy and speed, making it easier to deploy efficient AI models.

In conclusion, SRAM plays a critical role in enabling efficient AI applications. By providing high-speed access to memory, SRAM enables AI models to process large amounts of data more quickly and efficiently. The mega kernel approach and PyTorch extensions are also essential for deploying efficient AI models, enabling developers to create custom kernels and modules that can be used to build more complex AI models.

"WEBVTTKind: captionsLanguage: enflash attention with all these large language models and self attention driving a Transformer model the idea behind flash attention is to dramatically accelerate the performance of these models using an approach that I have personal experience with I have done this myself the idea is to create a more efficient pipeline of transformation of the matrices and data as the input data is being transformed through each of the layers and guess what it sounds complicated because there's like there's a paper and there's a whole bunch of different authors associated with it and it says flash attention fast and efficient memory extren extraction or exact attention with IO awareness that just it's like what's going on here what are all we can actually bring it down to a really simple we can make this really simple this diagram boy does it look complicated but really the the really important part is right here at the very end is this little little chart here at the end let's zoom in on that all right so when we're building a a language model uh or any any AI model in general what you have are a series of transformations of input so your input data is say text like Words uh sent sentences like large paragraphs of text that gets transformed into uh an floating Point numbers right something that mathematically can be representated through a series of math functions the big one here is going to be matrix multiplication you're also going to have uh training training um algorithms here that help converge and generalize outputs and each of these steps are different functions and you've got pytorch here py torches can interfac via Cuda and other GPU backin that is going to require a lot of memory transfer between each of these phases and each of these steps and there are pipelining procedures that can occur in pytorch pytorch 2 does a better job of this however it's still not perfect because there are different phases of Transformations as you transform every single uh unit that's being computed mathematically every single weight weight and coefficient so when we're looking at this here we see each essentially they're all combined what it does is it merges all these different mathematical equations into one step isn't that sound a lot better A lot faster cuz what you get to do is compute these units all through one pass on the GPU sort of in an aused kernel I've done this firsthand and seen insane performance improvements and it doesn't have to move memory around it's just Computing a number based on these algorithms and then it does a final merge on the output and that data can get copied through texture memory copy back to the CPU so then you can update the model weights doesn't that sound a lot better all in one and that's like the whole deal and they call it flash attention when they should just call it good programming I think it's like flash attention it's a new technology no they just made the algorithms more efficient they just created one giant Cuda kernel that gets compiled and passed over to a compute unit and then as you pass data into it it can compute that that floating Point number in one giant phase across multiple cores simultaneously and it doesn't have to use any additional memory passing and that's where the speed comes in so we look here the memory passing can be very very slow so on the CPU range you're going to get 10 you can get faster than 10 GB a second now that does sound really fast that's pretty fast that's pretty fast memory but you have a terabyte of memory and it takes you know takes a lot of seconds minutes to get data to load into memory but in a GPU you get a terabyte a second that's a lot that's really fast so that means you can fill up the GPU memory in nearly instantly so if we can keep all of the algorithms in GPU memory and we don't have to Transit back and forth and we can leverage as much of the SRAM as possible guess what you're going to have a much faster language model like a lot faster like look at that like 100x that's 100x faster with flash attention we can use it to accelerate our machine learning models and specifically this is going to be a Transformer with self attention right where we can use scaled. product attention and accelerate that entire pipeline because the whole idea behind flash attention is just an efficient Mega kernel an all-in-one Cuda kernel that is that is written and optimized for p torch lip torch right and specifically we're not looking at the full Transformer we are looking just at the attention layer which is a a pretty substantial attention layer I would say and what I'm noticing here is uh there so here qkb is your input data right so this is going to be your input data to the model in general right what is your input data is going to be your text data right your sentences your numbers your tokens uh this is going to be say um a word a a big sentence even you could say this is the input data the main function Implement it's scale. product attention that could be your input data right a sentence these words will be transformed into tokens which are going to be uh integers right number values a dictionary essentially with an index representing the the diction the name the word in the dictionary right so this might this word right here might represent you know number 42,000 right this word right here might represent the uh the number 300 because that's where it is in the index of the dictionary once these are in in converted to tokens you essentially have your input data because now it's numbers that you can throw into an equation a mathematical model which is what machine learning is in the first place and the attention model with scale. product it looks like this it's really quite simple it's all about self attention the model can attend to itself so it can understand the input data in context as it's right because these words where they are located to each other have meaning right where this word Main and the word product could have some significance but these words combined has a much more significant meaning and so you want to attend these words with each other now q and K and V they're all the same this just me it's the same sentence so it's this sentence here and it's just an array it's a simple array right so then weend q and K to each other q&k typically have like a positional encoding so you can extract like every other word might have an interesting component in as you're representing it next to each other and then B typically has no positional encoding it's just it's just the original the original tokens so looking at the soft Max uh self attention right flash attention here is the main function all in one this is going to be our most important function here because it's all packed in one giant kernel we're going to do the matrix multiplication on Q and K then we're going to scale that with with a a scale a floating Point uh the default is going to be the square root of the dimension of the number of heads uh oh under one the dot prodct scaling aspect and then of course soft Max is going to be very useful from an output perspective uh as it helps normalize the output this is important you don't have to do this like it's not mandatory you you can skip the soft Max uh however you will very much potentially depending on how you've set up your dictionary have a large range of potential outputs that could significantly blow out the meaning so what you want to do is sort of compact all that output normalizes it's called a normalization function so that way even if you have a a really big number as the output data it'll always be between zero and one and that really big number relatively to all the other output data will be it will always add up to 100 right all those numbers all the outputs add up to 100 so this is normalizing which you it helps a lot you you will want it it's not required but it helps so this is it right here look it's really simple this very straightforward isn't it like soft Max matrix multiplication scale it and then another matrix multiplication that's it oh but it requires so much code in order for that to work it's ridiculous we get a few extra bonuses here it also includes a Dropout for training purposes you get uh ooh is that linear bias attention with linear bias it adds oh look at that this is cool so they give you a few extra bonuses here that you can use which are more modern approaches when you're using a Transformer you're applying your attention whether to apply causal attention mask all right so for autoaggressive modeling that you could apply a mask to the output and they do it in the Cuda which is great because they don't have to do that in CPU layer which sort of throws away half of the but the the typical mask that you'll see it will Zero out half of the output Matrix in a diagonal fashion because there could be some symmetry on either side and that's important because when you're multiplying q and K you're going to have symmetrical information because you're iterating attention between the word the' needs to be attended to the word Main and functions and implemented and then you do the same here with Main and the' you can see there these two words will very quickly be the same output because you're self attending every word to every other word but you'll have duplicates so you can apply a mask because it's uh it you don't need to bother uh multiplying those numbers again further down the line because it's repetitive this one I'm less familiar with it's a window size implementing siding window for local attention oh okay so this could be the attention sizing which might reduce so if you have a very large input you could have a smaller output because you're looking at more potentially uh maybe instead of a a 1 by one where you're comparing it you might have a you might be comparing two words to two words maybe I I mean that's what it sounds like to me attention with linear bias is pretty straightforward it's just uh it's just scaling it based on all right attention with linear bias actually it's a positional sort of like a positional encoding method which is actually pretty great because one of the things that we notice is that it didn't mention the a sort of cosign position on alternating tokens which is a common you know or traditional positional encoding you can use a scaling linear bias positional encoding method and so you don't have to use traditional position embeddings and then you can have the GPU code this for you O Okay so this is why it's even faster than I thought it was before and the good news is it looks like there's a default so you can just pass in qkv straight up as is without having to worry about pre positionally encoding them usually in the CPU right and you need to do that in a CPU because it's an alternation you could do this in Cuda but then you're going to have to add another layer on top of it and then remerge it and there's some extra work to do there which is tricky there's also a deterministic uh backward pass implementation which uses more memory and is slightly slower let's take a quick look at the performance differences that we see in the significant games of flash attention with an all-in-one Mega kernel in the GPU written in Cuda we see that flash attention specifically flash attention 2 is significantly FAS and it makes a lot of sense why because you're merging all that logic into one giant you know math program and that program can run on a GPU inside each of its cores without it having to reach back out to memory and pass data around so you're going to get the most potential performance and you're going to make that GPU really really toasty because you are going to leverage all of its compute power without any sleep cycles waiting for I/O and you can see here it's significantly faster so we're looking specifically we want to look at the best possible scenario most common scenario 2 and that is going to be in the case where we are looking at a full attention forward pass or back pass mostly a forward pass where forward attention and backward speed on an a100 with 80 gb uh with the causal mask and 120 attention Dimension heads and we can see here as your input sequence length goes bigger and bigger we see a nice increase in performance game capability as distributive capacity of flash attention to it's going to just Rise Above the Rest because of its ability to merge all these tasks in one we're going to see slightly less interesting still still some performance gains here compared to Pi torch as is right the standard Pi torch where it has to do some of the work in one of the uh math functions like matrix multiplication then it moves over to Dropout then it moves over to soft max if soft Max is even available in uh it might be it might not be uh available right it's they've imp bled soft Max with flash attention inside of Cuda from what I understand I'm just trying to think of how they go about because when you're working with uh every say say you're working with the Shader and you you're working with a a pixel with you know RGB values maybe Alpha as well so you have a uh a four-dimensional array that's all you really have access to for each core just basically one pixel at a time you understand it's position and location within that texture and you also have the values the rgba values but you don't know any of the others you don't know anything else about any of the others so it's possible that we could be leveraging one of these channels to indicate um a max value so that way we understand how to scale it in a soft Max perspective so there's a potential performance gain there and that's really where you know flash attention comes into the wind is being able to run all that logic in the GPU core kernel directly without you having to jump back over to the CPU memory cuz that latency on the io is what really gets us that's where the next the next Evolution comes in and and this is just for the attention layers not talking about the entire Transformer imagine how cool we can get if we implemented a mega Giga kernel allinone super super core super uh kernel oh that would be so cool so they succeeded with flash attention and just the attention layer the number of Tera flops a second is just phenomenal you just compare this like 23 Tera flops versus 100 right that's a 5x we're seeing like 510 oh that's this is this is beautiful and we're seeing roughly a 5x performance boost just using it as is with this which is just phenomenal that's just over over standard pie torch and with uh and that's flash attention two right but flash attention one you still get a 2X Improvement it's amazing these These are phenomenal numbers just so good and if we take a peek let's just simplify the chart here really quick so we kind of see a little bit better uh this is running on the h100 with 80 GB of RAM you know a nice beefy GPU here we see that pytorch does a pretty good job that's still a lot of Tera flops don't get me wrong right 62 Tera flops a lot of Tera flops but when you compare that with flash attenion and flash attention to you would much prefer to Leverage The Flash tension Mega kernel over a p torch any day I mean just look at this be fantastic performance gain uh so what we can do is we can leverage this approach and this technology uh for more of the layers in a Transformer model to significantly improve the performance and reduce the waste of time it takes as layers transit in between each of the mathematical equations need to run on all the matrices oh check this out we've got something new we got a a new Flash attention minimal that we can take a look at and look at the the Cuda code which looks a lot like C++ I think actually it might just be it is oh it really is its own language uh it's CU right so we've got a Cuda language what we can do is take a look at this flash attention minimal it's an implementation someone wrote in Cuda that takes advantage of the forward pass only not the back pass obviously back path is going to be a little more intense because there's back propagation which requires the inverse but you also have to store the gradients and remember the gradients as you're going backwards so that way you can use those gradient informations right those matrices to find the Deltas of how far off the output was so you can train the model this flash attention minimal does not have that part it only has forward pass which allows you to run inference right so you can just ask questions for the model and the model will give you an answer that's all it can do you cannot train a model with this approach you'll need more than 100 lines of Cuda but if you look at the profiling really quick taking a look at just normal P torch with CPU and Cuda time the profiling on manual attention just regular attention uh algorithm and we know that algorithm looks like this a very simple straightforward algorithm soft Max dot product with q and K which is self attention of a sentence and then you scale it based on uh the number of heads I think it was right the square root of the number of Dimensions over uh under one and then you apply soft Max to that Matrix and then you dot product again this this output with the attention of the sentences again so it's pretty straightforward algorithm really powerful self attention right and look at what you get a significant performance Improvement profiling the minimal flash attention what is this what is that speed all right 4 milliseconds compared to like 52 milliseconds right it's like 13x Improvement wow that's because what happens is you put all of the mathematical operations into one Mega kernel rather than yeah so here's obviously the drawback is it can only do this one thing really well right it's like a super function that takes a bunch of common algorithms into one and makes it one giant uh kernel that can be used and install in all the cores of the GPU so when you're putting the input data in it's going to run all those algorithms together in one pass without having to Transit memory and which that's the slow part the io taking a look at our Cuda oh nice look at that okay so they give us they give us almost everything that we need here it would be nice if the variable names were a little bit little bit extra obviously we're fine with qk and V because we know that's just the input tokens do we have a tension with linear bias that we got with the other one no doesn't look like we have that one thing I'm noticing here straight off is that it since this is only a forward pass kernel and there's no back propagation it seems odd that we have batching because if we're self attending and it's sort of expecting to do what you know large language models do where they run through one pass get the output in a full Transformer and then pin the output to the input again and then run it through again to get the next word right the next token you don't need to batch that batch is only really good for training perspective so this could be you know fairly interesting from a future perspective to support batch but this like one optimization I noticed so we have our thread and block IDs in Cuda world this is going to give us information about where our current core is running within our kernel right we've got our forward kernel it's like which one of these cores on the GPU are we are we located in and we use that information to be able to access and perform mathematical functions on each of the essentially pixels right of of the Matrix and the Matrix is going to be a representation of our input data which is the sentence and the words right the tokens and each one of those uh each one of these threads is going to be able to work just on that one word at like that one input at a time because you don't want to do uh that's how like the parallelization work right that's how it works in a GPU you able to split out all the work uh across multiple cores so it can run simultaneously like all at once let's do a matrix multiplications let's do all that work all at once and then we merge the output boom there we go performance Improvement that's really where the power of the GPU comes in is because it's sort of breaking down it allows you to write code to break down uh as much as possible the amount of work into a finite amount of space you want to be able to spread out that workload and you know get the most out of it uh by making as small of a chunk as possible because then you can just keep scaling up the number of cores in a GPU and you can add more parallelism and then you get really really fast output so I see we're using SRAM here which is going to be really powerful we want to store all of our Ford P all the input data right or qkb we want to store that in SRAM which is a small amount of ram but guess what it's really fast how fast oh let's see here we saw that earlier uh we are seeing upwards of 20 terabytes a second that's a lot you get 20 megabytes but this is going to be that's not a lot right that's not a lot of uh of memory but it's more than enough for the input data right so you're going to have like maybe a document and you're going to tokenize that document which is essentially compressing it compressing it into a dictionary and so you can have a large input size using SRAM alone and SRAM is going to give you the extreme power SRAM is sharable as well so we're going to move all that um input data into SRAM and now all of our cores can access the uh the input this is so cool so now we can even do we can is H soft Max going to be in here they better have it it better be here ah okay yep yeah all right so we're getting our Max we need to find the max we definitely need SRAM for this we need to look over all the srams we're going to get to find the maximum number which allows us to scale oh yep here we go soft Max boom we ourselves a soft Max right there I don't want to read this it's just too all these it's just a bunch of variable names like it's hard to keep track of what all this is this looks like it's okay so it's finishing off the uh self attention scale dot product here leveraging with access to SRAM which is really powerful a really performant memory something that I don't necessarily like is the syns the need to use sync threads here such that uh I mean I I get it I would say probably run that once on import for the import data but inside of a for Loop seems problematic to me uh I don't like this implementation I feel like the kernel can be significantly improved if we eliminate the need for sink threads altogether uh it's especially inside of a for Loop but I like this part I really like this part so we get a Ford path has tensor with qkb this is really powerful because this is the tensor interface that you're going to get in pytorch just a simple forward and all you have to do is pass in qk and V which is the input data right and then you get the output as it executes the kernel uh what work does it do so what work does it do here before it passes it in so on the CPU we calculate uh we we populate uh matrices we create some matrices and then we're going to pass those matrices to the gpu's memory and then we do some preparation to get the the memory amount size uh looks like we also get the soft Max scale what are we doing here all right we're getting Dimensions so we understand what how big the Matrix is going to be for the input data I like this this is nice to have it really is because when I pass in my input data I don't want to also have to pass in all this extra work that the function could just determine itself based on the input data alone I don't want to have to say it's this size and that size thank you for doing this for me and then once Ram is all populated and we uh have a little bit of extra information here we run the kernel GPU function that's this this giant math equation up here and then we get the output oh is a data pointer so if we weren't sure before we are now we can see kind of the idea of a minimal flash attention implementation in Kudo directly you can see that just merges together a bunch of functions into one giant mega kernel and allows also it you know puts all the input data into as shared memory that's really really fast um we could make some significant performance improvements here I do see uh by eliminating sing threads overall that's the whole idea why does Flash attention so effective it merges multiple operations into one kernel so you don't have to waste time on iO that's immensely powerful it's really fast now imagine we can do that for all the rest of the phases our Transformer model making it even faster and that's possible we could totally do that so this proves that we can do that with the rest of the thing and imagine we just did one mega super kernel now we're approaching the future oh and just in case you wondering how pytorch can import this forward past main. CP uh main uh C++ file essentially just includes torch extensions uh and it identifies the uh the Prototype and then it runs pie bind ah neat and then we import okay this is kind of cool so we take the pytorch extensions the C++ extensions we we ort the load function and then we get ourselves that forward pass uh C++ main definition through the flat yep and this minimal attention function is now our forward pass and just in case you wanted to see what the difference was on terms of how much python code you need compared to how much uh c c code you need and Cuda code you need it's a lot more I mean just look at this isn't isn't this amazing it's just you know a few lines of code compared to over 100 lines of code that's required to get it to work on a GPUflash attention with all these large language models and self attention driving a Transformer model the idea behind flash attention is to dramatically accelerate the performance of these models using an approach that I have personal experience with I have done this myself the idea is to create a more efficient pipeline of transformation of the matrices and data as the input data is being transformed through each of the layers and guess what it sounds complicated because there's like there's a paper and there's a whole bunch of different authors associated with it and it says flash attention fast and efficient memory extren extraction or exact attention with IO awareness that just it's like what's going on here what are all we can actually bring it down to a really simple we can make this really simple this diagram boy does it look complicated but really the the really important part is right here at the very end is this little little chart here at the end let's zoom in on that all right so when we're building a a language model uh or any any AI model in general what you have are a series of transformations of input so your input data is say text like Words uh sent sentences like large paragraphs of text that gets transformed into uh an floating Point numbers right something that mathematically can be representated through a series of math functions the big one here is going to be matrix multiplication you're also going to have uh training training um algorithms here that help converge and generalize outputs and each of these steps are different functions and you've got pytorch here py torches can interfac via Cuda and other GPU backin that is going to require a lot of memory transfer between each of these phases and each of these steps and there are pipelining procedures that can occur in pytorch pytorch 2 does a better job of this however it's still not perfect because there are different phases of Transformations as you transform every single uh unit that's being computed mathematically every single weight weight and coefficient so when we're looking at this here we see each essentially they're all combined what it does is it merges all these different mathematical equations into one step isn't that sound a lot better A lot faster cuz what you get to do is compute these units all through one pass on the GPU sort of in an aused kernel I've done this firsthand and seen insane performance improvements and it doesn't have to move memory around it's just Computing a number based on these algorithms and then it does a final merge on the output and that data can get copied through texture memory copy back to the CPU so then you can update the model weights doesn't that sound a lot better all in one and that's like the whole deal and they call it flash attention when they should just call it good programming I think it's like flash attention it's a new technology no they just made the algorithms more efficient they just created one giant Cuda kernel that gets compiled and passed over to a compute unit and then as you pass data into it it can compute that that floating Point number in one giant phase across multiple cores simultaneously and it doesn't have to use any additional memory passing and that's where the speed comes in so we look here the memory passing can be very very slow so on the CPU range you're going to get 10 you can get faster than 10 GB a second now that does sound really fast that's pretty fast that's pretty fast memory but you have a terabyte of memory and it takes you know takes a lot of seconds minutes to get data to load into memory but in a GPU you get a terabyte a second that's a lot that's really fast so that means you can fill up the GPU memory in nearly instantly so if we can keep all of the algorithms in GPU memory and we don't have to Transit back and forth and we can leverage as much of the SRAM as possible guess what you're going to have a much faster language model like a lot faster like look at that like 100x that's 100x faster with flash attention we can use it to accelerate our machine learning models and specifically this is going to be a Transformer with self attention right where we can use scaled. product attention and accelerate that entire pipeline because the whole idea behind flash attention is just an efficient Mega kernel an all-in-one Cuda kernel that is that is written and optimized for p torch lip torch right and specifically we're not looking at the full Transformer we are looking just at the attention layer which is a a pretty substantial attention layer I would say and what I'm noticing here is uh there so here qkb is your input data right so this is going to be your input data to the model in general right what is your input data is going to be your text data right your sentences your numbers your tokens uh this is going to be say um a word a a big sentence even you could say this is the input data the main function Implement it's scale. product attention that could be your input data right a sentence these words will be transformed into tokens which are going to be uh integers right number values a dictionary essentially with an index representing the the diction the name the word in the dictionary right so this might this word right here might represent you know number 42,000 right this word right here might represent the uh the number 300 because that's where it is in the index of the dictionary once these are in in converted to tokens you essentially have your input data because now it's numbers that you can throw into an equation a mathematical model which is what machine learning is in the first place and the attention model with scale. product it looks like this it's really quite simple it's all about self attention the model can attend to itself so it can understand the input data in context as it's right because these words where they are located to each other have meaning right where this word Main and the word product could have some significance but these words combined has a much more significant meaning and so you want to attend these words with each other now q and K and V they're all the same this just me it's the same sentence so it's this sentence here and it's just an array it's a simple array right so then weend q and K to each other q&k typically have like a positional encoding so you can extract like every other word might have an interesting component in as you're representing it next to each other and then B typically has no positional encoding it's just it's just the original the original tokens so looking at the soft Max uh self attention right flash attention here is the main function all in one this is going to be our most important function here because it's all packed in one giant kernel we're going to do the matrix multiplication on Q and K then we're going to scale that with with a a scale a floating Point uh the default is going to be the square root of the dimension of the number of heads uh oh under one the dot prodct scaling aspect and then of course soft Max is going to be very useful from an output perspective uh as it helps normalize the output this is important you don't have to do this like it's not mandatory you you can skip the soft Max uh however you will very much potentially depending on how you've set up your dictionary have a large range of potential outputs that could significantly blow out the meaning so what you want to do is sort of compact all that output normalizes it's called a normalization function so that way even if you have a a really big number as the output data it'll always be between zero and one and that really big number relatively to all the other output data will be it will always add up to 100 right all those numbers all the outputs add up to 100 so this is normalizing which you it helps a lot you you will want it it's not required but it helps so this is it right here look it's really simple this very straightforward isn't it like soft Max matrix multiplication scale it and then another matrix multiplication that's it oh but it requires so much code in order for that to work it's ridiculous we get a few extra bonuses here it also includes a Dropout for training purposes you get uh ooh is that linear bias attention with linear bias it adds oh look at that this is cool so they give you a few extra bonuses here that you can use which are more modern approaches when you're using a Transformer you're applying your attention whether to apply causal attention mask all right so for autoaggressive modeling that you could apply a mask to the output and they do it in the Cuda which is great because they don't have to do that in CPU layer which sort of throws away half of the but the the typical mask that you'll see it will Zero out half of the output Matrix in a diagonal fashion because there could be some symmetry on either side and that's important because when you're multiplying q and K you're going to have symmetrical information because you're iterating attention between the word the' needs to be attended to the word Main and functions and implemented and then you do the same here with Main and the' you can see there these two words will very quickly be the same output because you're self attending every word to every other word but you'll have duplicates so you can apply a mask because it's uh it you don't need to bother uh multiplying those numbers again further down the line because it's repetitive this one I'm less familiar with it's a window size implementing siding window for local attention oh okay so this could be the attention sizing which might reduce so if you have a very large input you could have a smaller output because you're looking at more potentially uh maybe instead of a a 1 by one where you're comparing it you might have a you might be comparing two words to two words maybe I I mean that's what it sounds like to me attention with linear bias is pretty straightforward it's just uh it's just scaling it based on all right attention with linear bias actually it's a positional sort of like a positional encoding method which is actually pretty great because one of the things that we notice is that it didn't mention the a sort of cosign position on alternating tokens which is a common you know or traditional positional encoding you can use a scaling linear bias positional encoding method and so you don't have to use traditional position embeddings and then you can have the GPU code this for you O Okay so this is why it's even faster than I thought it was before and the good news is it looks like there's a default so you can just pass in qkv straight up as is without having to worry about pre positionally encoding them usually in the CPU right and you need to do that in a CPU because it's an alternation you could do this in Cuda but then you're going to have to add another layer on top of it and then remerge it and there's some extra work to do there which is tricky there's also a deterministic uh backward pass implementation which uses more memory and is slightly slower let's take a quick look at the performance differences that we see in the significant games of flash attention with an all-in-one Mega kernel in the GPU written in Cuda we see that flash attention specifically flash attention 2 is significantly FAS and it makes a lot of sense why because you're merging all that logic into one giant you know math program and that program can run on a GPU inside each of its cores without it having to reach back out to memory and pass data around so you're going to get the most potential performance and you're going to make that GPU really really toasty because you are going to leverage all of its compute power without any sleep cycles waiting for I/O and you can see here it's significantly faster so we're looking specifically we want to look at the best possible scenario most common scenario 2 and that is going to be in the case where we are looking at a full attention forward pass or back pass mostly a forward pass where forward attention and backward speed on an a100 with 80 gb uh with the causal mask and 120 attention Dimension heads and we can see here as your input sequence length goes bigger and bigger we see a nice increase in performance game capability as distributive capacity of flash attention to it's going to just Rise Above the Rest because of its ability to merge all these tasks in one we're going to see slightly less interesting still still some performance gains here compared to Pi torch as is right the standard Pi torch where it has to do some of the work in one of the uh math functions like matrix multiplication then it moves over to Dropout then it moves over to soft max if soft Max is even available in uh it might be it might not be uh available right it's they've imp bled soft Max with flash attention inside of Cuda from what I understand I'm just trying to think of how they go about because when you're working with uh every say say you're working with the Shader and you you're working with a a pixel with you know RGB values maybe Alpha as well so you have a uh a four-dimensional array that's all you really have access to for each core just basically one pixel at a time you understand it's position and location within that texture and you also have the values the rgba values but you don't know any of the others you don't know anything else about any of the others so it's possible that we could be leveraging one of these channels to indicate um a max value so that way we understand how to scale it in a soft Max perspective so there's a potential performance gain there and that's really where you know flash attention comes into the wind is being able to run all that logic in the GPU core kernel directly without you having to jump back over to the CPU memory cuz that latency on the io is what really gets us that's where the next the next Evolution comes in and and this is just for the attention layers not talking about the entire Transformer imagine how cool we can get if we implemented a mega Giga kernel allinone super super core super uh kernel oh that would be so cool so they succeeded with flash attention and just the attention layer the number of Tera flops a second is just phenomenal you just compare this like 23 Tera flops versus 100 right that's a 5x we're seeing like 510 oh that's this is this is beautiful and we're seeing roughly a 5x performance boost just using it as is with this which is just phenomenal that's just over over standard pie torch and with uh and that's flash attention two right but flash attention one you still get a 2X Improvement it's amazing these These are phenomenal numbers just so good and if we take a peek let's just simplify the chart here really quick so we kind of see a little bit better uh this is running on the h100 with 80 GB of RAM you know a nice beefy GPU here we see that pytorch does a pretty good job that's still a lot of Tera flops don't get me wrong right 62 Tera flops a lot of Tera flops but when you compare that with flash attenion and flash attention to you would much prefer to Leverage The Flash tension Mega kernel over a p torch any day I mean just look at this be fantastic performance gain uh so what we can do is we can leverage this approach and this technology uh for more of the layers in a Transformer model to significantly improve the performance and reduce the waste of time it takes as layers transit in between each of the mathematical equations need to run on all the matrices oh check this out we've got something new we got a a new Flash attention minimal that we can take a look at and look at the the Cuda code which looks a lot like C++ I think actually it might just be it is oh it really is its own language uh it's CU right so we've got a Cuda language what we can do is take a look at this flash attention minimal it's an implementation someone wrote in Cuda that takes advantage of the forward pass only not the back pass obviously back path is going to be a little more intense because there's back propagation which requires the inverse but you also have to store the gradients and remember the gradients as you're going backwards so that way you can use those gradient informations right those matrices to find the Deltas of how far off the output was so you can train the model this flash attention minimal does not have that part it only has forward pass which allows you to run inference right so you can just ask questions for the model and the model will give you an answer that's all it can do you cannot train a model with this approach you'll need more than 100 lines of Cuda but if you look at the profiling really quick taking a look at just normal P torch with CPU and Cuda time the profiling on manual attention just regular attention uh algorithm and we know that algorithm looks like this a very simple straightforward algorithm soft Max dot product with q and K which is self attention of a sentence and then you scale it based on uh the number of heads I think it was right the square root of the number of Dimensions over uh under one and then you apply soft Max to that Matrix and then you dot product again this this output with the attention of the sentences again so it's pretty straightforward algorithm really powerful self attention right and look at what you get a significant performance Improvement profiling the minimal flash attention what is this what is that speed all right 4 milliseconds compared to like 52 milliseconds right it's like 13x Improvement wow that's because what happens is you put all of the mathematical operations into one Mega kernel rather than yeah so here's obviously the drawback is it can only do this one thing really well right it's like a super function that takes a bunch of common algorithms into one and makes it one giant uh kernel that can be used and install in all the cores of the GPU so when you're putting the input data in it's going to run all those algorithms together in one pass without having to Transit memory and which that's the slow part the io taking a look at our Cuda oh nice look at that okay so they give us they give us almost everything that we need here it would be nice if the variable names were a little bit little bit extra obviously we're fine with qk and V because we know that's just the input tokens do we have a tension with linear bias that we got with the other one no doesn't look like we have that one thing I'm noticing here straight off is that it since this is only a forward pass kernel and there's no back propagation it seems odd that we have batching because if we're self attending and it's sort of expecting to do what you know large language models do where they run through one pass get the output in a full Transformer and then pin the output to the input again and then run it through again to get the next word right the next token you don't need to batch that batch is only really good for training perspective so this could be you know fairly interesting from a future perspective to support batch but this like one optimization I noticed so we have our thread and block IDs in Cuda world this is going to give us information about where our current core is running within our kernel right we've got our forward kernel it's like which one of these cores on the GPU are we are we located in and we use that information to be able to access and perform mathematical functions on each of the essentially pixels right of of the Matrix and the Matrix is going to be a representation of our input data which is the sentence and the words right the tokens and each one of those uh each one of these threads is going to be able to work just on that one word at like that one input at a time because you don't want to do uh that's how like the parallelization work right that's how it works in a GPU you able to split out all the work uh across multiple cores so it can run simultaneously like all at once let's do a matrix multiplications let's do all that work all at once and then we merge the output boom there we go performance Improvement that's really where the power of the GPU comes in is because it's sort of breaking down it allows you to write code to break down uh as much as possible the amount of work into a finite amount of space you want to be able to spread out that workload and you know get the most out of it uh by making as small of a chunk as possible because then you can just keep scaling up the number of cores in a GPU and you can add more parallelism and then you get really really fast output so I see we're using SRAM here which is going to be really powerful we want to store all of our Ford P all the input data right or qkb we want to store that in SRAM which is a small amount of ram but guess what it's really fast how fast oh let's see here we saw that earlier uh we are seeing upwards of 20 terabytes a second that's a lot you get 20 megabytes but this is going to be that's not a lot right that's not a lot of uh of memory but it's more than enough for the input data right so you're going to have like maybe a document and you're going to tokenize that document which is essentially compressing it compressing it into a dictionary and so you can have a large input size using SRAM alone and SRAM is going to give you the extreme power SRAM is sharable as well so we're going to move all that um input data into SRAM and now all of our cores can access the uh the input this is so cool so now we can even do we can is H soft Max going to be in here they better have it it better be here ah okay yep yeah all right so we're getting our Max we need to find the max we definitely need SRAM for this we need to look over all the srams we're going to get to find the maximum number which allows us to scale oh yep here we go soft Max boom we ourselves a soft Max right there I don't want to read this it's just too all these it's just a bunch of variable names like it's hard to keep track of what all this is this looks like it's okay so it's finishing off the uh self attention scale dot product here leveraging with access to SRAM which is really powerful a really performant memory something that I don't necessarily like is the syns the need to use sync threads here such that uh I mean I I get it I would say probably run that once on import for the import data but inside of a for Loop seems problematic to me uh I don't like this implementation I feel like the kernel can be significantly improved if we eliminate the need for sink threads altogether uh it's especially inside of a for Loop but I like this part I really like this part so we get a Ford path has tensor with qkb this is really powerful because this is the tensor interface that you're going to get in pytorch just a simple forward and all you have to do is pass in qk and V which is the input data right and then you get the output as it executes the kernel uh what work does it do so what work does it do here before it passes it in so on the CPU we calculate uh we we populate uh matrices we create some matrices and then we're going to pass those matrices to the gpu's memory and then we do some preparation to get the the memory amount size uh looks like we also get the soft Max scale what are we doing here all right we're getting Dimensions so we understand what how big the Matrix is going to be for the input data I like this this is nice to have it really is because when I pass in my input data I don't want to also have to pass in all this extra work that the function could just determine itself based on the input data alone I don't want to have to say it's this size and that size thank you for doing this for me and then once Ram is all populated and we uh have a little bit of extra information here we run the kernel GPU function that's this this giant math equation up here and then we get the output oh is a data pointer so if we weren't sure before we are now we can see kind of the idea of a minimal flash attention implementation in Kudo directly you can see that just merges together a bunch of functions into one giant mega kernel and allows also it you know puts all the input data into as shared memory that's really really fast um we could make some significant performance improvements here I do see uh by eliminating sing threads overall that's the whole idea why does Flash attention so effective it merges multiple operations into one kernel so you don't have to waste time on iO that's immensely powerful it's really fast now imagine we can do that for all the rest of the phases our Transformer model making it even faster and that's possible we could totally do that so this proves that we can do that with the rest of the thing and imagine we just did one mega super kernel now we're approaching the future oh and just in case you wondering how pytorch can import this forward past main. CP uh main uh C++ file essentially just includes torch extensions uh and it identifies the uh the Prototype and then it runs pie bind ah neat and then we import okay this is kind of cool so we take the pytorch extensions the C++ extensions we we ort the load function and then we get ourselves that forward pass uh C++ main definition through the flat yep and this minimal attention function is now our forward pass and just in case you wanted to see what the difference was on terms of how much python code you need compared to how much uh c c code you need and Cuda code you need it's a lot more I mean just look at this isn't isn't this amazing it's just you know a few lines of code compared to over 100 lines of code that's required to get it to work on a GPU\n"