Hacks to Make LLM Training Faster - Daniel Han, Unsloth AI

**Advantages of Tensor Cores**

Tensor cores are very useful: HMMA and the related tensor instructions perform packed matrix multiplications directly on the GPU, which cuts per-operation overhead significantly. Routing your matrix multiplications through tensor cores can greatly improve training speed.
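As a minimal sketch (my own illustration, not from the talk), the snippet below shows the usual PyTorch switches that route matrix multiplications onto tensor cores: allowing TF32 matmuls and running the forward pass under bf16 autocast. The layer sizes are arbitrary.

```python
import torch

# Let FP32 matmuls use TF32 tensor cores on Ampere-or-newer GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")

# bf16 autocast makes the underlying GEMMs eligible for HMMA / tensor-core paths.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)
```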

**Choosing the Right GPU**

When choosing a GPU for training, pick one with tensor cores. On Kaggle, where this is a constant question, the Tesla T4s are much faster than the P100s and should be the preferred choice for training.
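A quick way to check whether a GPU has tensor cores at all is a compute-capability test like the sketch below (my own convenience check, not something from the talk): tensor cores arrived with compute capability 7.0, so a T4 (7.5) qualifies while a P100 (6.0) does not.

```python
import torch

name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)

# Tensor cores arrived with compute capability 7.0 (Volta):
# a Tesla T4 reports 7.5, a P100 reports 6.0 and has none.
has_tensor_cores = (major, minor) >= (7, 0)
print(f"{name}: sm_{major}{minor}, tensor cores: {has_tensor_cores}")
```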

**Algorithms for Faster Training**

There are many algorithmic choices that make training faster. One effective approach is a deep and thin network, which can reach high accuracy without increasing the parameter count. Depth is what matters for reasoning, while overall model size matters for knowledge capacity: a deeper network can carry out more reasoning steps, and a larger model can store more knowledge.
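A rough back-of-the-envelope sketch (an assumption of mine, using the standard ~12·d² parameters per transformer block and ignoring embeddings, norms, and biases) shows how depth can be traded for width at a roughly constant parameter count:

```python
# ~4*d^2 (attention) + ~8*d^2 (MLP with 4x expansion) = ~12*d^2 per block;
# embeddings, norms, and biases are ignored. Illustrative only.
def approx_params(d_model: int, n_layers: int) -> int:
    return 12 * d_model**2 * n_layers

wide_shallow = approx_params(d_model=4096, n_layers=16)
thin_deep    = approx_params(d_model=2048, n_layers=64)

print(f"wide/shallow: {wide_shallow / 1e9:.1f}B params")  # ~3.2B
print(f"thin/deep:    {thin_deep / 1e9:.1f}B params")     # ~3.2B
```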

**Training Algorithms**

Several architectural recipes have been found useful for faster training. A GPT-2-style baseline with RoPE positional embeddings and no dropout works well and reduces training time. Gated MLPs with SwiGLU activations can be harder to train according to the literature, although the problem is less pronounced at larger model sizes.
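For reference, here is a minimal SwiGLU gated MLP in the Llama style, written as a sketch (layer names and sizes are my own choices, not a specific model's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Gated MLP with a SwiGLU activation: no dropout, no biases."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(gate(x)) * up(x), then project back down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

mlp = SwiGLUMLP(d_model=512, d_hidden=1376)
out = mlp(torch.randn(2, 16, 512))
```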

**Activation Functions**

Swapping between different activation functions does not significantly change accuracy, and bias terms contribute very little, so both are safe places to simplify in the name of speed.

**Flash Attention**

Flash Attention is another technique that improves training speed. It computes exact attention without ever materializing the full attention matrix, which drastically reduces memory traffic and memory usage while leaving accuracy unchanged.
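In recent PyTorch you usually get this for free through `scaled_dot_product_attention`, which dispatches to a fused FlashAttention kernel on supported GPUs; the sketch below assumes a CUDA GPU and bf16 inputs.

```python
import torch
import torch.nn.functional as F

# Shapes: (batch, heads, seq_len, head_dim). Half precision (fp16/bf16)
# is required for the FlashAttention backend to be eligible.
q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.bfloat16)

# PyTorch picks a fused backend (FlashAttention on supported GPUs), so the
# full seq_len x seq_len attention matrix is never materialized.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```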

**Fine-Tuning Optimizations in Unsloth**

Unsloth applies several optimizations to make fine-tuning faster. Unsloth gradient checkpointing asynchronously offloads activations to system RAM, sharply reducing VRAM usage (which enables long-context fine-tuning) while adding only around 1-2% to training time. Chunked cross entropy computes the loss without materializing the full logits tensor at once. Chained matrix multiplication, bracketing the matrix products in the cheapest order, reduces actual FLOPs without changing accuracy.
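The sketch below illustrates the chunked cross-entropy idea: the loss is computed over slices of the sequence so the full (seq_len × vocab) logits tensor never exists at once. It mirrors the idea described in the talk but is my own simplified version, not Unsloth's implementation; all names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, lm_head_weight, labels, chunk_size=1024):
    """Cross-entropy for an LM head without materializing all logits at once."""
    total_loss, total_tokens = 0.0, 0
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]        # (chunk, d_model)
        y = labels[start:start + chunk_size]        # (chunk,)
        logits = h @ lm_head_weight.t()             # (chunk, vocab), freed each iteration
        total_loss += F.cross_entropy(logits, y, reduction="sum")
        total_tokens += y.numel()
    return total_loss / total_tokens

hidden = torch.randn(8192, 1024)                    # flattened (seq, d_model) activations
weight = torch.randn(32000, 1024)                   # hypothetical LM-head weight (vocab, d_model)
labels = torch.randint(0, 32000, (8192,))
loss = chunked_cross_entropy(hidden, weight, labels)
```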

**Character AI and Prompt Caching**

The Character.AI team recently published a blog post showing how they made inference fast and cheap. They combine multi-query attention, global attention in only every sixth layer (with sliding-window attention in the rest), cross-layer KV sharing, and prompt (prefix) caching. These are inference-time optimizations rather than training tricks, but they cut serving cost dramatically.
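As a toy illustration of that layer layout (the names, counts, and exact pattern are my assumptions, not Character.AI's code), a serving configuration might look like this:

```python
# Hypothetical layer plan in the spirit of the Character.AI write-up.
N_LAYERS = 24
GLOBAL_EVERY = 6          # global attention in every 6th layer
SLIDING_WINDOW = 4096     # local attention window for the other layers
N_KV_HEADS = 1            # multi-query attention: all query heads share one KV head

layer_plan = []
for i in range(N_LAYERS):
    attn = "global" if (i + 1) % GLOBAL_EVERY == 0 else f"sliding_{SLIDING_WINDOW}"
    kv_cache_group = i // 2   # cross-layer KV sharing: adjacent layers reuse one KV cache
    layer_plan.append({"layer": i, "attention": attn,
                       "kv_heads": N_KV_HEADS, "kv_cache_group": kv_cache_group})

for entry in layer_plan[:8]:
    print(entry)
```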

**Serving with vLLM**

For inference, everyone should use vLLM. It has long supported prompt (prefix) caching, which reuses the KV cache for shared prompt prefixes, as well as chunked prefill; Claude's API announced prefix caching recently, but vLLM already had it.
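A minimal vLLM sketch with both features turned on might look like the following; the model ID is just an example, and the exact flag names and defaults can vary between vLLM versions.

```python
from vllm import LLM, SamplingParams

# Flags are illustrative; check your vLLM version's engine arguments.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model id
    enable_prefix_caching=True,                # reuse KV cache across shared prompt prefixes
    enable_chunked_prefill=True,               # split long-prompt prefill into chunks
)

params = SamplingParams(max_tokens=64)
outputs = llm.generate(["Explain why prefix caching helps shared system prompts."], params)
print(outputs[0].outputs[0].text)
```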

**Fusing Kernels**

Fusing kernels also improves training speed. Unsloth fuses the RMS LayerNorm, the RoPE embeddings, SwiGLU, and the LoRA matrix multiplications (for fewer FLOPs), and uses torch.compile as well. These fusions cut kernel-launch overhead and memory traffic while leaving accuracy unchanged.
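Unsloth's fusions are hand-written Triton kernels; as a simpler stand-in, torch.compile can already fuse the elementwise chain inside an RMSNorm into far fewer kernel launches, as in this sketch (assumes a CUDA GPU):

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm: scale features by the reciprocal root-mean-square.
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

# torch.compile fuses the elementwise ops above into fewer kernels than eager
# mode; Unsloth goes further with hand-written Triton kernels.
fused_rms_norm = torch.compile(rms_norm)

x = torch.randn(4, 2048, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.ones(4096, device="cuda", dtype=torch.bfloat16)
y = fused_rms_norm(x, w)
```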

**Recent Trends**

The recent trend in the field is to train on high-quality data, such as the FineWeb dataset: the more high-quality data you have, the faster training converges. Great work has been done by the Hugging Face team on this.
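If you want to poke at FineWeb yourself, a streaming sketch like the one below avoids downloading the full corpus; the subset name is an example and the available configs may change.

```python
from datasets import load_dataset

# Stream a small FineWeb sample instead of downloading the whole corpus.
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)

for i, row in enumerate(fineweb):
    print(row["text"][:200])
    if i == 2:
        break
```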

**Future Directions**

Looking ahead, FP4 and FP6 are expected to be hard to train with; research papers already show they are complicated to use, and even FP8 training is not yet fully solved. Training in these formats is probably possible with the right techniques, but it will be difficult. Expect more sliding-window attention in place of full global attention, and a continued shift toward higher-quality data and deeper, thinner models.

**Hardware Limits**

A hotter take: hardware is close to its limits. Shrinking the bit representation has been the main driver of performance gains, but FP4 is near the floor; there is no meaningful float-2 or float-1 format to move to next. The next steps are making low-precision training (FP8 and below) actually work, and squeezing more out of algorithms, kernels, and data rather than out of the number format alone.

**Conclusion**

In conclusion, using tensor cores, choosing the right GPU, and selecting the right algorithms can significantly improve training speed. Fine-tuning optimizations such as chunked cross entropy and Unsloth gradient checkpointing reduce memory and FLOPs while maintaining accuracy, and fusing kernels, along with serving through vLLM on the inference side, helps further. As the field continues to evolve, we can expect more advances in hardware and new techniques for training models at lower precision.

**Resources**

For those interested in learning more about tensor cores and training algorithms, there are many resources available online, including our GitHub package and Twitter accounts. We also upload our models to Hugging Face, and bitsandbytes is excellent for 4-bit quantization. Our team is always happy to chat and share best practices for training models.

"WEBVTTKind: captionsLanguage: enso hello I'm Daniel from anso and today I'm going to tell you hacks to make llm training faster um so if you don't know who I am um I tweet on Twitter so if you want to follow my Twitter account um that also works um so I like found eight bugs in Gemma I help fix them I also found like some bugs in llama Mist tokenization issues um yeah and I love finding bugs um and technically unof is a brother team so my brother's also here if you want to talk to him um we just came from Australia like two months ago um and yeah SF great so the famous you know plot like everyone knows this image right the scaling laws um the more you train the lower the loss um yeah so like the question is how do we make this like process faster so I don't know if you saw this image before but like you know gpus are getting faster and faster there's more transistors um but like it's not actually the gpus itself is getting faster there's actually methods behind them that make it faster so there are like I'm going to talk about like six different things um the first one is bit representation so if you want to reduce the bit representation you can actually make stuff faster and also reduce vram Hardware is also very important so tensor calls makes the process faster um algorithms is also very important um that makes trading faster um and it doesn't like you know reduce accuracy kernels fusing and high quality data has been recently a Hot Topic um the fine web data set um so yes that's very useful as well and I'm also going to talk about the future as well what I think like hypothesis and stuff um that is probably going to be in the future so firstly on bit representation right so like in the olden days everyone just uses float 32 does everyone remember that um so you know exponent 8 Bits manesa 23 bits generally speaking this is just a rough estimate mantisa squar is how many transistors approximately the GPU like will use right so like 23 S you know 529 Plus 8 so the um you know the approximate complexity is like 537 transistors right it's it's not exact but like it's around there right float 16 reduced the mentia a lot um the main the main importance is you need to reduce the mantisa bits because you need to have squared the number of mantisa um transistors right so like float 16 is essentially five times faster than float 32 right not two times it's five because there's you know five times less transistors B FL 16 is actually technically faster than float 16 but if you look at the actual numbers for like Nvidia gpus AMD gpus float 16 and B flat 16 the ter flops are like similar um it's still a I I'm not sure like you know maybe B 16 should have more ter flops but I'm assuming they like use the same um circuits and stuff like that and it's probably easier to put into the GPU floate is also a new um you know measure that people really like now um that you know there's like two different formats there's you know four exponent and there's a three uh there's a five and so the transistors are actually different from each um and also there's a new format for the b100s right so like they float four um and it's actually not Flo 4 there's actually scaling factor for each 32 numbers um so that's actually supposedly 180 times less transistors um but it doesn't mean it's 180 times faster right so like there's heat there's other things that's um there and also like you know 1.58 bit um that's a you know everyone's talking about 1.58 bit um in my take is actually it's not going to be that much faster 
than float 4 um because the number of Manti bits is already one so you can't actually go that much faster anymore also there is like a physics paper called the physics of llm part 3.3 I I think it's from the The Meta team as well they show that if you do quantization until in8 um you don't lose any accuracy but if you go to in four you actually lose two times um knowledge um capacity um the trick is is you have to add Laura and Cura so when you find tune using in four you can then recover all the accuracy so tensor calls you know Nvidia tensor calls they're very useful um especially the hmma and like all the tens instructions they essentially pack they can do maj multiplications in the actual GPU and this reduces overhead a lot um so yes definitely use the tentacles um please do not use the p100s just use the Tesla t4s on kagle this has been a constant question I always get Tesla t4s and kagle are much faster than P 100s um algorithm so now on to algorithms so if there are lots of algorithms that you can do to make training faster for example if you swiglo um if you make um deep and thin networks you can essentially have high accuracy and and the number of parameters doesn't change so there are many methods to make um you know train training faster as well um I think I tweeted out this recently um depth is necessary for reasoning and model size Matters for knowledge so the you know the deeper your network the more the more reasoning it can do um you know quotation reasoning um but but if you make the model larger you can also like have knowledge capacity um so I tweeted about this as well um and so essentially what this means is you make your model as thin as possible make it very deep um and it can still have the same time time time complexity and it's good for reasoning so for other algorithms um gb2 plus rope plus no Dropout is very useful gated MLPs and Swig glue they're actually very hard to train um according to the paper so if you're larger models it's not that bad um if you use different activation functions it doesn't really change accuracy biases doesn't really do anything and yes you definitely must use Flash attention for unso so we actually make training faster as well fine tuning um we um something called unso grading checkpointing um which essentially offloads the activation to system Ram asynchronously um this only increases um the you know time for training by like one to two% chunked cross entropy is also very good um chain matrix multiplication actually reduces actual flops um it does not reduce accuracy but you can actually reduce flops if you bracket correctly so for example if you use unsoft grading checkpointing um you get like the green line um which is your memory usage and you can essentially make memory usage much de decreased and you can do long context fine tuning as well character AI released I don't know if you guys saw the character AI blog they showed how they like made inference FAS so if you use like multiquery attention um you know six Global attention plus the rest lighting window cross layer KV sharing and also VM you know everyone should use VM they have prompt caching pref C prefix caching um you know like Claude said that they have you know prefix caching VM already had it um and chunk prefill so definitely use VM for kernels also you should fuse all the kernels for example in unof we do RMS Lor fusing rope embedding fusing fuse lower for Less flops SG glue and also we um Torch compell Fantastic as well so um we do that a lot of this in unop as well 
and also yes the recent trend is high quality data right the F web data said The more high quality data set you can get the you know the faster the training is um so great work for the hugging face team from this um so yes definitely use high quality data okay so for my hypothesis for the future my view is float 4 and Float 6 will be actually very hard to train um there is actually research papers to show that float 6 and Float 4 is a bit complicated to do um we still haven't solved the float 8 problem so let see if we can do that um it probably is possible to train but it's going to be hard um also more people will use more sliding window attentions they would use Global attenion but I think that will decrease people will more shift Focus to more data better high quality data and also um deeper models so there will be more layers um thinner models but more layers um and essentially this does not reduce the number of flops and also my this is my hot take but I think Hardware is kind of at its limits um the bit representation was the main driver of performance and like you can go to float two float one and then like what's next float zero like that's not correct so float four is like you know the very limits of Hardware um yeah and so definitely check us out we have a GitHub package as well I'm on Twitter you can definitely talk to me and my brother as well um and yes we love llama we upload all the Llama models bits and bites for a bit quantization I think Tim's de bits and bites very good um you can reduce you can make downloads four times faster as well um and yeah we upload our models on hugging face um thanks for listening yeahso hello I'm Daniel from anso and today I'm going to tell you hacks to make llm training faster um so if you don't know who I am um I tweet on Twitter so if you want to follow my Twitter account um that also works um so I like found eight bugs in Gemma I help fix them I also found like some bugs in llama Mist tokenization issues um yeah and I love finding bugs um and technically unof is a brother team so my brother's also here if you want to talk to him um we just came from Australia like two months ago um and yeah SF great so the famous you know plot like everyone knows this image right the scaling laws um the more you train the lower the loss um yeah so like the question is how do we make this like process faster so I don't know if you saw this image before but like you know gpus are getting faster and faster there's more transistors um but like it's not actually the gpus itself is getting faster there's actually methods behind them that make it faster so there are like I'm going to talk about like six different things um the first one is bit representation so if you want to reduce the bit representation you can actually make stuff faster and also reduce vram Hardware is also very important so tensor calls makes the process faster um algorithms is also very important um that makes trading faster um and it doesn't like you know reduce accuracy kernels fusing and high quality data has been recently a Hot Topic um the fine web data set um so yes that's very useful as well and I'm also going to talk about the future as well what I think like hypothesis and stuff um that is probably going to be in the future so firstly on bit representation right so like in the olden days everyone just uses float 32 does everyone remember that um so you know exponent 8 Bits manesa 23 bits generally speaking this is just a rough estimate mantisa squar is how many transistors 
approximately the GPU like will use right so like 23 S you know 529 Plus 8 so the um you know the approximate complexity is like 537 transistors right it's it's not exact but like it's around there right float 16 reduced the mentia a lot um the main the main importance is you need to reduce the mantisa bits because you need to have squared the number of mantisa um transistors right so like float 16 is essentially five times faster than float 32 right not two times it's five because there's you know five times less transistors B FL 16 is actually technically faster than float 16 but if you look at the actual numbers for like Nvidia gpus AMD gpus float 16 and B flat 16 the ter flops are like similar um it's still a I I'm not sure like you know maybe B 16 should have more ter flops but I'm assuming they like use the same um circuits and stuff like that and it's probably easier to put into the GPU floate is also a new um you know measure that people really like now um that you know there's like two different formats there's you know four exponent and there's a three uh there's a five and so the transistors are actually different from each um and also there's a new format for the b100s right so like they float four um and it's actually not Flo 4 there's actually scaling factor for each 32 numbers um so that's actually supposedly 180 times less transistors um but it doesn't mean it's 180 times faster right so like there's heat there's other things that's um there and also like you know 1.58 bit um that's a you know everyone's talking about 1.58 bit um in my take is actually it's not going to be that much faster than float 4 um because the number of Manti bits is already one so you can't actually go that much faster anymore also there is like a physics paper called the physics of llm part 3.3 I I think it's from the The Meta team as well they show that if you do quantization until in8 um you don't lose any accuracy but if you go to in four you actually lose two times um knowledge um capacity um the trick is is you have to add Laura and Cura so when you find tune using in four you can then recover all the accuracy so tensor calls you know Nvidia tensor calls they're very useful um especially the hmma and like all the tens instructions they essentially pack they can do maj multiplications in the actual GPU and this reduces overhead a lot um so yes definitely use the tentacles um please do not use the p100s just use the Tesla t4s on kagle this has been a constant question I always get Tesla t4s and kagle are much faster than P 100s um algorithm so now on to algorithms so if there are lots of algorithms that you can do to make training faster for example if you swiglo um if you make um deep and thin networks you can essentially have high accuracy and and the number of parameters doesn't change so there are many methods to make um you know train training faster as well um I think I tweeted out this recently um depth is necessary for reasoning and model size Matters for knowledge so the you know the deeper your network the more the more reasoning it can do um you know quotation reasoning um but but if you make the model larger you can also like have knowledge capacity um so I tweeted about this as well um and so essentially what this means is you make your model as thin as possible make it very deep um and it can still have the same time time time complexity and it's good for reasoning so for other algorithms um gb2 plus rope plus no Dropout is very useful gated MLPs and Swig glue they're actually very 
hard to train um according to the paper so if you're larger models it's not that bad um if you use different activation functions it doesn't really change accuracy biases doesn't really do anything and yes you definitely must use Flash attention for unso so we actually make training faster as well fine tuning um we um something called unso grading checkpointing um which essentially offloads the activation to system Ram asynchronously um this only increases um the you know time for training by like one to two% chunked cross entropy is also very good um chain matrix multiplication actually reduces actual flops um it does not reduce accuracy but you can actually reduce flops if you bracket correctly so for example if you use unsoft grading checkpointing um you get like the green line um which is your memory usage and you can essentially make memory usage much de decreased and you can do long context fine tuning as well character AI released I don't know if you guys saw the character AI blog they showed how they like made inference FAS so if you use like multiquery attention um you know six Global attention plus the rest lighting window cross layer KV sharing and also VM you know everyone should use VM they have prompt caching pref C prefix caching um you know like Claude said that they have you know prefix caching VM already had it um and chunk prefill so definitely use VM for kernels also you should fuse all the kernels for example in unof we do RMS Lor fusing rope embedding fusing fuse lower for Less flops SG glue and also we um Torch compell Fantastic as well so um we do that a lot of this in unop as well and also yes the recent trend is high quality data right the F web data said The more high quality data set you can get the you know the faster the training is um so great work for the hugging face team from this um so yes definitely use high quality data okay so for my hypothesis for the future my view is float 4 and Float 6 will be actually very hard to train um there is actually research papers to show that float 6 and Float 4 is a bit complicated to do um we still haven't solved the float 8 problem so let see if we can do that um it probably is possible to train but it's going to be hard um also more people will use more sliding window attentions they would use Global attenion but I think that will decrease people will more shift Focus to more data better high quality data and also um deeper models so there will be more layers um thinner models but more layers um and essentially this does not reduce the number of flops and also my this is my hot take but I think Hardware is kind of at its limits um the bit representation was the main driver of performance and like you can go to float two float one and then like what's next float zero like that's not correct so float four is like you know the very limits of Hardware um yeah and so definitely check us out we have a GitHub package as well I'm on Twitter you can definitely talk to me and my brother as well um and yes we love llama we upload all the Llama models bits and bites for a bit quantization I think Tim's de bits and bites very good um you can reduce you can make downloads four times faster as well um and yeah we upload our models on hugging face um thanks for listening yeah\n"