Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Large language models have been a topic of significant interest in artificial intelligence research, and the Switch Transformer, proposed by William Fedus, Barret Zoph, and Noam Shazeer of Google Brain, has drawn particular attention by scaling to roughly a trillion parameters. The model builds on the mixture-of-experts (MoE) idea: the feed-forward block of each Transformer layer is replaced by a set of expert feed-forward networks, and a learned router sends every token to exactly one expert. Because each token passes through only a single expert, the parameter count can be scaled up enormously while the FLOPs of a forward pass stay constant, which is why the trillion parameters are not directly comparable to the 175 billion dense parameters of a model like GPT-3.
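As a rough illustration of that routing, here is a minimal sketch of a switch feed-forward layer in PyTorch. It shows only the learned router, hard top-1 routing, and gate scaling; the load-balancing auxiliary loss and the capacity factor from the paper are omitted, and all names and sizes are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Sketch of a switch feed-forward layer: each token is routed to exactly
    one expert, so per-token compute stays constant as experts are added."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)  # routing matrix W_r
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)    # (tokens, num_experts)
        gate, expert_idx = gate_probs.max(dim=-1)         # hard top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # scale the chosen expert's output by its router probability
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Calling SwitchFFN(d_model=512, d_ff=2048, num_experts=4) on a matrix of token vectors returns a tensor of the same shape, with each token transformed by whichever single expert its router favored.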
According to the authors, the Switch models were designed to surpass FLOP-matched T5 baselines in log perplexity and on downstream tasks, and the billion-parameter variants do so by a wide margin. The trillion-parameter model itself is less impressive: it underperforms the smaller Switch-XXL, largely because the goal was to reach a trillion parameters at all rather than to find the best trade-off between hyperparameters.
Training at that scale is also difficult in practice, so to reach a trillion parameters the authors reduced the number of attention heads and layers while greatly increasing the number of experts. Because every token is still routed to only one expert, this keeps the computational cost per token low even as the parameter count grows, which is the central trade-off the architecture exploits; the short calculation below illustrates the arithmetic.
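The following back-of-the-envelope calculation, using purely hypothetical layer sizes, shows why adding experts multiplies the feed-forward parameter count while leaving per-token FLOPs untouched.

```python
# Hypothetical layer sizes, for illustration only.
d_model, d_ff = 1024, 4096
expert_weights = 2 * d_model * d_ff        # parameters of one expert (two linear maps)

for num_experts in (1, 8, 64, 512):
    total_ffn_params = num_experts * expert_weights
    flops_per_token = 2 * expert_weights   # each token passes through exactly one expert
    print(f"{num_experts:4d} experts: {total_ffn_params / 1e6:8.1f}M FFN params, "
          f"{flops_per_token / 1e6:6.1f}M FLOPs per token")
```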
Even so, the trillion-parameter model remains weaker than the Switch-XXL, which the authors attribute to the trade-offs made to reach that parameter count: raw parameter count alone is not enough without a sensible balance of heads, layers, and experts. Performance also varies with the specific task and dataset, and the reported speedups were demonstrated in the very large regime of billions of parameters and giant corpora rather than at small scale.
Beyond the architecture itself, the paper describes several techniques for making large sparse models train stably. The first is selective precision: activations are sent between machines in a 16-bit format to keep communication cheap, but the numerically sensitive router computation is cast up to float32 locally and the result is cast back down to 16 bits before being sent on. This balances communication cost against numerical stability.
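A minimal sketch of that idea in PyTorch follows; the bfloat16 communication format and the function name are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def route_in_selective_precision(x_bf16: torch.Tensor, router_weight: torch.Tensor) -> torch.Tensor:
    """Compute router probabilities in float32 while keeping the tensors that
    cross machine boundaries in bfloat16 (names and formats are illustrative)."""
    x32 = x_bf16.float()                        # cast up on arrival at the router
    logits32 = x32 @ router_weight.float().t()  # sensitive matmul and softmax in float32
    probs32 = torch.softmax(logits32, dim=-1)
    return probs32.to(torch.bfloat16)           # cast back down before communicating
```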
The second technique is selective dropout, or expert dropout. Rather than applying one dropout rate uniformly across the network, the authors use a considerably higher dropout rate inside the expert feed-forward layers than elsewhere. Because each expert is only used sparsely, it can tolerate stronger regularization, and the authors find that this improves performance.
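A small sketch of how two different dropout rates might be wired up; the specific rates of 0.4 and 0.1 are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class ExpertFFN(nn.Module):
    """One expert feed-forward network with a higher internal dropout rate
    than the dense parts of the model would use."""

    def __init__(self, d_model: int, d_ff: int, expert_dropout: float = 0.4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(p=expert_dropout),  # larger rate inside the sparsely used expert
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

dense_dropout = nn.Dropout(p=0.1)          # smaller rate for the rest of the network
```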
The third technique is a more careful weight initialization: scaling the default Transformer initialization down by a factor of 10 leads to noticeably more stable training. It is striking that, after years of Transformer research, something as simple as initialization can still make or break a model at this scale.
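One way such an initialization might look, sketched in PyTorch; the truncated-normal parameterization and the exact scale formula are assumptions for illustration.

```python
import torch
import torch.nn as nn

def small_scale_init_(weight: torch.Tensor, scale: float = 0.1) -> None:
    """Initialize a weight matrix from a truncated normal whose scale is a
    factor of 10 smaller than usual (scale=0.1 rather than 1.0). Illustrative only."""
    fan_in = weight.shape[1]
    std = (scale / fan_in) ** 0.5
    nn.init.trunc_normal_(weight, mean=0.0, std=std, a=-2.0 * std, b=2.0 * std)

layer = nn.Linear(1024, 4096)
small_scale_init_(layer.weight)
```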
One of the most interesting results is that these large sparse models can be distilled into much smaller dense models while retaining part of their quality. The authors distill a sparse Switch model into a dense model the size of T5-Base, and the distilled student outperforms a T5-Base trained from scratch.
In doing so they compress away well over 90 to 95 percent of the parameters while preserving roughly 30 percent of the sparse teacher's quality gains over the dense baseline. Being able to shrink a trained sparse model into a dense one that is cheap to distribute and deploy is a significant practical advantage.
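As a rough sketch of how such a distillation objective is commonly set up (the mixing weight and temperature below are illustrative, not the paper's reported values):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      alpha: float = 0.75,
                      temperature: float = 1.0) -> torch.Tensor:
    """Mix the usual cross-entropy on hard labels with a KL term that pulls the
    dense student toward the sparse teacher's soft predictions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1.0 - alpha) * hard + alpha * soft
```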
In conclusion, sparsely activated models like the Switch Transformer have become an active area of research. Their effectiveness depends on trade-offs between hyperparameters, memory, and communication costs, and the three stability techniques explored in the paper, selective precision, expert dropout, and scaled-down initialization, are what make training at this scale workable.
The ability to distill large sparse models into smaller dense versions while retaining a meaningful portion of their gains is an exciting direction, because trained models can then be distributed and used in a wide range of natural language processing applications and other areas of AI.
Code, or at least pseudocode, for the model is available, giving others a starting point to explore and build on this work. The implementation relies on Mesh TensorFlow to shard the experts across machines, and together with selective precision and expert dropout it illustrates the kind of engineering needed to train large sparse language models efficiently.
Overall, the trillion parameter model is an interesting example of how researchers are pushing the boundaries of what is possible with large language models. While it has its limitations, the approach used by the researchers provides a promising direction for future research and development in this area.
"WEBVTTKind: captionsLanguage: enhi there today we'll talk about switch transformers scaling to trillion parameter models with simple and efficient sparsity by william fetus barrett zoff and noam shazier of google brain so as you can see right off the title we're going towards trillions of parameters gpt3 had 175 billion parameters this paper claims to have a model with a trillion parameters now is it really five times bigger or ten times bigger than gpt3 that's a debatable question because the trillion parameters are not used in the same way as in a classic transformers uh they are used actually in a sparse way that's why the word sparsity is in here and the way they are used in sparse manner is this new architecture called the switch transformer it's not entirely new the it's built on mixture of experts uh in this paper that's also called m-o-e that has been around for a while and we're going to see what that is now on a high level switch transformers takes mixture of experts to an extreme in that it is a transformer um and the feed forward layer is divided up into these experts and the switch transformer only routes each token to one expert only that's the sparse part so the mixture of experts previously they always claimed you need at least two experts in order to get a stable training signal the switch transformer manages to get it down to a single expert so it's like a hard routing of information to just a single endpoint per layer of each token so in that that means you can now scale the experts and you can scale the number of parameters in the model without making the model compute more that's a very special notion so you can up the parameters of the model but if a forward pass of a data point will still have the same amount of flops that it needs to forward propagate through the network very special architecture right here so yeah that's why i'm saying trillion parameters not necessarily comparable to the 175 billion parameters of something like gpt3 so how do they do it because previously was claimed it was unstable they have new ways of making the training stable such as selective dropout selective casting of parameters to different precisions and a better initialization so that's the high level overview of the paper and we'll dive into it we'll explore kind of what mixture of experts is and how the model works and what turns out it's a very long paper as you can see when papers have a table of uh that's a lot of fun uh but it's a lot of engineering as well and we're mostly interested in the model here what it can do and what does it how does it sort of fit in to the big world of transformers and language models and so on last thing i want to say trillion parameters is you know it's a catchy title that most of the paper they don't work with trillion parameter models they work with models in the in the order of billions of parameters and at the end they build a model with a trillion parameters it doesn't do as well as their models with in as their smaller models uh they also it feels like they don't put that much you know work into it because it's probably also quite uh fuzzy and expensive but just know we're not going to have trillion parameter models around anytime soon just yet interesting fact the original resnet paper also built a 1 000 layer convolutional neural network even though the resnets we have today you know they are maybe 50 or 150 layers deep they did build a thousand layer model so maybe compare it a bit to that one it's just like we can do it not necessarily we 
need to so here you can see something they discover the curve on the left is very very known to people that are in the language model game let's say or in the in the let's scale up ai game and that is as you increase the size of the model the loss will go down and that's loss as i understand it so that's test loss um i believe that is perplexity so scaling properties exactly that that might be perplexity or test loss on some uh downstream task in any way as you scale up the model parameters the model gets better and better and better the interesting thing right here is twofold first of all i believe they do hold the data set constant so the data set is always the same the amount of compute you put into it the amount of either number of steps or time is also always the same and in this specific case the amount of flops per forward pass is also the same the only thing that changes is the number of parameters again it's very special to have a model where you can scale up the number of parameters yet the flops required to forward propagates stay the same so you can see here that there is a almost unhalted decrease here it flattens out a little bit towards the bottom though that is not necessarily does not necessarily mean it will ever flatten out before it's you know at zero um i will approach zero i guess so and you can you can see that you know they scale up the model quite a bit and also their main comparison here is the t5 base so that's the text to text transfer transformer by the way if you don't know what a transformer is or what a language model is um it's best you you go back to my earlier videos and look up like the the gpt3 paper or the attention is all you need paper i've made videos about lots of these things i assume that you know them you can see right here that if you compare to number of training steps for example the these switch models all of them no matter how big they are they provide massive gains over like something like a t5 and they also do this in time so this paper is very much about trade-offs you do require more storage for your weights so you have to have more memory more ram however that memory can be distributed it can be sharded because they use this mesh tensorflow library to implement the switch transformers and because their model has this sparsity they can efficiently shard the model so you trade off more memory which can be shorted but what you gain is training speed in both in terms of time and number of training steps required so you are much more efficient note that this only all of this holds in this super large regime right we this is they say they've also discovered these speed ups in smaller models but you know as far as the paper is concerned we are talking about millions hundreds of millions of parameters billions of parameters even to trillion of parameters together with these giant corp corpora of of text so that's sort of the regime we are in and the results do not necessarily transfer down to the lower scale problems that you know you might face with your lonely one colab in the corner all right so in a transformer you have a transformer is nothing else but a bunch of these layers right here this is this is in in itself a transformer layer uh in its basic form and it consists of sort of two parts it consists of this self attention right here now that's the standard transformer self attention that's the what was introduced in attention is all you need and what's been used ever since in all the transformers um this one right here is a is an uh 
as i understand it a language model so you know this this is very standard however after the self attention you have this feed forward layer now usually what you do is you have an input sequence and you transform that through multi-head attention into another sequence right here okay and then what you do is you take each of these things and feed them through a feed-forward layer and if i as i understand it this feed-forward layer is simply you know a regular feed forward layer that you would find in a neural network and you pass them you pass these things individually so this here it's a vector you pass it through here and boom that becomes the next layer representation this thing right here you pass it through as well boom that becomes this one and so on right you pass them individually to get the next layer representation so this this part right here the attention part it sort of aggregates information and relates the individual items of the sequence to each other and transforms them into you know a new sequence where sort of all the every token can gather information from every other token that's what the attention mechanism does that's step one in step two every token is isolated every token is for itself and the feed forward layer simply determines you know what's given one token given token number one what is you know given its representation in this layer what is the best representation for the next layer okay so that's token number one of the next layer so the multi-head attention is kind of relating uh tokens to each other and the feed forward layers they are relating layers to each other okay so up here you would have the next multi-head attention layer so you can see the feed forward layer as sort of translating from one layer to the next layer right getting saying ah you come from this layer i'm going to translate you such that the next layer understands you and that happens on a token by token basis now you can see this is it's always the same feed forward layer for all the tokens right the tokens are sort of treated like a batch of samples the idea of this switch transformer and also of the earlier mixture of experts transformer is that it might not be a good idea to have only a single one right this is the only feed forward layer it's the same for all the tokens it might actually be a good idea to have a couple of them that sort of specialize in different things so what could that be you know in a in a basic world uh this could just be like one for nouns and this could be a feed-forward layer for verb verbs tokens that are verbs tokens that are adjectives and sort of maybe here is like punctuation tokens right you might um think well if you are a noun token the next layer might want to look differently at you than if you are a punctuation token right so this translation from one layer to the next layer can now happen dependent um on what the token represents right now we we of course first of all we don't have these annotations and second it's not necessarily that you know we want to always divide it by noun verb adjective punctuation ideally we want to learn this routing so we simply want to say look instead of just one feet forward layer we uh give the model four feet forward layer fee for layer one two three and four and for each token the model can decide to which of these feed forward layer it sends the token to so here you can see this is a token now you know we are dealing with word pieces let's just say the word more i was like i was thoroughly confused by when i saw 
this like huh why does it say more parameters but here it's the string more right and the string parameters and these are in the vocabulary and they get an embedding vector associated with them uh so that's what's going on here then they go through self-attention as you can see here both go through self-attention and then each one of them is routed to one of these four experts now the the one here the one on the left and the one on the right these are the same experts right they're just duplicated uh visually here but these would be the same weight matrices in there so you have four feet forward layers in this layer and each token can be routed to any one of them and this routing here this is learned so in here you have a matrix they call it like wr and using wr you simply do an inner product of wr with your input right here let's call that h with your input h i guess they use h for a different thing i think they they call this x again so you do this with x and then you get you get h which is your routing and then you simply build a histogram you normalize the histogram i think with the softmax and that those are your routing weights so it's very much like another attention mechanism except that the queries um this thing here these are like the queries these are sort of the queries of this attention mechanism and this here these are the keys and the values so that's the keys and the values of this attention mechanism uh the queries are just learned so the queries are not dynamically generated and the keys and values they are not um yeah it's a it's a weak analogy but you can sort of think of it like this so there is this routing mechanism um and it decides where a token gets goes to now as you can see the router is soft that means there is never a one or a zero right here there's always kind of a number in between but they hard clip that so they hard clip it they just route it to the maximum as you can see here number two is the maximum and they just route it to number two they don't route it proportionally or anything they just to take rmax and they route it through they do multiply the output by the actual number that they got out here so if the router is unsure that the output is less uh if the router is sure the output is more but this hard routing is what's the key right here and that means you know before before you'd have one feet forward layer so any token that goes forward goes through one feet forward layer if you do a mixture of experts in the classic sense and you route it in a soft way you now have four feet forward layer so every token goes through four um of these computations so you've basically multiplied the amount of computation by four because you've multiplied the amount of parameters by four right you have four times as many parameters now when you do this argmax routing like the switch transformer you have multiplied the number of parameters in your model by four but any token will still only incur one feed forward layer that means you keep the amount of computation that you do per forward pass the same and that's that's sort of the the key right here so now they can scale up massively the number of experts while still keeping the amount of flops the same and notably you also don't need a any data transfer in between the experts uh every every expert can be can you know receive their tokens and then do their independent work so you can efficiently shard this across many many machines this is how this looks so in in this case you have uh three experts and your 
sequences are of line of length six so you wanna sort of route each token there and there can be overflow like every token is independently routed so it can happen something like this that a you know a token like three token gets routed to one expert but it only has space for two tokens and they have some tricks like they have this capacity factor right here or they can reroute these are very much engineering things which are important but you know they don't change the sort of final final result now i want to go down here where they have a display of this sharding uh more like an explanation of the sharding which i think is very uh illustrative so how what do they essentially do if you think of many machines you have 16 machines so each little square here is one machine okay um here are the different ways of how you can shard a model and model sharding you know we are not going to build a machine anytime soon that can hold a trillion parameters just not gonna happen okay so you need to somehow shard the model or the data or both and these are the different ways how you can do it so if you use data parallelism that is the easiest that is also directly built into things like pi torch and so on what you do is so the top row shows how the model weights are split and the bottom row shows how the data is split so how to read this is when you do data parallelism the weights are split such that each of the 16 cores has the same weights you see so this these weights right here are the same as these weights are the same they're all the same so this is shorted the data is run so that you take a data set you take a batch of data and now you distribute this data point goes here this data point goes here this data point goes here and so on you distribute the data and you do the forward propagation and at the end you sort of gather them again right so you gather them together again if because you have to you know calculate your gradient okay so that's data parallelism the model is spread out and if you want to do an update to the model then you need to communicate around these weights okay so all these different pieces have to then communicate with each other when there's a weight update if you do data parallelism um here is how the data split we've already seen this so one piece this piece of data is split over 16 cores so you can see like this core right here only has this little piece of the data and not all of the data on the other hand you can do model parallelism in model parallelism you can see it's exactly the other way around namely that one core only has a little piece of model right and but every core gets all of the data so this data here the bottom row is data all of the data the point here is that if you do model parallelism that's what you do when the model itself doesn't fit right over here the model fits on your machine but not the whole batch at the same time model parallelism you do when the model itself doesn't fit what you have to do is you have to take your data right and you have to send it sequentially so maybe this is the first layer like that's layer one weights and then you have to compute layer one and then you have to send it to layer two and so on so you have to send it sequentially through the through the sharding of the model right because you wanna forward propagate through all of the model this is has very very much of a cost of communication you can build very big models but it comes at a cost right at the end you get your y and you calculate your loss and you backprop 
again backwards through the whole thing you can mix them right you can do model and data parallelism so here you can see that the weights so this is this is layer one weights layer two layer three layer four and here again you have layer one layer two layer three layer four and so on um so you can mix the two in that you can have model and data parallelism if both your model and also your data don't fit uh in a single machine and you can see here that the this upper left part receives they receive the same data but this here receives different data right so you split your mini batch into four different parts and you send the first part up here like that's data one you send that up here and that goes through the model in this sequence sequential fashion uh you send data to right to here and so on so we mix the two now in expert and data parallelism that's what they that's what they do in the switch transformer so this here is the switch transformer and this here over here will then that's the switch transformer one trillion so for the one trillion model they actually need to mix all of them but you want to at you know if you can you want to avoid model parallelism model parallelism is really the thing that kills you because of the very high communication cost so in the switch transformer they have expert and data parallelism what does it mean so the top row is how the model weights are split and you can see the weights are split but the different color means that they're different weights so here are weights number one weights two weights three weights four and so on now we've already had this over here right different weights in the model parallelism case were split over different machines however if you look at the data the data is also split and the weights they're not the same and these are exactly these experts so experts this means that you know this piece of data here only goes to this expert and then to the output this piece of data right here only goes to this expert and then to the output right there is no communication between the uh different experts whereas here you have this super high communication okay so you can see you can scale up the experts as you scale up your data as long as each shard of data is routed to only one expert and then of course you can mix the expert model and data parallelism um if you really if not even a single expert fits on a machine right if that's the case you need to again short you do model sharding on the experts all right so the switch transformer as i said this here is the switch transformer that the most of the paper is about and now we can dive into the results the results are pretty spectacular they mostly come compare as i said to t5 base and t5 large and as you can see right here the switch model has significantly more parameters so 7.4 or here 26 a billion parameters compared to not even a billion of t5 large yet the number of flops is matched so they build models where the number of flops for a forward prop is matched but the the number of parameters are higher so you know it is somewhat of a fair comparison right you have the same amount of compute done per forward prop and now we see what does it help to just have raw again in parameters and it turns out it helps a lot you've pro already seen that we get these massive speed ups massive sample efficiencies over a dense model you've probably so this we've looked at exactly in the in the intro they also have benchmarks on oh let's see that's down here they also have benchmarks on 
multilingual um on multi-lingual data set and you can see in every single language the switch transformer gains on the dense transformer by quite a bit so this is in this is log space as you can see and it's quite impressive actually and these gains are in time as well as number of steps um so that's pretty pretty cool so as i as i said uh the the trade-off here of course is that you need more machines you need to actually add more machines and you can see this largest model that they built is this switch xxl which is matched in flops to trans2 to t5xxxl model yet has many more parameters and beats the t5 at log perplexity and in as i understand in downstream tasks by quite a bit they also built this trillion parameter model it is not as good mainly because they as i understand it they just want to get to a trillion parameters and i think i think it's you know training isn't really easy at that size so they scale it down as you can see it has less number of heads less number of layers but the number of experts are way up so that's how they scale to a trillion and the results aren't you know better than the t5x xl um which is impressive given that it has less flops per token however it is still worse than the switch xxl so the trillion parameter model um it's still you know it's still not everything to have a lot of parameters you actually need to do good trade-offs and here they've traded off too many parameters for you know less number of heads and less number of layers and that hurts again so you know very very interesting stuff right here the last thing i want to look at is their tricks for getting this to work so they detail three tricks for getting this to work and they are right here three tricks how they can do this and people before them have said no you need at least two experts otherwise it's instable so they do selective precision with the large sparse models which means that if for some of these computations it you know it it pays off to do them in higher precision you don't want to send around these float32 precision uh things you don't want to send those from machine to machine right so you have your input you have your multi-head attention and then here again this is whatever x prime and then you send that to the experts right here are the different experts and then you send that back and that's why okay now you don't want this here is communication cost if you were to send around float 32 vectors that's a lot of data that you have to transmit so you'd rather send around 16-bit precision right as they do right here um however if you do 16-bit precision your you know the the whole machine learning part doesn't work as well so what they do is they do as soon as it as a as soon as a vector arrives here this is in 16 bit they scale it up they cast it to a 32 bit vector they calculate using the 32-bit vector 32 and then they cast it again to a 16-bit vector to send it back and that seems to work so they do selective selectively casting the precision up and also they do selective dropout that's down here so they do expert dropout which means they don't apply drop out to the whole network uniformly as you would do normally but they say they can do a much larger dropout rate at expert layers and that makes a bit of sense because the expert each expert is only used very sparsely so it makes sense to up their dropout rate because you know in the end you might drop out as much signal from a sparsely used expert if you raise the dropout rate then you do from a densely used layer in a 
with a smaller dropout rate and the last thing is that they simply do better initialization so they find if they scale down the the initial scale of the original transformer by a factor of 10 that leads to a lot more stable training it's astounding that after so many years still something like initialization can you know make or break such a model that is just insane to see there's a lot more to this paper they do a lot of downstream tasks they also talk a lot about you know this is not only this model they do a lot of optimizations under the hood uh they use mesh tensorflow and so on it's clear that a lot of work has gone into this and interestingly enough they can also distill these models so what they can do is they can take this large model and they distill it to a model that is as big as t5 base a dense model so they go from a sparse large model and they distill it into a dense model that is equivalent to t5 and they do outperform t5 if it were trained from scratch and they gain up to something like 30 percent so 30 of the gains they made from here to here they can retain by distilling it down they say they can distill it down way over 90 95 percent of the model which is also pretty interesting and you know pretty cool uh because then you could sort of distribute the trained models around and people could use them all right so that was it for me definitely check out the paper and all the experiments downstream tasks and so on it's very cool paper has a lot of cool experiments there's code at least pseudocode and that was it thank you bye