What are Mixture of Experts? (GPT-4, Mixtral…)

The Transformer Architecture: A Deep Dive into Attention and Mixture of Experts

The Transformer architecture has revolutionized the field of natural language processing (NLP) with its ability to handle long-range dependencies and parallelize computation. At the heart of this architecture lies the attention mechanism, which allows the model to focus on specific tokens or parts of the input sequence when generating an output. Attention, however, is only half of each Transformer block: once context has been blended across tokens, a feed-forward network still has to process every token individually, and that per-token step is where a large share of the compute goes. This is where the concept of mixture of experts comes in.
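To make the attention step concrete before moving on, here is a minimal single-head sketch in PyTorch. It is my own illustration rather than the implementation of any model named in this article, and it omits the causal mask and the learned query/key/value projections that a real decoder-only model would include.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Minimal single-head attention: each token's output is a weighted mix of
    all value vectors, with weights derived from query/key similarity."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (seq, seq) similarity scores
    weights = F.softmax(scores, dim=-1)                      # how much each token attends to the others
    return weights @ v                                       # blend context across tokens

# toy usage: 16 tokens, 64-dimensional representations
x = torch.randn(16, 64)
out = scaled_dot_product_attention(x, x, x)  # same shape as the input, context now mixed in
```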

The idea behind mixture of experts is to replace that single feed-forward network with several smaller ones, called experts, each of which processes input tokens individually. These experts are structurally identical feed-forward networks that take a token's representation and produce a transformed output. The catch is that not every expert sees every token: a small router, or gating network, decides which expert (or experts) each token is sent to. This lets different tokens within the same layer flow through different experts.
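Here is a minimal sketch of such a sparse mixture of experts layer in PyTorch. It illustrates the technique described above rather than Mixtral's actual code; the class names are invented for this example, and the defaults (eight experts, two used per token) simply mirror the figures quoted later in this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A plain feed-forward block; every expert has the same structure."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)

class SparseMoELayer(nn.Module):
    """Routes each token to its top-k experts and mixes their outputs."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)  # the gating network: one score per expert
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> one row per token
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                       # (n_tokens, n_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)  # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)               # renormalise their scores

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # which (token, slot) pairs selected expert e
            token_idx, slot_idx = torch.where(chosen == e)
            if token_idx.numel() == 0:
                continue
            expert_out = expert(tokens[token_idx])
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert_out
        return out.reshape_as(x)
```

The router here is just a linear layer producing one score per expert; it is trained jointly with the experts, which is how it learns where to send each token. Calling `SparseMoELayer(4096, 4 * 4096)` on a `(batch, seq_len, 4096)` tensor returns a tensor of the same shape, with each token having passed through only its two selected experts.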

The benefits of using a mixture of experts are numerous. Firstly, only a subset of the experts is activated for any given token, so the compute per token stays close to that of a much smaller dense model even as the total parameter count grows. This matters for long input sequences, where the feed-forward layers become a significant compute bottleneck. Secondly, because the extra capacity is split across several smaller feed-forward networks, the experts can learn different things and complement one another. Finally, the approach maps well onto available hardware, since different experts can be placed and run on separate GPUs in parallel.

In practice, this approach has delivered significant gains in performance and efficiency. Mixtral stacks multiple Transformer blocks, each containing a mixture of experts layer in place of the usual feed-forward layer, as sketched below. When Mistral analyzed which experts the router actually selects, the assignments appeared random, or at least showed no observable pattern, which might seem counterintuitive at first. The approach nonetheless produces surprisingly good results.
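The sketch below shows what such a block could look like, reusing the `SparseMoELayer` from the earlier snippet. It is again illustrative, not Mixtral's actual architecture; the sizes are placeholders chosen only for the example.

```python
import torch
import torch.nn as nn

# SparseMoELayer is the class defined in the earlier sketch.

class MoETransformerBlock(nn.Module):
    """One decoder block where the usual feed-forward sub-layer is a sparse MoE."""
    def __init__(self, d_model=4096, n_heads=32, n_experts=8, top_k=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = SparseMoELayer(d_model, d_hidden=4 * d_model,
                                  n_experts=n_experts, top_k=top_k)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # attention mixes information *across* tokens...
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        # ...then the routed experts transform each token *individually*
        return x + self.moe(self.norm2(x))
```

A full model simply repeats many such blocks one after the other, which is where the bulk of the parameter count comes from.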

One interesting finding from Mistral's analysis is that the same expert tends to be selected when a new line of generated text begins. This might seem like an unusual observation, but it underlines how opaque the routing behavior of a mixture of experts layer can be: the router learns its own assignment strategy during training rather than the domain specialization one might expect, and shaping that behavior comes down to the design and training of the router or gating network.

The use of mixture of experts also allows the Transformer architecture to scale up without the per-token compute growing in step with the parameter count. Mixtral 8x7B, for instance, has around 47 billion parameters in total rather than the naively expected 8 × 7 = 56 billion, because only the feed-forward layers are replicated across experts. And since only two experts out of eight are used for each token, with their outputs recombined afterwards, only about 13 billion parameters are active at inference time.
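As a quick sanity check on those figures, one can back out a rough split between the shared parameters (attention, embeddings, and so on) and the per-expert parameters using nothing but the totals quoted above. This is a back-of-envelope estimate, not an official breakdown.

```python
# Totals quoted in this article: ~47B parameters overall, ~13B active per token
# when 2 of the 8 experts are used. The derived split is an estimate only.

total_params  = 47e9   # shared layers + all 8 experts
active_params = 13e9   # shared layers + only 2 experts
n_experts     = 8
experts_used  = 2

# total  = shared + 8 * per_expert
# active = shared + 2 * per_expert  ->  subtracting isolates per_expert
per_expert = (total_params - active_params) / (n_experts - experts_used)
shared     = total_params - n_experts * per_expert

print(f"~{per_expert/1e9:.1f}B parameters per expert, ~{shared/1e9:.1f}B shared")
# ~5.7B per expert, ~1.7B shared: the "7B" in the name counts the shared part
# once per expert, which is why 8 x 7B overshoots the real ~47B total.
```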

Finally, it's worth noting that the concept of mixture of experts is not new. It was explored well before the Transformer era, with gated mixtures of experts appearing at least as far back as a 2013 paper discussed below, and it has been used to good effect in various applications. Mixtral builds on that foundation, applying the original idea to the Transformer architecture, and it will likely inspire further exploration of this promising approach to pushing the boundaries of what is possible with neural networks.

Hospital Analogy for Understanding Mixture of Experts

To illustrate the concept of mixture of experts in a more concrete way, let's consider a hospital analogy. Imagine a hospital with various specialized departments, each staffed by experts in a particular field. Patients or input tokens are directed to the appropriate department based on their symptoms, which is equivalent to the router or gating network selecting the expert for each token.

Just as not all departments are involved in treating every patient, not all experts of the mixture of experts are used for every input that comes through the system. Unlike hospital departments, however, the experts in Mixtral do not specialize: the routing of tokens shows no observable pattern or focus on a specific domain. Even so, stacking these expert layers together with their router networks achieves surprisingly good results.

In Mistral's analysis, tokens from eight kinds of data, such as code, mathematics, different languages, etc., were routed across the eight experts with no apparent specialization or relevance to specific tasks, yet the model still gains in performance and efficiency. This is a departure from the traditional expectation that experts should be highly specialized: the benefit comes from having more parameters that are used more efficiently, not from domain expertise.

The mixture of experts approach has its roots in earlier research: a 2013 paper, with an author you might recognize, already developed the idea of combining experts with a gating mechanism. Mixtral takes this basic concept, applies it to the Transformer architecture, and scales it up into a more efficient and effective model.

Mixtral, then, is not a combination of eight separate models; it is a single model in which only the feed-forward layers are replicated as experts, an approach that reframes how capacity is added to deep learning models. By using only two experts out of eight for each token, it keeps the number of active parameters far below that of a dense Transformer of comparable total size.


"WEBVTTKind: captionsLanguage: enwhat you know about mixture of experts is wrong we are not using this technique because each model is an expert on a specific Topic in fact each of these so-called experts is not an individual model but something much simpler we can now assume that the rumor of gp4 having 1.8 trillion parameters is true the latest the state-ofthe-art open AI model is approximately 1.8 trillion parameters 1.8 trillion is 1,8 800 billion which is 1.8 million million if you could find someone to process each of these parameters in 1 second which would basically be to ask you to do a complex multiplication with values like these each second it will take the person 57,000 years again assuming you can do that in 1 second if we all do this together calculating one parameter per second with 8 billion people on Earth we could achieve this in 2.6 days yetep Transformer models do this in milliseconds this is thanks to a lot of engineering including what we call a mixture of experts unfortunately we don't have much detail on gp4 and how open AI built it but we can dive more into a very similar and nearly as powerful model by mistol called mixol 8 * 7B by the way if you don't know about mistol yet you should definitely consider following their work mol is a French startup building state-of-the-art language models and and they are quite promising and actually quite open to sharing their advances compared to some other well-known companies and I'm not even sponsored by them and if keeping up with all those different companies and research seems hard well you can easily do that staying up to date with all these new advancements by subscribing to the channel or my newsletter linked below but what exactly is a mixture of experts as I said it's not multiple experts as most people say even though the model is called mix trol 8 * 7 B it doesn't mean 8 times a 7 billion parameter model and likewise for gp4 even though we assume it has 1.8 trillion parameters which has never been actually confirmed by open AI there are no eight Experts of 225 billion parameters it's actually all just a single model to better understand that we need to go into what makes Transformer models work even though you've probably seen this image a lot what we actually use is something much more like this a decoder only Transformer this means that the model tries to predict the next token or next word of a sentence you send as the input promt it does that word by word or token by token to construct a sentence that statistically makes sense the most based on what it has seen during its training now let's dive into the most important parts first obviously you'll have your text and need to get your embeddings which are just numbers that the model can understand you can see this as a large list of around a thousand values representing various attributes about what your input sentence or word means one could be how big it is and the other could be its color another could be if it can be eaten or not just various attributes that the embedding model learns by itself to represent our world with just one or 2, numerical values this is done for each token which is a piece of text part of code part of an image or whatever transformed into this list of numbers but this information is just numbers in a large list we just lost all our contextual information we just have a bunch of words represented in numbers so we need to add some positional information basically just syntactic information to help better understand the sentence or text 
sent showing globally and locally where each word is so each token ends up being represented by even more values inside the network it's really not that efficient compared to directly understanding language in the mial case each list of the tokens has 4,096 values it's already quite big and we send many of those at the same time we now have all our text correctly represented into many list of these 4,000 numbers now what does a model like gp4 or mixol in this case do with all that it does two things understands it and then repeat this process many times and it's all done inside one essential part the Transformer block which was introduced in the famous paper attention is all you need inside this block we have the two crucial components of all those models like gp4 gemini or mixol an attention step and a feed forward step both have their respective role the attention mechanism is used to understand the context of the input tokens how they fit together understand what's all that simply putut we have our many tokens that each are a list of these 4,000 numbers already the attention mechanism transforms our list of numbers by basically merging parts of all our current lists together and learning the best combination possible to understand it you can see this as reorganizing the information so that it makes sense for its own brain if we can call this a brain what the model learns when we say it is training is where to put which numbers for the next step giving less importance to the useless tokens and more to useful ones just like when meeting a new person You' idly give more importance to their own name and less on what they said first whether it was hi welcome or hello remembering the names is more important than which synonym they use even though my own brain doesn't agree here attention does the same simply learning what to give more importance to through many examples which is basically through seeing the whole internet this attention mechanism has made a lot of noise since the paper attention is all you need in 2017 and for a good reason you basically only need this to understand context still you need something else to end up with those huge powerful models of billions and trillions of parameters these Transformer models are that big because they do one thing they stack these Transformer blocks one on top of the other but right now what we've seen is an attention step blending content into a new form it helps for understanding the context but now we lost our knowledge for each token themselves to fix for that we need some kind of function that can process each of these new transformed token to help the model better understand the specific part of the information the local information this is called a feed forward Network or multiple layer perception it's the same thing but the name isn't important what's important is that it uses the same function or network that is similar to attention but for one specific token individually to go through all tokens one by one to understand it and transform it for the next step here by Next Step I mean going deeper into the network going to the next Transformer block processing the information further and further basically we mean to send it into the next attention layer it's just like what our brain does with information entering our ear or eyes until it gets understood and we generate an answer whether it be answering or acting we process information and transform it into a new form Transformers do the same fortunately we don't have to wait for each 
token to be processed one by one we can do that in parallel still it becomes a big compute bottleneck because we need to work with large amounts of numbers in parallel this is where the mixture of experts and even more specifically the sparse mixture of experts comes in our experts here are basically different feedforward networks instead of just one that's it this means they can be smaller and more efficient feed forward layers and run on different gpus in parallel yet have even more parameters in total in the mix case it's eight feet forward layer so it even allows for the eight experts to learn different things and complement each other only benefits as I said in the case of mixol to make it work we simply add yet another mini Network called a router or gating Network where its only job is to learn which expert it should send each token so a mixture of experts layer replaces only our feedforward layer by eight of them this is why it's not really eight models but rather eight times this specific part of the Transformer architecture and this is all to make it more efficient one last part I mention is that we use sparse mixture of experts being sparse just means that most values processed are set to zero so that we can r Lo the computation in this case mol decided to go with using only two experts out of the eight for each token they determined through experimentation that this was the best combination for results and efficiency so the router basically sends each token to two experts and recombine everything right after again simply to make things more efficient I want to share a great analogy for understanding this process from Gregory Z on medium consider a hospital with various specialized departments which are our experts each patient or here input token is directed to the appropriate Department by the reception which is our router or gating network based on their symptoms which are our list of numbers just as not all departments are involved in treating every patient not all Experts of a mixture of experts are used for every input that's it we simply stack these Transformer blocks together and we end up with a trillion parameter super powerful model called gp4 or mixture all 8 * 7B in this case and here the real number of parameters isn't 8 * 7 or 56 billion parameters it's actually smaller around 47 billion since it's only a part of the Network that has these multiple experts as we've seen and also we only need two experts at a time for a token transformation leading to around 13 billion active parameter when we actually use it at inference time so around a quarter of the total count only now why did I start the video saying they were not really experts because these eight experts actually are no expert at all MCH studied them and concluded that the router sending the tokens to these quote unquote experts did that pretty randomly or at least with no observable pattern here we see our eight experts and eight kind of data whether it be code mathematics different languages Etc and they are unfortunately clearly randomly distributed no expert focused on math or on code they all helped a bit for everything so adding those quote unquote experts help but not in the expected way it helps because there are more parameters and we can use them more efficiently the interesting thing they found is that the same expert seem to be used when starting a new line generation which is quite interesting but not that useful as a conclusion for an analysis by the way the mixture of expert approach is 
nothing new as with most techniques we do in AI this one comes from a while ago for instance this is a 2013 page with an author you should recognize involved at openi which developed on the existing ID of mixture of experts working with such a gating mechanism we just took this ID to Transformers and scaled things up as we always do and voila of course the overall Transformer architecture contains many more important components and is a bit more complicated than what I showed here but I hope that this mixture of expert thing is a bit more clear now and that it broke some beliefs about those being real experts and I especially hope to not see yet another quick calculation multiplying 8 by seven to find out the total amount of parameters for a model thank you for watching the whole video and I will see you in the next one with more AI explainedwhat you know about mixture of experts is wrong we are not using this technique because each model is an expert on a specific Topic in fact each of these so-called experts is not an individual model but something much simpler we can now assume that the rumor of gp4 having 1.8 trillion parameters is true the latest the state-ofthe-art open AI model is approximately 1.8 trillion parameters 1.8 trillion is 1,8 800 billion which is 1.8 million million if you could find someone to process each of these parameters in 1 second which would basically be to ask you to do a complex multiplication with values like these each second it will take the person 57,000 years again assuming you can do that in 1 second if we all do this together calculating one parameter per second with 8 billion people on Earth we could achieve this in 2.6 days yetep Transformer models do this in milliseconds this is thanks to a lot of engineering including what we call a mixture of experts unfortunately we don't have much detail on gp4 and how open AI built it but we can dive more into a very similar and nearly as powerful model by mistol called mixol 8 * 7B by the way if you don't know about mistol yet you should definitely consider following their work mol is a French startup building state-of-the-art language models and and they are quite promising and actually quite open to sharing their advances compared to some other well-known companies and I'm not even sponsored by them and if keeping up with all those different companies and research seems hard well you can easily do that staying up to date with all these new advancements by subscribing to the channel or my newsletter linked below but what exactly is a mixture of experts as I said it's not multiple experts as most people say even though the model is called mix trol 8 * 7 B it doesn't mean 8 times a 7 billion parameter model and likewise for gp4 even though we assume it has 1.8 trillion parameters which has never been actually confirmed by open AI there are no eight Experts of 225 billion parameters it's actually all just a single model to better understand that we need to go into what makes Transformer models work even though you've probably seen this image a lot what we actually use is something much more like this a decoder only Transformer this means that the model tries to predict the next token or next word of a sentence you send as the input promt it does that word by word or token by token to construct a sentence that statistically makes sense the most based on what it has seen during its training now let's dive into the most important parts first obviously you'll have your text and need to get your embeddings which are 