Beam Search: A Method for Language Generation and Translation
Beam search is a method used in language generation and translation to find the most likely output sentence given an input sentence. This technique is widely used in machine learning models, particularly in neural machine translation systems.
When using beam search, we first consider all possible choices for the second word in the output English translation. We use this information to calculate the probability of the second word given the input French sentence X and the first word, which was determined by a network fragment that takes into account the vocabulary size. The resulting probabilities are then used to evaluate all possible options for the second word.
In this stage of beam search, we consider three possibilities: the most likely option and two other alternatives. We use the same method to calculate the probability of these alternatives given the input sentence X and the first word. This process continues until we reach the desired length of the output sentence.
For example, let's say that the second word was Jane and we want to find the most likely options for the third word. We would repeat the process used in the previous stage, but with the added restriction that the third word must be chosen from a subset of words that can follow the first two words.
One important aspect of beam search is the choice of beam width, which determines the number of possibilities to consider at each stage. A beam width of 3 means that we consider only three possibilities for the next word, whereas a wider beam width would consider more options.
In this example, if the beam width was equal to one, it would essentially become the greedy search algorithm, which is not as effective in finding the most likely output sentence. However, by considering multiple possibilities at once, beam search can often find better solutions.
Let's say that we are given a French sentence X and want to translate it into English using beam search with a beam width of 3. We use a network fragment that takes into account the vocabulary size to calculate the probability of each word in the output translation. This process continues until we reach the desired length of the output sentence.
Once we have calculated the probabilities for all possible options, we need to decide which ones to keep and which ones to discard. This is where the beam width comes into play again. We pick the top three possibilities at each stage, keeping track of these three choices throughout the process.
Now let's say that after the first two stages of beam search, we have determined that the most likely options for the first two words are in September and Jane visits. We save the probability distribution over all possible choices for the second word given the input sentence X and the first word. This means that we keep track of how likely each possible choice was at this stage.
We then move on to the third step of beam search, where we consider what is the most likely option for the third word. We use a network fragment that takes into account the probability distribution over all possible choices for the second word and the first two words to calculate the probability of the third word given the input sentence X. This process continues until we reach the desired length of the output sentence.
One important aspect of beam search is its ability to handle long-range dependencies in language. While it may seem counterintuitive, beam search can often find better solutions than greedy search algorithms when dealing with long-range dependencies in language. By considering multiple possibilities at once and keeping track of these possibilities throughout the process, beam search can capture complex patterns in language that might be missed by greedy search algorithms.
In conclusion, beam search is a powerful method for language generation and translation that can often find better solutions than greedy search algorithms. Its ability to handle long-range dependencies in language makes it an attractive option for many applications.
"WEBVTTKind: captionsLanguage: enin this video you learn about the beam search algorithm in the last video you remember how for machine translation given an input French sentence you don't want to output a random English translation you want output the best the most likely English translation the same is also true for speech recognition where given an input audio clip you don't want to output a random text transcript of that audio you an output the best maybe the most likely text transcript beam search is the most widely used algorithm to do this and in this video you see how to get beam search to work for yourself let's describe beam search using our running example of the French sentence Jane's is equally free concept on hopefully being translated into Jane visits Africa in September the first thing beam search has to do is try to pick the first word of the English translation that's going to output so here I've listed I'll say ten thousand words in the vocabulary and to simplify the problem a bit I'm going to ignore capitalization so I'm just listing all the words in lowercase so in the first step of beam search or use this network fragment with the encoder shown in green and the decode is shown in purple to try to evaluate what is the probability of that first word so what's the probability of the first output Y given the input sentence x given the French input so we're as greedy search will pick only the one most likely words and move on beam search instead can consider multiple alternatives so the beam search algorithm has a parameter called B which is called the beam width and for this example I'm gonna set the beam width to be equal to three and what this means this beam search because they're not just one possibility but consider three at a time so in particular let's say evaluating this probability over different choices the first word it finds that the choices in Jane and September are the most likely three possibilities for the first words in the English output then beam search will store away in computer memory that it wants to try out all three of these words and if the beam with parameter was said differently if the beam width parameter was say 10 then we keep track of not just three but of the ten most likely possible choices for the first word so to be clear in order to perform this first step of beam search what you need to do is run the input French sentence through this encoder network and then this first step of the decoder network if this is a soft max output over all 10,000 possibilities then you would take those 10,000 possible outputs and keep in memory which were the top of the week let's go on to the second step of beam search having picked in Jane and September as the three most likely choice of the first word what beam search will do now is for each of these three choices consider what should be the second word so after in maybe the second word is a or maybe it's Erin I'm just listing words from the faux caviar from the dictionary or somewhere down to this will be September somewhere down the list is visit and all the way to Z maybe the last word is Zulu so to evaluate the probability is second word it will use this neural network fragments where there's the encoder and green and for the decoder portion when trying to decide what comes off the in remember the decoder first outputs why that's one so we're going to set to this y hat one to the word in and feed this back in so this is the words in because it decided for now that because we're trying to figure out it the first word was in what is the second word and then this will output yes Y hat too and so by hardwiring y hat one here really the input here to be the first word in this network fragment can be used to evaluate what is the probability of the second word given the input French sentence and that the first word of the translation has been the word it now notice that what we ultimately care about in this second step of being search is to find the pair of the first and second words that is most likely so not just a sec where this most likely but the pair of the first and second rows it most likely and by the rules of conditional probability this can be expressed as P of the first word times P of probability of the second word right which you're getting from this new network fragment and so if for each of the three words you've chosen in Jane and September you've saved away this probability then you can multiply them by this second probability to get the probability of the first and second words so now you've seen how if the first word was in how you can evaluate the probability of the second word now if the first word was Jane you'd do the same thing the sentence could be Jane a Jane Erin and so on down to Jane is Jane visits and so on and you will use this neural network fragment and let me draw this in as well we're here you would Hotwire Y had one to eat Jane and so with the first word y1 hat hardwired as Jane then this new network fragments can tell you was it probably if the second words give me input X and given that the first word was Jane and then same as above you can multiply with P of y1 to get the probability of y 1 and y 2 for each of these 10,000 different possible choices for the second word and then finally you do the same thing for September all the words from a down to Zulu and use this network fragment I just told us in as well so see if the first word was September what weather the most likely options for the second word so for this second step of beam search because we're continuing to use a beam width of three and because there are 10,000 words in the vocabulary you'd end up considering three times ten thousand or thirty thousand possibilities because there are ten thousand here ten thousand year ten thousand years isn't being worth been since the beam width times the number of words in the vocabulary and what you do is you evaluate all of these thirty thousand options according to the probability of the first and second word and then pick the top three so we're gonna cut down these thirty thousand possibilities down to three year getting down to the beam width parameter again and so let's say that these thirty thousand choices the most likely were in September and say Jane is and Jane visits sorry this bit messy but those are the most likely three out of the thirty thousand choices then that's what beam search would memorize away and take on to the next step of beam search so notice one thing if beam search decides that the three most likely choices for the first and second words are in September or Jane is or Jane visits then what that means is that it is now rejecting September as a candidate for the first word of the output English translation so we're now down to two possibilities for the first word but we still have a beam worth of three keeping track of three choices for pairs of y1 y2 before going on to the third step of beam search just wanna notice that because the beam width is equal to three at every step you instantiate three copies of the network to evaluate these partial sentence fragments in the output and it's because the beam width is equal to three that you have three copies of the network with different choices for the first word but these three copies of the network can be very efficiently used to evaluate all 30,000 options for the second word so just don't instantiate 30,000 copies in the network you need only three copies of the network to very quickly evaluate all 10,000 possible outputs that that softmax outputs a 4y 2 let's just quickly illustrate one more step of beam search so said that the most likely choices for first two words were in September genius and Jane visits and for each of these pairs of words we should have saved the way in computer memory the probability of y 1 and y 2 given the input next given the French sentence X so similar to before we now want to consider what is the third word so is it in September a in September Aaron all the way down to is in September Zulu and to evaluate possible choices for the third word you use this network fragment where you hardwire the first word here to be in the second word to be September and so this network fragment allows you to evaluate what's the probability of the third word given the input French sentence X and given that the first two words are in September in the English output and then you do the same thing for the second fragment so like so and same thing for Jane visits and so being searchable then once again pick the top three possibilities maybe things in September Jane is a likely outcome or Jane is visiting as likely or maybe Jane versus Africa is likely for the first three words and then it keeps going and then you go on to the fourth step of beam search that had one more word and on it girls and the outcome of this process hopefully will be that adding one word at a time that beam search will decide that Jane versus African September will be terminated by the end of and symbol using that in your system which is quite common that will find that this is a likely output English sentence and you see more details of this for yourself in this week's spring exercise as well where you get to play with beam search yourself so with a beam worth of three beam search considers three possibilities at a time notice that um if the beam wave was said to be equal to one so considers only one then this essentially becomes the greedy search algorithm which we had discussed in the last video but by considering multiple possibilities say three or ten or some of the number at the same time beam search will usually find a much better output sentenced and greedy search you've now seen how beam search works but it turns out there's some additional tips and tricks and refinements that helps you to make beam search work even better let's go onto the next video to take a lookin this video you learn about the beam search algorithm in the last video you remember how for machine translation given an input French sentence you don't want to output a random English translation you want output the best the most likely English translation the same is also true for speech recognition where given an input audio clip you don't want to output a random text transcript of that audio you an output the best maybe the most likely text transcript beam search is the most widely used algorithm to do this and in this video you see how to get beam search to work for yourself let's describe beam search using our running example of the French sentence Jane's is equally free concept on hopefully being translated into Jane visits Africa in September the first thing beam search has to do is try to pick the first word of the English translation that's going to output so here I've listed I'll say ten thousand words in the vocabulary and to simplify the problem a bit I'm going to ignore capitalization so I'm just listing all the words in lowercase so in the first step of beam search or use this network fragment with the encoder shown in green and the decode is shown in purple to try to evaluate what is the probability of that first word so what's the probability of the first output Y given the input sentence x given the French input so we're as greedy search will pick only the one most likely words and move on beam search instead can consider multiple alternatives so the beam search algorithm has a parameter called B which is called the beam width and for this example I'm gonna set the beam width to be equal to three and what this means this beam search because they're not just one possibility but consider three at a time so in particular let's say evaluating this probability over different choices the first word it finds that the choices in Jane and September are the most likely three possibilities for the first words in the English output then beam search will store away in computer memory that it wants to try out all three of these words and if the beam with parameter was said differently if the beam width parameter was say 10 then we keep track of not just three but of the ten most likely possible choices for the first word so to be clear in order to perform this first step of beam search what you need to do is run the input French sentence through this encoder network and then this first step of the decoder network if this is a soft max output over all 10,000 possibilities then you would take those 10,000 possible outputs and keep in memory which were the top of the week let's go on to the second step of beam search having picked in Jane and September as the three most likely choice of the first word what beam search will do now is for each of these three choices consider what should be the second word so after in maybe the second word is a or maybe it's Erin I'm just listing words from the faux caviar from the dictionary or somewhere down to this will be September somewhere down the list is visit and all the way to Z maybe the last word is Zulu so to evaluate the probability is second word it will use this neural network fragments where there's the encoder and green and for the decoder portion when trying to decide what comes off the in remember the decoder first outputs why that's one so we're going to set to this y hat one to the word in and feed this back in so this is the words in because it decided for now that because we're trying to figure out it the first word was in what is the second word and then this will output yes Y hat too and so by hardwiring y hat one here really the input here to be the first word in this network fragment can be used to evaluate what is the probability of the second word given the input French sentence and that the first word of the translation has been the word it now notice that what we ultimately care about in this second step of being search is to find the pair of the first and second words that is most likely so not just a sec where this most likely but the pair of the first and second rows it most likely and by the rules of conditional probability this can be expressed as P of the first word times P of probability of the second word right which you're getting from this new network fragment and so if for each of the three words you've chosen in Jane and September you've saved away this probability then you can multiply them by this second probability to get the probability of the first and second words so now you've seen how if the first word was in how you can evaluate the probability of the second word now if the first word was Jane you'd do the same thing the sentence could be Jane a Jane Erin and so on down to Jane is Jane visits and so on and you will use this neural network fragment and let me draw this in as well we're here you would Hotwire Y had one to eat Jane and so with the first word y1 hat hardwired as Jane then this new network fragments can tell you was it probably if the second words give me input X and given that the first word was Jane and then same as above you can multiply with P of y1 to get the probability of y 1 and y 2 for each of these 10,000 different possible choices for the second word and then finally you do the same thing for September all the words from a down to Zulu and use this network fragment I just told us in as well so see if the first word was September what weather the most likely options for the second word so for this second step of beam search because we're continuing to use a beam width of three and because there are 10,000 words in the vocabulary you'd end up considering three times ten thousand or thirty thousand possibilities because there are ten thousand here ten thousand year ten thousand years isn't being worth been since the beam width times the number of words in the vocabulary and what you do is you evaluate all of these thirty thousand options according to the probability of the first and second word and then pick the top three so we're gonna cut down these thirty thousand possibilities down to three year getting down to the beam width parameter again and so let's say that these thirty thousand choices the most likely were in September and say Jane is and Jane visits sorry this bit messy but those are the most likely three out of the thirty thousand choices then that's what beam search would memorize away and take on to the next step of beam search so notice one thing if beam search decides that the three most likely choices for the first and second words are in September or Jane is or Jane visits then what that means is that it is now rejecting September as a candidate for the first word of the output English translation so we're now down to two possibilities for the first word but we still have a beam worth of three keeping track of three choices for pairs of y1 y2 before going on to the third step of beam search just wanna notice that because the beam width is equal to three at every step you instantiate three copies of the network to evaluate these partial sentence fragments in the output and it's because the beam width is equal to three that you have three copies of the network with different choices for the first word but these three copies of the network can be very efficiently used to evaluate all 30,000 options for the second word so just don't instantiate 30,000 copies in the network you need only three copies of the network to very quickly evaluate all 10,000 possible outputs that that softmax outputs a 4y 2 let's just quickly illustrate one more step of beam search so said that the most likely choices for first two words were in September genius and Jane visits and for each of these pairs of words we should have saved the way in computer memory the probability of y 1 and y 2 given the input next given the French sentence X so similar to before we now want to consider what is the third word so is it in September a in September Aaron all the way down to is in September Zulu and to evaluate possible choices for the third word you use this network fragment where you hardwire the first word here to be in the second word to be September and so this network fragment allows you to evaluate what's the probability of the third word given the input French sentence X and given that the first two words are in September in the English output and then you do the same thing for the second fragment so like so and same thing for Jane visits and so being searchable then once again pick the top three possibilities maybe things in September Jane is a likely outcome or Jane is visiting as likely or maybe Jane versus Africa is likely for the first three words and then it keeps going and then you go on to the fourth step of beam search that had one more word and on it girls and the outcome of this process hopefully will be that adding one word at a time that beam search will decide that Jane versus African September will be terminated by the end of and symbol using that in your system which is quite common that will find that this is a likely output English sentence and you see more details of this for yourself in this week's spring exercise as well where you get to play with beam search yourself so with a beam worth of three beam search considers three possibilities at a time notice that um if the beam wave was said to be equal to one so considers only one then this essentially becomes the greedy search algorithm which we had discussed in the last video but by considering multiple possibilities say three or ten or some of the number at the same time beam search will usually find a much better output sentenced and greedy search you've now seen how beam search works but it turns out there's some additional tips and tricks and refinements that helps you to make beam search work even better let's go onto the next video to take a look\n"