TWiML x CS224n Study Group - Lesson 11

A Systematic Approach to Understanding Long Short-Term Memory (LSTM) Networks

In this article, we take a close look at Long Short-Term Memory (LSTM) networks, a type of recurrent neural network (RNN) designed to learn long-term dependencies in sequential data. Our goal is to give a clear picture of how LSTM networks work and where they are used.

The Basics of LSTM Networks

LSTM networks are built around memory cells. A memory cell carries information across time steps, which is what allows the network to learn long-term dependencies in sequential data. Access to the cell is controlled by three gates: the input gate, the forget gate, and the output gate.

The input gate determines what new information is written into the memory cell at each time step. The forget gate decides how much of the existing cell contents to discard. The output gate controls how much of the updated cell state is exposed as the hidden state that is passed on to the next time step and to any output layers. Together, these gates let the network selectively retain some aspects of the input sequence while discarding others.

Computations Inside an LSTM Cell

To understand how an LSTM cell works, let's take a closer look at the computations involved. At each time step, the current input and the previous hidden state are combined (typically concatenated) and fed through each of the three gates, which are small sigmoid layers producing values between 0 and 1: a value near 0 means "block this" and a value near 1 means "let this through". The standard update equations are summarized below.
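The following is the usual textbook formulation (notation: x_t is the input at time t, h_{t-1} the previous hidden state, c_{t-1} the previous cell state, sigma the logistic sigmoid, and ⊙ the element-wise product; the weight matrices W and biases b are the learned parameters):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{candidate cell state} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{new cell state} \\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh(c_t) && \text{new hidden state}
\end{aligned}
```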

Next comes the forget gate. Its output is multiplied element-wise with the previous cell state, scaling down whatever the network has decided is no longer relevant. In parallel, a candidate update is computed from the combined input through a tanh layer, scaled by the input gate, and added to that product. The sum of these two terms is the new cell state: a weighted blend of retained memory and new information.

The final step produces the new hidden state. The updated cell state is passed through a tanh nonlinearity and multiplied by the output gate, which determines how much of the cell's contents is exposed at this time step. This hidden state is handed to the next time step and, where needed, to the output layer.
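To make the data flow concrete, here is a minimal NumPy sketch of a single LSTM step, written directly from the equations above. The parameter names, dimensions, and random toy inputs are illustrative assumptions, not taken from any particular library or reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, h_prev, c_prev, params):
    """One LSTM time step. x_t: (input_dim,), h_prev and c_prev: (hidden_dim,)."""
    # Concatenate the previous hidden state and the current input, as in the equations above.
    z = np.concatenate([h_prev, x_t])

    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                    # keep some old memory, add some new
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                              # new hidden state
    return h_t, c_t

# Toy usage with random parameters (hidden_dim=4, input_dim=3).
rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 3
params = {}
for name in ["f", "i", "c", "o"]:
    params[f"W_{name}"] = rng.normal(size=(hidden_dim, hidden_dim + input_dim)) * 0.1
    params[f"b_{name}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):  # a sequence of 5 inputs
    h, c = lstm_cell_forward(x, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```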

Variants of LSTM Networks

While the basic architecture of an LSTM network remains the same, several variants have been developed to improve its performance. One popular variant adds so-called peephole connections, in which the gate layers are allowed to look at the cell state in addition to the input and the previous hidden state, giving the gates more direct information about what the cell currently stores.

Another widely used variant is the Gated Recurrent Unit (GRU). The GRU drops the separate cell state and merges the forget and input gates into a single update gate, so the hidden state itself is partially overwritten at each step. With fewer gates and fewer parameters, GRUs are often cheaper to train while performing comparably to LSTMs on many tasks; a short usage sketch follows below.
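As one practical illustration (a minimal PyTorch sketch; the layer sizes and tensor shapes are arbitrary choices for the example), both models are available as drop-in modules; the visible difference is that the LSTM carries a cell state alongside the hidden state, while the GRU returns only a hidden state:

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 8, 20, 32, 64
x = torch.randn(batch, seq_len, input_size)

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

# The LSTM keeps a separate cell state alongside the hidden state...
lstm_out, (h_n, c_n) = lstm(x)
# ...while the GRU folds everything into a single hidden state.
gru_out, gru_h_n = gru(x)

print(lstm_out.shape, h_n.shape, c_n.shape)  # [8, 20, 64] [1, 8, 64] [1, 8, 64]
print(gru_out.shape, gru_h_n.shape)          # [8, 20, 64] [1, 8, 64]
```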

Backpropagation in LSTM Networks

LSTM networks are trained with backpropagation through time (BPTT): the network is unrolled over the sequence, errors are propagated from the output layer back to the earliest time steps, and the gates modulate how much gradient flows along each path.

The chain rule is used to compute the gradient of the loss function with respect to each weight, and an optimizer then uses these gradients to update the weights. In a plain RNN, the gradient reaching early time steps is a product of many small terms, so it tends to shrink toward zero (the vanishing gradient problem) or occasionally blow up (exploding gradients). The LSTM's additive cell-state update gives gradients a more direct path backward through time, which is a key reason it handles long-term dependencies better.
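As a rough illustration of the training loop (a hypothetical PyTorch sketch with made-up sizes and data, not a reference implementation): the forward pass unrolls the LSTM over the sequence, loss.backward() applies the chain rule through every time step, and the optimizer uses the resulting gradients to update the weights. The clipping line is an optional guard against exploding gradients.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Toy LSTM classifier; sizes are illustrative assumptions."""
    def __init__(self, input_size=16, hidden_size=32, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)        # h_n: final hidden state, shape [1, batch, hidden]
        return self.head(h_n.squeeze(0))  # one set of logits per sequence

model = SequenceClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 20, 16)     # batch of 8 sequences, 20 steps each (toy data)
y = torch.randint(0, 2, (8,))  # toy labels

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()                # backpropagation through time via autograd
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # optional: tame exploding gradients
optimizer.step()               # gradient-based weight update
```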

Applications and Future Directions

LSTM networks have been widely adopted in various applications, including speech recognition, machine translation, and natural language processing. They have also been used for time series prediction and forecasting.

While LSTM networks have shown remarkable success in these applications, researchers are still exploring ways to improve their performance. One area of ongoing research is the development of more efficient algorithms for training LSTM networks, as well as the exploration of new architectures that can take advantage of their strengths.

LSTMs and Transformers: The Current State of the Art

LSTM networks have not been entirely displaced by transformers. Transformers now dominate many natural language processing tasks, but LSTMs remain in use for several reasons, including:

* Complexity: Transformer models are typically much larger than LSTM models and require considerably more computation, data, and tuning expertise to train well.

* Training stability: Transformers can be sensitive to choices such as learning-rate warm-up and initialization, whereas LSTMs are often straightforward to train on modest hardware.

* Performance: Transformers are not automatically the best choice for every task; on small datasets, in streaming or low-latency settings, and for some time-series problems, recurrent models can still be competitive.

That being said, researchers are actively exploring ways to make transformers more efficient, as well as hybrid architectures that combine the strengths of both LSTMs and transformers.

Conclusion

In conclusion, this article has provided a comprehensive overview of Long Short-Term Memory (LSTM) networks, including their architecture, computations, and applications. We have also discussed variants of LSTM networks and backpropagation in these networks. Additionally, we have touched upon the current state of the art in natural language processing and machine learning, highlighting the ongoing research and development in this field.

We hope that this article has provided a valuable resource for anyone looking to learn more about LSTM networks and their applications. Whether you are a researcher or a practitioner, understanding how these networks work and their strengths is essential for making informed decisions when selecting models for your projects.


"WEBVTTKind: captionsLanguage: enokay so hi again today we will be discussing lecture 7 this lecture is mainly about two things one is the vanishing gradient problem and once we're done with that we'll be discussing a few variants of recurrent neural networks I I was not able to host the the last weekend meter so in the in the meter before that we went through lecture 6 in which we discussed recurrent neural networks so this lecture will be a continuation for that so this is what will be we will be doing in this lecture we will be talking about the problems that we have with recurrent neural networks and how to fix them that is how to fix vanishing gradient and then we will be going through some some recurrent neural network variants like LST ins and ER use so besides vanishing gradient we have another problem which is exploding gradient as well which is like the opposite of vanishing gradient problem will be really briefly looking into that as well so you if you have gone through the lecture or attended the previous previous Meetup you must be familiar with this sort of structure so this is nothing but this is how a simple recurrent neural network structure looks like each rectangular red box that you see is is a cell or a hidden state here we only have one output in this figure here they are only indicating the output the last recurrent unit depending on the problem that you are trying to solve you might need an output state for every time stamp or you might only you might ask do it with just one to give an example for an example when when the use case is something like sentiment analysis you wouldn't really need an output state for every input state it's because your neural network or your model has to go through the entire entire sentence or example and then say whether it's positive or negative now let's compare comparing that against another example like let's say translation in that case maybe we might need one one output cell for every input cell even in that case we might that that again could be done in two ways where you have one every word which is being generated on the fly for every input word like for every input word in English let's say your model is model has to generate it's it's equivalent in some other language like German so that's that is one example so what exactly is vanishing gradient here using the same previous example we have four hidden states here and we hear j j po theta is nothing but the laws that is calculated for this hidden state and now we have to back propagate all the learning happens to a back propagation here W represents the weight vectors so the the same letter is used for every on top of every arrow this indicates that the same weight matrix is being used here unlike the previous examples or when it comes to images and other other kind of data like the regular convolution neural networks where the weight matrix changes from one layer to another here in recurrent neural networks we use the same way with which we move formally called shared weights so how how does the weight of predation work here so what we have to do is we have all these weights here and depending on the loss that is calculated we have we have to back propagate the loss and somehow update each of these weight matrices each somehow back propagate the error to all of the parameters of the network so for example as we can see here we have the partial derivative of j4 with respect to H 1 which means we are trying to calculate here what we are trying to do here is how is the error of 
this network being influenced by H 1 similarly we have to do that for h2 for h3 and h4 what we have here is all of the each of these components in purple that we have here are the things that that that form that apart of calculating the container videos so what we can see here is that this again is chain rule something it's something which we discussed in detail in one in one of our previous meetups so now what we want to do here is to calculate the gradients for each of these components so in order to we can see here that h2 is dependent is kind of dependent on h1 because as we know in recurrent neural networks for every time stamp we are not only going to take the input for that specific time stamp but we are also going to use the output from the previous previous unit as well which in this case our h1 h2 h3 and h4 so here when we are trying to calculate we know that h2 is dependent on h1 so there are number of ways to understand chain rule but in this case but to put it simply a we know that h2 is dependent on h1 and that's what we have here the partial derivative of H 2 with respect to h1 and then we know that h3 is dependent on h2 that and that that's the term that we have here and the same thing goes for h4 and finally the the loss function J theta for this fourth hidden state is dependent on h4 because the output of h4 is going to be passed through a softmax function and the error is calculated here so now putting all these things together we'll get our our gradients and the way they're calculated is one important thing that you have to note officially is that all of these individual terms are being multiplied which means if these each of these gradient terms are small in magnitude then multiplying all of them together is going to give us a value which is much much smaller so that's what causes vanishing gradient and that's what it's explained here as well so when when each of these terms are small the gradients signal then gives back propagated it gets gets smaller and smaller as the back propagation progresses so that's the problem that we have with vanishing gradient vanishing gradient and it's not just it's not something that happens with recurrent neural networks alone it's the problem that that could happen with other types of architectures as well we have the problem with the more regular cnn's as well which is why in resinates we have steep collections even when that is discussed in the later part of the lecture so the these these is more of a mathematical representation HT is nothing but sigmoid is the activation function that we are using here what this what what's happening here is that we have we have as we can see here if if we are taking H 2 H 2 will have two inputs one is one that comes from the previous unit which is H one the h one will be one of the inputs to H 2 and then they will be x2 which is not represented in this diagram but H 2 will have two inputs one is the input X 2 and the output of the previous unit which is H 1 so that's what is happening here we have h of t minus 1 which is nothing but the output that comes from the previous state and then we have XT here which is the original input at that specific time stamp T and both of them are put out put together with their weight mattresses and then we have B 1 which is the bias term so putting all of these things together we asked it to an activation function which is sigmoid here and that's how we get HT and then when we have to calculate the back propagation we will be having these terms here as we have 
seen here to calculate the partial derivative of H 3 will be using H 1 so this is more of a formal representation of that so what's important here is that when this WH is smaller then this term gets smaller and smaller as as we progress that's what a vanishing gradient does there are two blocks which I felt was really helpful we will be going through those which will maybe which I hope that would give a better idea of this so this this is another interesting observation bear so if we hit in every state in a recurrent neural network and have will have its own input unit its own hidden unit and as discussed depending on the problem it might have an output unit so now let's say that something that we can notice here is that we have one lost isolated here which is J 2 and we have J 4 and as we can as we can notice in this figure here whenever we occur if we have to update the weights for H 1 we see that the loss that is calculated here at J 2 might give better representation or better results when we are trying to update the weights for this rather than J 4 and that's because again in the vanishing gradient gradient comes into picture here if the loss is calculated a very father step and we are trying to back propagate back and vanishing gradient comes in will be having gradients which are much smaller and smaller as we go farther away into the into the past timestamps so that problem can be solved by instead using the signals which are much closer so here instead of using j4 to update the weights which are farther away we can use something like j2 which would give better model weights so another way of understanding this is that in terms of language in text if you have a sentence with ten words you the word that comes at index 3 or position 3 D might be required to predict the word that comes at position 10 that may not always be the case but depending on the kind of text that were in the language and also how long how long the text days we might have to look back a certain number of positions to actually predict what comes next so gradient here here we can we can see how we can refer to that as a context so if the gradient keeps on decreasing over longer distances in this like our past words then we we we can't say for sure whether the world at position T E and T plus 1 are dependent or not or whether we have the wrong parameters for our for our model so here is an example so when she tried to print her tickets she found that the printer was out of toner she went to the stationery store to buy more toner it was very overpriced after installing the toner into the printer she finally printed her now simply going through this sentence we can say that the word that comes in this Bank is tickets but that's easy for us but how can a model do it how can a machine do it now for the machine to be able to predict it correctly it has to somehow math the word that comes here in the blank with the word that is much farther away which is tickets which means the model has to somehow contain in it in its context or representation that the word that comes here might be dependent on the word which comes here like the word at the position or the step 7 is impacting the word which comes much farther away so these kind of dependencies are tough to learn when the gradients keep decreasing due to vanishing gradient so here is an example now this is Olympus alum is nothing but a language model so the language model does two things one it takes a sentence or a group of words and and gives it gives a t-score or some 
sort of like confidence for saying how probable that sentence is to occur like how or how good it is in terms of the language that that is chosen and then it can also be used for predicting the word that comes next so here the center that we have is the writer of the books and we have to predict the next word now depending on how you see it there are two we always try to judge the word that comes next depending on the words that that we have seen previously whether it's whether we that the same thing is is being replicated for the models to be done so here we know that the writer of the books is easy the right word and that has to come in here now that the recency is the word which means we have to look certain number of words that have come in the past to predict what what word that comes next so there are two types of reason sees or context that we can say here one is syntactic data one is sequential so when it calls a sequential we are simply going to pick the nearest word if you are doing that then going by the word books the next word would be R since books is plural it's appropriate for us to use our but given the full sentence writer of the books here the blank is more dependent on the word writer than the word books so the difference here is between syntactic recency and sequential recency while the sequential recently says that R is the right word the syntactically since he says that ease is the right word now when we are using recurrent neural networks which have this problem with vanishing gradient they tend to go with sequential recency than the syntactic recency which means they're going to say that the next word is our rather than ease which is the right word so now so far we have talked about vanishing gradient which is when the gradients keep decreasing and they get smaller and smaller now the opposite of that is also possible where the gradient the gradient keep exploding or to get this start getting bigger and bigger now so what happens once we have the gradients personally it is that we have the learning rate and the calculated gradient is multiplied by that learning rate now let's say so with with with with what we discussed so far let's say we have a gradient which is very small and then we are multiplying that with a learning rate learning that itself is a very small is a very small number which means all of this term here is going to be very small when the gradient is small which means the weight updates that's going to happen is very small so if you started with some random weights and if you are using very small gradient updates you're not really going to see much learning your weights are not going to differ much so which means that is actually leading to bad learning or bad updates and though it doesn't really cause a lot of though it doesn't really lead to bad performance what it does is it is going to take a lot of time to train your model and so that's with ranch in gradient I think I mixed up things here so what happens when the gradient is small is that the weights here are going to be updated and we are going the Teton U is going to be much is going to be big which means you are taking two big steps so you must you must have come across this curve of the loss function where in whatever codes that you have taken you must have seen the steps that are being taken so when with exporting gradients what happens is your model takes two too large steps which means the weight updates are not happening the right way so that would lead to in finite or an in in your 
model which is not good so to overcome that we can use gradient clipping so gradient clipping is when you're setting some sort of limit for the weight updates that's what this pseudocode represents here so here you are calculating the gradient and then you are somehow saying if it is greater than a certain threshold value you're kind of clipping it you are kind of scaling it down that way you're not allowing your model to have weight updates which are greater which are bigger than a certain threshold value that's one way of sub gradient tripping is really helpful there so this is one example with and without clipping what we see on left is that we have certain we have a model that is we have weight updates are taking several small steps then from there we have one big weight update and then from there we have another big weight update which is not really ideal and this is what happens when there is no gradient clipping in place then we have just the same all right where we with clipping we see that the weight updates are much proper they're not going all over the place so the problem with vanishing gradient is that it makes it difficult for the ordinance to be the information or to remember things in their context which are actually necessary for them to accomplish the task accurately so this is what we have in a very simple ordinance the hidden state history again we're repeating the same thing that we discussed earlier so the hidden state is nothing but we have space weighted inputs the inputs as we know for our n NR 2 which is the actual input XT and the output from the previous state which is H t minus 1 and we are we are taking the weighted inputs and then we are passing them through an activation function that's what forms HT now this is though this this works when we want to preserve some sort of past information this doesn't really work well so if you remember apart from this if we remove one of these input terms this is what we have in the regular convolution neural networks weighted inputs plus bias they pass through an activation function that works for image kind of data because when you are training a model for things like classifications you don't really want your model to remember the previous input like the previous input in stock the next input would also be talked when you are having some sort of dog versus that classifier or even if the current image is different from the previous image that information is not really required for the model to classify dogs and cats but in this case with language and in some other related tasks you have to retain some information about the previous previous data so that is why we need a more sophisticated or complicated hidden state so that's what LS VMs provide so in addition to the hidden state HT that we have here a little stadiums have a another state called cell state now this is where they primarily deferred from the regular are and ends so they will be reading all about will be going through all of these in the blogs by Christopher Ola which is very good so this this is some math and form label behind Alice gems so what what happens in illustrates is that instead of having a simple hidden state you have these sort of gates which controlled the information retrieval and updation so I will skip this part from off as maybe going to do these in details in the blocks so this is taken from the same block that I am referring to we'll see which means there these are some of the tasks where Alice TMS are used by handwriting recognition 
speech recognition machine translation image captioning etcetera and with transformers and other architectures that have come in the past couple of years now they are also being used for many other sophisticated tasks so get to different neural net gated recurrent units a is very similar to LS p.m. except that it doesn't include a separate state in addition to the hidden state so there is no cell state here the same hidden state instead of being totally overwritten it it gets updated or reset at every state so what happens here in the regular RNA means that at every input state we see that the hidden input so we have h1 we have we have h2 here which is using h1 so which means whatever the information that we have in h2 is is dependent on h1 and the input that h2o receives which is x2 so this cell this hidden state information which is totally being overwritten by overwritten but when it comes to LST and what we have is we have hidden state and then we also have something called we have the city here so hidden state represents all of this and then we have CT which contains some sort of information from the previous cells and which is not totally overwritten but it gets updated accordingly so coming to the Darrow's do not have a separate cell state you only have one hidden state which means they did you things they come it's for your parameters and LSD ms so there is a rule that that is mentioned here it's good to start with LSD ends but if having more efficiency and performance is of necessity maybe then we can switch to Jeru so vanishing gradient or expert in gradient is it's not the problem it's not just in RN ends we have it in other neural network architectures as well so the pig love here is taken from Resnick's resonators is a quite popular scene in architecture so what we have here is we have an input X then we have very good layers like weights layer we have an activation their relu and then we get the output out of these layers which is f of X and now f of X will be then transformed to other layers in in the more in normally that's what happens but here what we have is we have a skip connection that connects from X to f of X so this way the the the information of X is being retained here so this helps both in information preservation and also at the time of back propagation these connections help in avoiding overcoming vanishing gradient problem there are other architectures which like dense mats so here in resonate what we have is we have a swift connection that connects the input from one layer to the output of that layer whereas intense net everything is connected to everything else so there is something like highway connection as well I I never really look I never really digged into it but maybe maybe I might do it sometime so this we have other variation variants for ordinance which is bi-directional Arnon so to give you an example the movie was terribly exciting now if you if you are trying to build a model to give a sentiment for this sentence if for everyone let's say the movie was terribly exciting if at every time stamp we are only considering what we have seen before then there is every chance that when it comes to by the time the model comes to tell you play and if it doesn't have the contact that comes later the model might think that it is negative but only if the model can look not only what comes before but also what comes later only then it will be able to say that it's it's a positive it's a positive sentence so how can we do that it's not possible with this sort 
of architecture but by using something like this which is called bi-directional ordinal we can say that it's it's a positive statement so how how original ironwork works is this here they we have forward or inane we have two elements here one is forward garden and backward Cardinal in terms of structure each of these two are similar to what particular RNN looks like so think of all of these red red cells so these represent forward RNN which means the sentence is travels from left to right like how we normally read it the movie was tragically exciting and then we have another errand that that chavez is from right to left which means exciting terribly was movie though so what happens is that it each time each time each of these hidden states are concatenated here for the world terribly this will contain only the representation for the words that have come before terribly but nothing about the words that have come after but here what happens is for the same word terribly we have a representation that comes from the words that that are before terribly and also the words that come after terribly so this way the model gets information on information from both sides instead of just the words that we have seen previously so this again is another kind of garden where instead of having one layer hidden States we have several layers in this example we have three and each of these hidden hidden their cells are connected to the hidden layer cells that come in the next time stamp time stamp so this is all from the slides we will not go through the through the blogs so before that if anyone have any questions or if anyone wants to add anything I think you can do that now I will quickly stop sharing the slides and share the blog okay so this is a blog that I was referring to before we go to this let me quickly show you this this is something that we that we went through in one of the previous meetups what we have is we have an input layer here I 1 I 2 then we have this very simple neural network with 1 in input layer one hidden layer and one output layer so we have an input layer with two neurons a hidden layer with two neurons and an output layer with two neurons w1 till w8 are the weights which means these are the parameters that we have to learn they are randomly initialized to some values then after every feed forward phase calculate after calculating the error these weights are updated so more or less the same thing happens with our earnings as well so so this this is another representation of an origin what we have here is we have each input x1 and x2 till XK and each of them is weighted like we have a WX and WC which are the weight mattresses which means these are the parameters that have to be learnt by the model and what we see in orange each cell that has a pan H is our hidden state so tan H here the first one on the left is calculated using the weighted input WX x1 so it's not to present it here for the for the first for the first input state we are going to have a default hidden state that comes from here normally it could be initialized to 0 so it could be randomly initialized and that could be learnt along with the other parameters as well so coming to this here x2 when it comes to the timestamp to the input is X 2 and that is weighted by WX and here tan H is calculated using c1 which is nothing but the output that comes from the previous state that's waited with WC so here X 2 and C 1 will be the inputs to this hidden state now and finally what we have is HK which is the output from this 
from this network and what we are trying to do is we calculate the edit here by passing this through a softmax function and calculating the error now what we have to do is we have to calculate we have to do the back propagation which means we want to know how this error function changes with respect to the weights this is what we want to know now once let's say that we have calculated back term and then what we have to do is the weight matrix has to be updated how do we do that we take the gradient that we have calculated and that is being multiplied by a learning rate alpha and then this term is being subtracted from the previous or the original weight matrix that we have which are initially which are randomly initialized in the beginning now what you see here is an expansion for that for this so there is a little bit of math here but again everything everything boils down to chain rule so we have we are trying to calculate the partial derivative of the error with respect to the weights and using chain rule we get this so one thing that you have to understand here is that so vanishing period and besides is dependent on the kind of act activation functions that we use now here as its mention here so when K is large so think of think of this here we have pi t t2 till K and this this is repeated till K so when K is very large we see the elastic this one it tends to vanish which means we'll be having the vanishing braiding problem because the activation function that we have used here is stanitch so as K keeps on increasing which is at this function gets smaller and smaller or it 10 H which is equal to 1 the derivative of tan H is equal to 1 so what happens is if for bigger values of K the derivative of this activation function it keeps on decreasing so let's say when when K is large enough what happens is this term totally tends to 0 and when this is 0 which means the weight updates are more or less not actually happening so when this is 0 we are multiplying learning rate with a value which is very much close to zero which means this entire term will be 0 and weight updates will be more or less the same like even after certain step in certain epoch the weights are going to be pretty much the same pretty much the same which means the model is not really only much now let me go back to Christopher loss blog so if you if we note this loop for a while what we have is a simple neural network we have an input XT we have one hidden layer represented by a and then we have this output layer now the loop here is what makes this different the loop here is what makes this a recurrent neural network so once we expand the loop this is the structure that we have we have an input X 0 and for that we have impeded we have this hidden state and we have the output that comes out of this and when it comes when we go to the next input X 1 it can be a calculating it's hidden state but this time we are we are also taking the output of the previous state here which is represented by this arrow so this is how it keeps on going so we discussed all the problems that come with this sort of simplified very simplistic or recurrent neural network architecture what we need is something which is a bit more sophisticated something which can remember the long term dependences so here when you're trying to let's say you're trying to predict the word here for the input X 3 and maybe knowing what comes at X 0 and X 1 might suffice but in another example like this when you're trying to predict the output of XT plus 1 you might 
need something you might need some input that has come much much before like here which is much farther away so Iranians don't really work that well for these kind of for these kind of situations where the long term dependencies has to be have to be retained and that's where our lesbians come in so this is what we have in a simple RNN so we are getting the input from the previous hidden state which is represented by this arrow here and then we are getting the another input from the actual input XT and both of them together concatenated and they're passed through an activation function and that forms the output of this time stamp which is passed to the next hidden unit or the next computational unit so now what happens in illicit ian's is that we can see already that this very simple internal component is replaced with something which is which is a bit more complex so we will see what each of these thing dance so here this is this is what we have in addition to the hidden state we also have another state here which is represented by CT so let's walk through each of these components so what we have here is a forged gate so this called a forged gate layer now think of a simple example where you you are using some somebody's name like Alex is in class now but you're saying something about him in the following in in the sentences that come that follow you might you you may not necessarily use his name again you might use a pronoun and it's different plans and face studying computer science so here for the model to know whether to use he or she the model has to know whether the model has to know the gender of the person is it is it a male or is it a female so depending on that information it has to pick up the pronoun accordingly so what forget gate does is it says what information to keep so it the output of that is between 0 & 1 0 & 1 which again is a score so it's output is 1 it means that whatever the information that comes in to this gate keep it just let it go as is don't make any changes if the output is 0 it represents that okay we don't need an information so when does that happen again going back to the same example Alex is in class he's studying computer science and we have something that comes next like saying maybe Helen is also in the class she is she also studies computer science now here the model has to switch from from the pronoun he - she now which means there is a little bit of forgetting that has to be done so that's one example here that's what is done by the forget Gately here so what what will be the input for that is here as we can see in the diagram the input for this for this gate ft is nothing but the input XT and the input from the previous cell which is HT minus 1 and both of those they they are concatenated together then we have wait for this for Cattleya WS and a bias beer so all of these this put together is passed through an activation function sigmoid and that will be the output of F T so next we have another other layer called input layer so the input gate layer Desai it's like which values have to be updated so here now in a sentence when something is changing like in from the previous example when the gender is changing so that has to replace the old one so that that thing that is being added here by the input layer so as we can see here in addition to all of these we have this C T here this continuous big line so this is the entire information that we get from the previous layer so depending on what way of forgetting and what we are learning and what we are 
adding C T gets updated as well that's what we see here CT is being updated here so all so once we have each of these things we see that CT which is here we we are taking or the computations from the forget layer F T and that is being multiplied with the previous cell states et minus 1 and then we also have an input layer input gate I T and another layer here which is C D so all of these things are put together and that helps updating CP accordingly so the this these are all the computations that happen inside each lsdm cell or hidden state so here we have some other variants of lsdm we're just even two to see from the diagram here the input the inputs that we have here XJ and HT minus one they're only passing through each of these gates the forget get the input gate but they're not directly passed to the cell state C T but in some of the radians of LSD and we see that the input is also passed to here so that is one variant of Elysium so this block here has has some explanations for how the back propagation works for LS TMS how chain rule works I will be sharing this link in the channel so that it might help you all so I think I finished going through what all I had for today so we are now open for any discussions or questions so using that LT excuse me LS TM it does a lot more than just give you varied weights like it varies the weights because you have waits for the forgettin let forget gate and all those other different parts but it does more than just vary the weights but burying the weights is helpful in trying to remove the vanishing gradient problem is that right yeah so in a simple errand and we see that we have shared weights so it's like the same things are being reused again and again so what happens in a hidden state of a recurrent neural network is that everything gets overwritten whereas here the Alice Tian's hidden states are going to decide like what information has to be kept and what information has to be forgotten and they do that with the help of all these weights that come with each of these gate layers okay thank you are there any other questions and did anyone get a chance to look into assignment three or work on it okay I have work tricked oh hi I'm under ok so how was it like I didn't get a chance to work on it yet it seemed pretty straightforward it did take me most part of a weekend ok so would you like yeah yeah he's going I think it was much simple than the it seems simpler than second homework sure so would you like to like maybe discuss with us in in the next meter if possible yeah yeah I think I can sure okay so for the next week meal then we can start with the assignment discussion and then depending on how we are with time we can go with lecture 8 yeah I don't think it will take the entire part so maybe we should have something else sure so I'll be I'll be taking up a lecture 8 once you are done with the assignment ok yeah and just so on October yeah I not I'll not be able to host the meetup on October 5th like it's like few weeks away I just wanted to know if anyone would like to host the meetup on October 5th I'll drop a message in the channel as well so that people who are not on call right now they might chip in if they're interested so next week you can still start the call right because yeah yes I will I'll be - it's it's just October 5th I'll be travelling and I will not be able to host the meter October fix today's but it's it's still a few weeks away so that's not good so is there anything else that anyone wants to discuss does anybody know like 
do people use do people still use our enhance or is everything transformers now you mean the regular ordnance versus the modified no any for any any form of foreigners like LSD MJ aureus anything is fine but has and it has everything been replaced by transformers as far as I know I mean I may not have a lot of expertise to answer that but as far as I know I think Ella students are still being used because I think a research for as everything is transformers yeah but I think it's going to take some time you know to to move totally away from lsdm so especially I think for the time series data so the coming weeks I think we have attention Q&A natural language generation so I think we'll be having more practical means discussion in stem theory so yeah so if there is nothing else for now I will be sharing the links in the channel so if you have any questions or if you found any useful links or resources please do share them in the channel so that everybody else can go through them as well and thank you all for joining Thank You Amanda and see you all next\n"