DeepMind x UCL RL Lecture Series - Function Approximation [7_13]

**The Power of Deep Q-Networks: A Fundamental Understanding**

Deep Q-networks (DQN) have revolutionized the field of reinforcement learning, enabling agents to learn complex behaviors and make decisions in dynamic environments. In this article, we will delve into the inner workings of DQN, exploring its key components, update functions, and innovations that have made it a cornerstone of modern reinforcement learning.

**The Basics of Deep Q-Networks**

A deep Q-network is a neural network trained to approximate an action-value function, also known as a Q-function. The Q-function estimates the expected return (the cumulative, typically discounted, reward) for taking a particular action in a given state and following the policy afterwards. In online Q-learning, these estimates are updated after every transition, and the agent acts on its current estimates, for example by picking the action with the highest estimated value most of the time.

**Stacking and Processing Observations**

One design choice in DQN is how raw frames are turned into the agent state. The most recent observations, or "raw frames," are preprocessed and stacked into a single input to the neural network, and this stack is used as the agent's state. Because a single frame does not show motion, for example the direction in which an object is moving, stacking a few consecutive frames gives the network a short window of memory that individual raw frames lack.
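The stacking step can be sketched in a few lines of Python. This is a minimal illustration, not the original DQN preprocessing pipeline: the class name `FrameStacker`, its methods, and the default of four frames are assumptions made for the example, and frames are treated as opaque objects.

```python
from collections import deque


class FrameStacker:
    """Keeps the k most recent frames and exposes them as one agent state.

    Hypothetical helper for illustration; frames can be any object.
    """

    def __init__(self, num_frames=4):
        # num_frames=4 is an illustrative default, not taken from the text.
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # At the start of an episode, fill the stack with copies of the first frame.
        self.frames.clear()
        for _ in range(self.num_frames):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Oldest frame first, newest frame last.
        return tuple(self.frames)
```

In a real agent the returned tuple would typically be converted to an array and fed to the network; the point here is only that the agent state is a function of several recent observations rather than of the latest one alone.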

**The Role of the Replay Buffer**

In purely online Q-learning, each transition is used for a single update and then discarded, which is sample-inefficient and makes successive updates highly correlated. DQN instead stores past transitions in a replay buffer, up to a fixed capacity (e.g., 1 million), and samples from it to form updates. This allows the algorithm to learn from new experiences while still reusing previously seen states and actions.
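A bounded replay buffer with uniform sampling can be sketched as follows. The class name `ReplayBuffer`, the ring-buffer layout, and the choice of uniform sampling without replacement are assumptions for illustration; transitions are stored as opaque objects, for example `(state, action, reward, next_state, done)` tuples.

```python
import random


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are overwritten."""

    def __init__(self, capacity=1_000_000, seed=None):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0  # position that the next overwrite will use
        self.rng = random.Random(seed)

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest transition.
            self.storage[self.next_index] = transition
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)
```

A list with a write index is used instead of a `deque` so that random access during sampling stays cheap when the buffer is large.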

**The Target Network**

Another innovation of DQN was the introduction of the target network. The target network is a copy of the main Q-network whose weights are held fixed and only periodically synchronized with the main network's weights. It is used to compute the bootstrap targets, so the regression targets do not shift with every update to the main network, which makes learning more stable.

**The Update Function**

The update in DQN is similar to online Q-learning, with one twist: the bootstrap target (the reward plus the discounted maximum action value in the next state) is computed with the target network, while the gradient step is applied to the main Q-network, typically on a minibatch sampled from the replay buffer. This allows the algorithm to reuse past experience while still adapting to new data.
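The roles of the two networks in the update can be sketched with a deliberately simplified stand-in. In the sketch below, plain dictionaries that map a state to a list of action values play the part of the main and target networks, so the "gradient step" reduces to a tabular update; the function names, the step size, and the discount are assumptions for illustration, not the DQN implementation.

```python
def q_learning_target(reward, next_action_values, done, discount=0.99):
    """Bootstrap target r + discount * max_a q(s', a); no bootstrapping at episode end."""
    if done:
        return reward
    return reward + discount * max(next_action_values)


def dqn_style_update(online_q, target_q, transition, step_size=0.1, discount=0.99):
    """Update online_q in place for one transition; the target comes from target_q."""
    state, action, reward, next_state, done = transition
    # The next-state values are read from the *target* table, not the online one.
    target = q_learning_target(reward, target_q[next_state], done, discount)
    td_error = target - online_q[state][action]
    online_q[state][action] += step_size * td_error
    return td_error


def sync_target(online_q, target_q):
    """Copy the online values into the target table (done only occasionally)."""
    for state, values in online_q.items():
        target_q[state] = list(values)
```

In an agent loop, `dqn_style_update` would be called on transitions sampled from the replay buffer, and `sync_target` only every so many updates, so the targets stay fixed in between.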

**The Optimizer**

DQN agents use RMSProp as their optimizer. RMSProp keeps a running average of the squared gradient for each parameter and divides each gradient step by the square root of that average, so the effective step size adapts to the magnitude of recent gradients.
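The RMSProp rule itself is short enough to write out. The sketch below is a plain-Python version of the basic rule operating on lists of floats; the function name and the default hyperparameters are assumptions for illustration and are not the settings used in DQN.

```python
import math


def rmsprop_step(params, grads, mean_square, step_size=0.01, decay=0.9, epsilon=1e-8):
    """Apply one in-place RMSProp step to a list of parameters."""
    for i, grad in enumerate(grads):
        # Running average of the squared gradient for this parameter.
        mean_square[i] = decay * mean_square[i] + (1.0 - decay) * grad * grad
        # Scale the step by the root of that average; epsilon avoids division by zero.
        params[i] -= step_size * grad / (math.sqrt(mean_square[i]) + epsilon)
    return params, mean_square


# Usage: minimise J(w) = w^2, whose gradient is 2w, starting from w = 1.
w, ms = [1.0], [0.0]
for _ in range(1000):
    rmsprop_step(w, [2.0 * w[0]], ms)
```

Because each step is divided by the recent gradient magnitude, multiplying all gradients by a constant leaves the step almost unchanged, which is one reason this family of optimizers is less sensitive to the scale of the loss than plain gradient descent.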

**Reinforcement Learning vs. Supervised Learning**

Replay buffers and target networks make the reinforcement learning update look more like supervised learning: sampling from the replay buffer yields minibatches that are less correlated than consecutive online transitions, and the target network keeps the regression targets fixed between synchronizations. This matters because deep networks and their optimizers were largely developed and tuned for that setting.

**Deep Learning Aware Reinforcement Learning**

The work on DQN has shown that reinforcement learning can be made more effective by incorporating deep learning techniques. By modifying the network architecture or optimizer, reinforcement learning algorithms can be designed to take advantage of the strengths of deep learning. This research has opened up new avenues for exploration in the field of reinforcement learning.

**Conclusion**

Deep Q-networks have revolutionized the field of reinforcement learning, providing a powerful framework for learning complex behaviors and making decisions in dynamic environments. By understanding the key components and innovations of DQN, we can appreciate the significant contributions it has made to the field of reinforcement learning. As research continues to evolve, we can expect to see new breakthroughs and advancements in deep reinforcement learning that will continue to shape the future of this exciting field.

"WEBVTTKind: captionsLanguage: enhi and welcome to this seventh lecture in this course on reinforcement learning my name is hadofan hustles and today i will be talking to you about function approximation and reinforcement learning in terms of background material um chapters nine and ten from saturn embargo i are highly recommended and we'll cover some material that's captured in chapter 11 as well and in terms of context we are always considering this interaction loop where there's an agent interacting with the environment by picking actions and observing the environment and then reinforcement learning can be considered the science of learning how to make those decisions how can these agents deal with this interaction stream and somehow big actions that give it larger and larger rewards the rewards not depicted in the figure could be inside the agent's head or they could be part of the observation depending on how you want to model your problem inside the agent in any case there will be some sort of a policy that determines the actions and in addition the agent might have value functions or potentially models as well the general problem involves taking the kind into account time and consequences because the action influences the immediate reward but also potentially the agent's state and also potentially the environment state which means that certain actions might have very long-term consequences in addition these consequences might not be immediately obvious for instance an action might change some internal state variable in the environment and it might not be immediately observable what that means or what that matters but it might influence later observations or rewards now in this lecture we're going to talk about function approximation and first we're going to talk about why we need that so first we're just simply pointing out in some sense that the policy the value function the model and the agent said update and all of these things that are inside the agent 
can be viewed as being functions. For instance, a policy maps the agent's state to an action, or to a stochastic policy from which you can sample an action. A value function maps a state to a value, which is an estimate of the cumulative reward into the future. The agent state update itself maps the previous agent state and the observation into some new agent state. So these are all functions, and we want to learn these from experience.

Why do we want to do that? Well, multiple reasons, one of which is flexibility. As we talked about in the very first lecture, I quoted a paragraph from a paper by Alan Turing there, where he argues that maybe it's easier to program something akin to the mind of a child than it is to program something akin to the mind of an adult, because maybe it's actually easier, and also maybe better, actually more flexible, to write something that can learn from experience than to try to list all of the rules you may have learned from experience. In addition, if there are too many states, we will need some sort of approximation. This is a separate point, because this is no longer just about learning; this is really about function approximation being important. When we use neural networks to represent these functions, this subfield is often called deep reinforcement learning these days. That term is actually relatively novel, it's like seven to eight years old, but the combination of reinforcement learning with neural networks is in some sense a quite natural one and is actually fairly old: it has been suggested at least in the 70s already.

So in this lecture we'll concretely consider predictions, so value functions, including some value-based control, so using these predictions to come up with policies. In upcoming lectures we'll talk more about off-policy learning and approximate dynamic programming, which is the term we use to refer to this more generic field of study where we're not necessarily sampling, but we're just considering: what if you have
approximations in your dynamic programming algorithms? This could include sampling, this could include function approximation, but it could also be from other sources potentially, and this will allow us to talk about these updates in a theoretical sense and analyze what they are doing. This will also be an upcoming lecture. We'll also talk about learning policies directly. As I mentioned, the policy itself can be considered a function, so you could consider updating that function with policy gradients. We briefly talked about those in lecture two; we'll come back to that in a later lecture, and we'll talk about model-based RL as well. In this lecture we're focusing on predictions, because it's nice to be able to focus on something when we talk about function approximation, but a lot of the insights do transfer to these other cases as well, as we will see.

In terms of motivation: why do we want to do that? Because we're ambitious. We want to solve real problems, we want to solve large problems, and actually even small problems are large problems. The game of backgammon, one could argue, is not particularly large compared to the real world, it's just a game, but it turns out that backgammon already has 10^20 different possible states. So if you try to enumerate all of those in a large table in memory, you quickly run out of space for these types of problems. If we then go to a different game, like the game of Go, we notice that this actually already has 10^170 states. That's an enormous number. You can't feasibly store that in memory in a computer and then try to learn a value function for each of these separate states. Actually, it can even get worse. For instance, let's consider helicopter flying, in which the state space could arguably be taken to be continuous. The state space could, for instance, be the position of the helicopter in space, so there's this three-dimensional location, but in addition maybe there's
other inputs as well. There might be a wind sensor that tells you from which direction the wind is blowing and how fast it is blowing, and these inputs could all be real-valued numbers, which means that the state space is inherently continuous, which means that if you would like to enumerate them, there are infinitely many states. And of course, maybe as the ultimate example, you could think of a robot in the real world, where the environment is actually as big as the real world is. In that case it's obviously impossible to enumerate all of the states, because any computer that would have to do that would have to live in the real world and hence would necessarily be much, much smaller than the world is. So this is a common thing that we have in mind when we think about reinforcement learning: we want algorithms that can be applied to the setting where the environment is just humongously large and the agent must necessarily be a lot smaller. So the mind of the agent, including its memory capacity, must necessarily be smaller than the environment. There are of course smallish problems in which you could still enumerate all of the states, and in that case maybe you could try just doing tabular approaches, but in general we want these algorithms to also apply to these larger problems. So this is our motivation for why we want to consider function approximation, and the intent is of course to apply the methods that we've already been talking about for prediction and for control.

So let's talk a little bit about value function approximation now. So far we've mostly considered lookup tables, although I've already mentioned linear function approximation and other things in earlier lectures. In the tabular case, each state has a separate entry and we just update that, or each state-action pair has a separate entry. The problem with that is that there are too many states, as I mentioned before, so you cannot fit that in memory. It turns out that's not
the only problem. A different problem is that it's also too slow to learn the value of each state individually. Think of it this way: if you have a problem that is not too large, let's say there's a million different states, we can store a million numbers in modern computers, that's not a problem at all. But still, if you would update each of these state values individually and separately, then it would take quite a bit of experience to learn reasonable value estimates for each of these states. And indeed this feels inefficient if these states have some commonalities, which we would then be completely ignoring. So if these states can instead be viewed as being similar to each other, then maybe you would expect, or maybe you would want, the value of a certain state to be updated when the value of a very similar state is updated. This is called generalization. In addition to those two points, individual environment states are often not fully observable. So the environment might not just be big, it might also not be fully observable from your observation stream, so we need to somehow infer what's going on, at least up to the point where we can make meaningful predictions or pick meaningful actions.

So our solution for all of these problems essentially will be to introduce function approximation. For now we're just considering predictions: as mentioned at the beginning, we're going to consider value prediction in this lecture. What that means is we're going to introduce some parameter vector w, as before, where now the value function is parametrized with this w. It's a function of state somehow, where the state is the agent state, and we're approximating either the value of a certain policy or maybe the optimal value function. The algorithms we'll get to in a moment, but the idea is that we're going to approximate these and then update the parameters, for instance using Monte Carlo algorithms or TD learning or any other similar algorithm that you might want
to use, and then hopefully, if we pick our function class correctly, and we'll talk about that more, we will be able to generalize to unseen states. In any case, if you have such a function, the values of these unseen states will be well defined. You could of course also have this for a table: if you know all the states that you could possibly end up in, you could at least initialize the values to be, for instance, zero, arbitrarily. But if you have a function class that automatically maps every possible input to a valid output, then you automatically have this mapping, and you have a valid value estimate for every state.

In addition, I just want to briefly talk about the agent state, but I'll not go into a lot of depth in this lecture. I'm just going to mention it because it's very, very important, but we'll get back to that in a subsequent lecture. I mentioned that one of the problems might be that the environment state is not fully observable, so essentially the environment state is not equal to the observation. Then, as always, we're going to rely on the agent state instead. We're going to show the agent state update function here briefly, which we could also consider to be a parametric function of its inputs, which are the previous agent state, the action, and the observation, and potentially the reward as well if that's not part of the observation, and it has some parameters, here denoted omega. Now, as we already discussed, you could pick the agent state just to be, for instance, the observation, at which point omega is maybe not relevant; maybe there are no parameters for that function. But in general that's not particularly useful, right? If you look at just the observation, you might not see everything that you need to see. So this agent state, which we here denote with a vector notation, but we will continue to use the random variable notation as well, should contain everything
that's useful for you to make value predictions or to pick your policy, and that means that the agent state update function is actually quite important and might need to capture things like memory. For instance, it might be that you have to remember what you've seen before. We talked about an example in the first lecture where maybe a robot has a camera and can only look in front of itself, can't see what's behind it, but maybe it needs to remember what is behind it to pick suitable actions. This would be captured in the agent state, and maybe we should think about learning that. In this lecture we won't go into that, but I just wanted to remark on it, also to be clear that in the subsequent slides, whenever I have a state, it's always going to be the agent state, and this is basically just going to be some vector or the observation.

Okay, now we're going to talk about a couple of different function classes in a little bit more detail, just to talk about the differences between them. We started by talking about the tabular case, where we just have a table, so our function class here is tabular. One step up from that would be state aggregation. In a table we have an entry for every single state. For instance, let's consider for simplicity that the environment is fully observable, so you literally observe the environment state, and you just have an entry for every environment state. In state aggregation, you would instead have an entry that merges the values of a couple of states. So basically we just partition the state space into some discrete, potentially small, set, so that we can learn the values of these partitions instead of learning them for every individual state. And then indeed we would get some generalization, because all of the states in the same partition would then be updated with the same algorithm and hence would obtain the same value. So if you reach a new state in a certain partition that you haven't seen before, then it would automatically already have a value, and it might be a
suitable value. Now, one step up from this will be linear function approximation. I want to for now consider a fixed agent state update function and then a fixed feature map on top of that. This is why things will be linear, because we're going to assume that the agent state update function, for instance because you just pick the observation or do some other fixed function there, doesn't have any parameters, the feature map doesn't have any parameters, and instead all that we're going to learn are the parameters of the value function, which uses those features as input. So there is still an agent state update function, it might be a simple one, but we're just going to ignore that in terms of the learning process.

Now note that state aggregation and tabular are actually special cases of this more general class. In the tabular case, we could just consider the feature vector to have as many entries as there are states, and then the entry of the current state would be one whenever you're in that state and all of the other entries would be zero. So it's a one-hot representation, as it is sometimes called, where the state is kind of picked out by the feature vector. State aggregation is very similar, except that the size of the feature vector is no longer necessarily the same as the number of states. So n, as we call it here on the slide, the number of outputs of this feature mapping, so the number of features, would not be equal to the number of states, it would typically be much, much smaller, and you could indeed even apply this to continuous state spaces. But it would still be a one-hot representation when we do state aggregation. So there's some mapping that says: if you're in this state, then this entry must be one and everything else must be zero, and if you're in a similar state, maybe the same entry will be one and everything else will be zero, but if you're in a different state, a state further away for instance, then a
different entry would be one. So it's still a one-hot representation, but it's a smaller feature vector than if you would do the fully tabular approach. And both of those are special cases: you don't need to have a one-hot feature representation, you could consider much richer function classes here.

Now, one step up from that, if we're going to be even more flexible, would be to consider differentiable function approximation. In this case our value function will be a differentiable function of our parameters w, and it could be nonlinear. For instance, it could be a convolutional neural network that takes pixels as input. You don't need to know right now what a convolutional neural network is; if you don't know what that is, it's just some nonlinear function. One way to interpret this is that you can consider the feature mapping now no longer to be fixed. So we take whatever the agent state is as input, and for instance the input could just be the observations, could be the pixels on the screen when we're, say, playing an Atari game or something like that, and then instead of having a fixed feature mapping, you could consider this to be a learned feature mapping, where something maps it into features, and then there's an additional parameter on top of that that maps it into values. In the notation we're going to merge all of those parameters into one big vector w, but it won't be a linear function, it will be some nonlinear function where these parameters are used in different ways. What we're requiring here, and why we call this differentiable function approximation, is that we can still compute gradients, because we're going to consider gradient-based algorithms. That's not a strict requirement, but it is one that we're going to use here, and it allows for very rich function classes such as deep neural networks.

Okay, in principle any of these can be used, but it's good to acknowledge that reinforcement learning does have certain specific properties, or maybe often has certain specific properties, which
might interact with the function class, so it's good to be aware of those. The first one that I'm going to mention is that the experience is not independent and identically distributed, by which I mean that if you're going to learn online from a stream of experience that's just coming at you, then successive time steps are actually correlated. This is not necessarily a problem. I'm not saying it because it's necessarily a problem, but I'm saying it because it's slightly different from the case for which many of these function classes have been developed. For instance, deep neural networks were originally often developed and tuned in cases of classification, for instance MNIST digit classification, and then some choices that were made there might not be completely appropriate for reinforcement learning, because we're breaking some of the assumptions, such as that we can sample each of these learning examples independently.

In addition to this, the agent's policy affects the data it receives. What I mean by this is that reinforcement learning is an active problem, and this has multiple consequences. One benefit of this is that you can maybe actively sample data in a way that is useful for learning your function. You could literally have an exploration technique that seeks out certain parts of the state space in order to improve your value estimates specifically over there. It's not quite active learning: within the field of active learning, the typical assumption is that you have a database and you want to actively pick which samples to use, but you can pick from all of them. That's not typically the case in reinforcement learning, because of the way we're interacting with the world: if you want to sample a certain situation, you actually first have to go there. But it is an active process, and that's interesting, but it also causes some issues. For instance, the regression targets are often non-stationary. This
is, for instance, because we're doing control: we might be changing our policy as we go, and this might not just change the targets for our regression, our value targets, for instance if we bootstrap, but it might also change the data. So it's actually non-stationary in more than one way. In addition, like I said, bootstrapping is important. Even if the policy is fixed, so even if your policy is not changing, the targets might change, because the value estimates that you're using to construct your targets could be changing. In TD we're using our value estimates, and those initially are not very good, but later they might become more accurate.

In addition to this, the world itself might be non-stationary. For instance, there might be other learning agents in the world, and maybe we also want to deal with those cases. And similarly, but differently, the world might just be very, very large, so this might mean that you're never actually quite in the same state. These last two points are actually quite closely related, because if we allow for the world to be partially observable, then if the world is sufficiently large, it might appear non-stationary to you. You might think: oh, I'm going to that room, I'm going somewhere else, and going back to that room, things have slightly changed. Maybe the sun is now setting, whereas it wasn't setting before, or maybe all of a sudden there's a different agent in the same room. And this is maybe not literally non-stationary, maybe the actual underlying, say, physics of the world are not changing, but you don't know all the latent variables in the environment state that make these things change. So in some sense non-stationary and very large are similar, but of course they have different connotations. You could have a very tiny non-stationary problem as well.

So which function approximation should you choose? We covered a couple of function classes. Well, it kind of depends on your goals, and one distinction to make is that,
for instance, in the tabular case, we have really good theory. We understand these algorithms really well: we know when they converge, we know where they converge, they're quite stable, and the assumptions for convergence are not very strict, so you can fairly easily apply them and be pretty certain that they will learn something. In addition, I didn't put this on the slide, but they actually also tend to learn relatively well and fast if the state space is small enough. But they don't generalize, they don't scale: you can't apply them to very large problems, as discussed.

Now, in the linear case we're already a little bit more flexible in our function class. We still have reasonably good theory, so we understand these methods quite well: we know when they converge and where they converge in many cases. But of course we are dependent on this fixed feature mapping, so we do require very good features. Now, this is a bit of a subtle point, because people are still debating this in some sense, because there are multiple aspects to this. One is that of course you're going to be limited by the features that you have, and maybe that would be a good reason to learn these features. Most people these days are in that camp. But you could also argue that maybe it's okay to have somewhat poor features as long as you have enough of them. So the alternative would be to have linear function approximation but with a very, very rich set of features, and then maybe you still have a good enough set of features that you can pick out the ones that you really can use in the context in which you require them.

And then of course there's nonlinear function approximation, which is less well understood, but it scales really well and it tends to work quite well in practice, until it doesn't, and then sometimes we don't really understand what's going on in some edge cases. But of course our experience with this and our understanding of this has been growing quite a lot over the recent years, and it's really
flexible, and importantly it's much less reliant on picking good features by hand. I personally find this to be a very convincing point, because one thing that this means is that you can apply these methods to problems that you don't really understand that well. As long as you have a well-formulated reward function that you are certain is the right reward function, you could apply these reinforcement learning methods and then still find good policies for these domains without really needing to understand what appropriate features would be. And in fact we see over and over these days, not just in reinforcement learning but also in deep learning and related fields, that algorithms that don't try to hand-engineer features, but instead try to learn them from data, tend to outperform methods in which these features are hand-engineered. So that's interesting, and it relates again back to that point that Turing was also trying to make.

Deep neural networks are typically used in this last category of nonlinear function approximation. In some sense, depending on how you define these things, some people would argue that any nonlinear function could be argued to be some sort of a neural network. It might be a weird one, a very strangely structured one, but in some sense these are somewhat similar. So sometimes neural networks are used almost synonymously with nonlinear function approximation. These tend to perform very, very well in practice and therefore remain a very popular choice, and there's also lots of research of course happening trying to understand these methods better.

Okay, now this brings us to our next topic, where we're going to talk about gradient-based algorithms, first a little bit in general. So we're going to give just a very brief primer on gradient descent and stochastic gradient descent, and we're going to talk about this because of
course we're going to use that for value learning in a moment. So for now let's consider some arbitrary function J. This is just some function only of the parameters, right, there's no state input, there's just parameters as input, and we're going to consider its gradient. Just as a reminder, a gradient is a vector, and it contains as its elements each of the partial derivatives of this function J with respect to one parameter out of w. J here is a scalar function, by which I mean that the parameter vector can be large but the output is just a number, and we're considering the gradient of that function J with respect to the parameters, which we of course should evaluate at certain parameters, right: the gradient depends on where w is at this moment.

The goal could then be to minimize J. One way to do that is to move our parameters w in the direction of the negative gradient, and one way to do that would be the algorithm on the slide. There are actually different ways to do that as well that are similar but slightly different, differing in how they pick the step size or exactly how they move or the direction they move in, but a lot of the algorithms derive essentially from this main algorithm, which is the gradient descent algorithm, where we have some step size parameter and we just move slightly in the direction of the negative gradient. The one half here is optional, maybe I should get rid of that. This is kind of assuming that there's a squared loss, that J is some sort of a square, and that therefore the one half will cancel out with that. The step size parameter should be in some sense small, because if we take a small enough step, we're in some sense certain that J is going to decrease, as long as this function J is smooth enough. What I mean by smooth is that the function can't have a very big discontinuity. So if you know your function is somewhat smooth, then you know that if your step is small enough, the gradient will have pointed in
the right direction, and indeed J will go down. Of course, in practice there's a bit of a trade-off here, because you don't want to make alpha too small, since then learning is very slow. So instead you typically tune this to get good performance.

Now let's plug in value functions. Here we have the squared loss that I promised. Again, J doesn't take states as input; it's just a number that depends on your parameters w. Here we define it to be the expectation, over some distribution of states (for instance due to your policy and the MDP dynamics), of the squared difference between the actual true value of your policy (we're doing prediction on this slide) and our current estimate. That's something we might want to minimize; this is a predictive loss. Then we can just consider the gradient descent algorithm. Here the one half comes in handy, because it cancels out with the two that comes from taking the derivative of the square. We note that the distribution in this case does not depend on our parameters, so we can push the gradient into the expectation, and then we get this update: the update to the weights, delta w, is a small step size alpha times the expectation, under that distribution over states, of the target (our true value function for that state) minus our current prediction, times the gradient of that current prediction.

This is an expectation, so we typically can't use it unless we really have the distribution over states. If we want to use this in practice, we can sample it. That means we sample the state we're in, and we also sample this true value of the policy. One way to do that would be to plug in the Monte Carlo return, which is indeed an unbiased estimate for the
true value of the policy. So this stochastic gradient descent update on the slide is an unbiased estimate of the gradient descent update one line above it. That means that if we take our step size small enough and take enough of these steps, we will on average move in the direction of the gradient, and we will reach the same solutions as the full gradient algorithm would. In that sense stochastic gradient descent is fine, and it gives another reason to pick a small step size: not just because then the gradient is valid, but also to average out the noise in our estimate of the gradient.

We are often a little bit sloppy with notation, so this is a warning about that. Whenever we drop terms, say we write the gradient of v without any w's and without specifying where we're taking it, then typically we mean the gradient with respect to the parameters. So "gradient of v" means the gradient with respect to whatever the parameters of v are; if those are w, this means the gradient with respect to w. We're also typically taking it at whatever the current value of these parameters is. This is something you'd strictly have to be explicit about: the gradient itself is a function, and you still have to plug in the parameters at which you're evaluating it. In notation this is typically done with a bar, where the subscript tells you where we're evaluating the gradient. But people are often sloppy with this notation, and there are lots of shorthands. For instance, people might instead write the gradient with respect to w_t of the value function of w_t, instead of using the bar notation, and that's fine as long as it's clear from context what we're talking about. I just wanted to highlight that here so that you have a feel for this: if there's something that
gets dropped from the notation, how should we interpret it? It should be clear from context.

Okay, now we're going to talk a little bit more in depth about linear function approximation, which is going to be useful to understand the algorithms and also some of their limitations. We talked about this previously, but I'm going to reiterate a couple of things. We're going to represent the state by some feature vector, where this feature mapping x is considered fixed. We have n different features here; these are just numbers, and they're functions of state. The mapping might, for instance, take the observations as input, or more generally your agent state, and it outputs a feature vector of a certain size. We also introduced some shorthand: if we apply this function to the state S_t at time t, we can just write x_t, and we can pretend that these are our observations. As far as the algorithm is concerned, it only gets access to x_t; it doesn't get access to S_t anymore. So, for example, these features could include the distance of a robot from some landmarks, or trends in the stock market, or piece and pawn configurations in chess, or whatever you can come up with that might be useful for the problem you're trying to solve.

I'll also show you an example where you can find ways to somewhat automatically pick reasonable features for some problems. This is called coarse coding, and it's one way to come up with a feature vector. It's a little bit related to the state aggregation that we talked about before, but in this case we're not going to subdivide the space into disjoint chunks. What you see in the picture is that we've subdivided a two-dimensional space; for instance, you could think of this as a location if you want to
make it concrete; I didn't make it very concrete here. We're going to subdivide this space, but into overlapping regions. We have a feature associated with each of these circles: when you're in the circle the feature is one, and whenever you're outside of the circle it's zero. So if we're here, at this x, then with this representation three features will be equal to one and all the other features will be equal to zero. This is no longer a one-hot representation, as we discussed for state aggregation or the tabular case; it's a few-hot representation. The features are still binary, either zero or one, but it's already a richer representation, because as we can see from the figure, if these three regions light up we know we must be in the intersection of all three of them, and that can be useful information.

In addition, we still have some generalization. If we update the value associated with a state in this darkest region over here, then a different state, say at y, will also be updated a little bit, because it shares one of its features with the state being updated. If we think of the linear function that is applied to this, it has a weight associated with each of the features. So if we change the weight of a feature, any position that shares that feature, where that feature is one, will also experience a slight change in its value.

I described the binary case for simplicity, where you're either in a region or out of it and the associated feature is either one or zero, but of course that's not the only option. You could instead do something like over here, where you have some sort of distance function, which might be for instance Gaussian, and instead of saying
you're in a circle or you're out, we have a bunch of center points of these circles, and we measure how far you are from each of them. In this case, for instance, we might have this feature light up quite a bit, because we're fairly close to its center, and this one light up a little bit less, because we're further away from its center. Maybe even the feature over here still lights up very lightly, but much less so than the features that are much closer. Of course, similar to the circles, you would have to consider over what distance you want each feature to be sensitive, but this provides a way to be a little bit more continuous with your value functions.

Whatever you do there, the idea is that you have your representation, you expand it into a whole bunch of different features, and then you compress that with your weights into a single number. The weights are denoted theta here, but you can think of this as being the same as the w from the previous slide. Why do we say "expand" here? Note that in this case we had a two-dimensional representation. We could have had, say, an x-y location, which constitutes two numbers, and we could have fed that to the agent. But it may be really hard for the agent to turn those two numbers into a valid value estimate with a linear function. The only thing you can then represent is a linear function over the state space; you can't have any nonlinearity in how the value reacts to your location if you only feed in the x-y coordinates. So what we're doing here instead is turning those two numbers of this two-dimensional space into very many numbers, as many as we've defined circles. In some sense we're blowing up the representation in order to then be able to apply a linear function to that. This is a
common technique; it's also used in, for instance, support vector machines. It doesn't matter if you don't know what those are, but if you do, you might recognize this principle: we're blowing up the feature space into something large, in the hope that the feature representation will then be rich enough that a linear function suffices, so that if you apply a linear function to these features you still have a good enough approximation.

It might be immediately obvious that there are some limitations to this approach; if not, let me point at some. If you only have two dimensions, maybe you can define these circles and you're good to go, but what if you have many dimensions? What if you have 10 dimensions, or maybe even 100? This is not far-fetched. If we think of locations in the world, we might think that two or three dimensions is enough. But as I mentioned for the helicopter example, a helicopter might not just be at a location in three-dimensional space; it might also have other sensors. It might have audio sensors, wind sensors, or it might be measuring, I don't know, air pressure or humidity. These would all be dimensions, and it might be very hard to figure out a good subdivision of that space without making the feature representation extremely large. This is sometimes called the curse of dimensionality, a term due to Richard Bellman, also known from the Bellman equations.

Okay, so this is just meant to be an example of how you could somewhat automatically construct these features. You could imagine that if you have a fairly low-dimensional space, like two dimensions, you could just sprinkle a couple of these circles in there. Maybe you don't have to think very carefully about where they are, and maybe you don't need to understand the domain that well for
you to get a reasonable feature representation. But of course there is a choice here, and here are a couple of examples. On the left-hand side, in (a), we see narrow generalization: we've picked these circles to be fairly small, which allows you to be quite specific. The value function is updated only in the region that lights up, and maybe mostly in the smallest region, because of your representation. So if you update this state, states fairly near to it get updated as well, but states further away do not. That's both good and bad; this is a double-edged sword. The narrower the generalization, the more specific and precise your value function can be. If you need very high resolution in your value function, because the actual value function might for instance be quite bumpy or weird, then it might be useful to have very narrow generalization. But it does mean that learning might progress quite slowly, because very little information leaks in from states further away.

Instead you could consider broad generalization, where the circles and their overlap are much larger. Then updating just a state in the middle would update a fairly large region of your state space. This can be useful because it might speed up learning, but you lose out in terms of resolution in the limit: once you've learned your value function, there's a limit to how precise you can make it in every situation. And depending on your dimensions (we don't know what x and y are here), it could be that you want to pick something more custom, that the best representation would be asymmetric. So this already alludes to the fact that it can be quite tricky to pick these things in general.

The other thing I want to note is that
we're actually aggregating multiple states, and this means that the resulting feature vector, and hence what the agent observes, will be non-Markovian. What I mean by that is this: say you're in this little region here in the middle, and the agent moves right, but only a little bit. Then you end up in the same region, and the same features light up. So from time step to time step, while you remain within that small region, the feature representation does not change, and then at some point you step out of the region into the next one. This is non-Markovian because the agent can't tell at which time step that happens. In the underlying state there is of course a certain moment at which you transition. Let's say the effects of the actions are deterministic, so when you move right you actually move right; then the true transition is completely deterministic, but the agent doesn't know when it happens. As far as the agent is concerned, at some random point in time you transition from one region to the other.

This non-Markovian situation is the common case when using function approximation. It's not specific to this representation; it's just very easy to visualize here. Whenever you use function approximation, including deep neural networks, you should count on the fact that, as far as the algorithm is concerned, there might be some partial observability, because a good function approximator will generalize across states.

For the linear case specifically, we do want to consider when good solutions even exist, so it's good to do this mental check. If you're given a certain feature
representation, or if you're considering a certain feature representation that you're picking yourself, it's good to imagine: if I had the best possible weights, what would the result look like? Is it good enough? Is my generalization narrow enough that I can have good enough resolution, and is it broad enough that I can have fast enough learning? It's good to think that through for your feature representation, and sometimes you might catch things: oh, there's no way I could learn this value function, because it simply cannot distinguish between these two different important states, for instance.

Neural networks tend to be much more flexible in that sense. If you just give pixels to the neural network, it could itself figure out how to subdivide the space and come up with some sort of internal feature mapping that is more suitable to the actual problem you're trying to solve. You could imagine that if you could make these feature mappings flexible, this could be a lot stronger: you might have asymmetric generalization in some part of the state space, broader generalization in another part, and very narrow generalization in parts where it's really important to get the values right. A neural network can in some sense find that automatically.

Okay, now we're going to move on to linear model-free prediction. We're going to talk about how to approximate the value function with linear function approximation. We talked about this before, but just to be very explicit: we approximate the value function by multiplying the parameter vector w with the feature vector x, which we can write out explicitly as a summation over all of the components of both vectors; that's just a dot product. Then we can define our objective to be, as before, a quadratic objective where we compare to the true value v_pi. For now, let's just assume
that we have that true value; we'll plug in real algorithms in a moment. So for now we consider that we have those regression targets. There's going to be some distribution over states, called d here, which could for instance be due to your policy, or maybe you have some different way to sample the states. Then this algorithm can converge to the global optimum if we do stochastic gradient descent, regressing towards this true value function, even if we use a stochastic update where the stochasticity comes from sampling the state. The update rule, which you see here, is fairly simple in this case, because in the linear case the gradient of our value function is simply our feature vector. That means that our stochastic gradient descent update to the weights is the step size, times the prediction error (the true value v_pi minus our current prediction v_w), times the feature vector. The feature vector here is just the gradient of your value function; in the linear case those are the same.

Now, obviously we can't update towards the true value function if we don't have it yet, so instead we're going to substitute the targets. This is similar to what we discussed before when we talked about prediction. For Monte Carlo we could plug in the Monte Carlo return, but we could also do temporal difference learning and plug in the temporal difference target, which in the one-step case is just the immediate reward plus the discounted value at the next state. This is now a placeholder for the true value of this state, and we can use it to get data-dependent updates where we no longer have any privileged information about the true value of the state. Of course we can also go in between these with TD(λ), as discussed before, where G^λ is a λ
return, which bootstraps a little bit on every step and continues a little bit on every step, and this λ trades off between these two algorithms, TD and Monte Carlo. As mentioned earlier, we're not going to go into that now, but in an earlier lecture we turned this into eligibility traces, which give a causal algorithm that can update on every step rather than having to wait all the way until the end of the episode. So this is just a reminder that TD(λ) can be implemented so that it doesn't have to wait until the end of the episode, and in fact so can Monte Carlo.

Now, the return is an unbiased sample of the true value, and therefore we can view this as supervised learning, where we just have a data set of states and returns. The linear Monte Carlo policy evaluation algorithm, which uses that return and has the feature vector in place of the gradient, will then converge to the global optimum. This is known because we're trying to find the minimum of a quadratic function, which is convex, and stochastic gradient descent with an appropriately decaying step size will find the optimal solution. So that's nice.

This holds under some mild assumptions. One of the assumptions is that we sample i.i.d., but it turns out we can relax that assumption, and it will also find the right solution if you don't sample fully i.i.d., again as long as your step size decays sufficiently. There are some other conditions as well. Just to mention one: you can of course only find the global solution if your data supports that. One thing you need, for instance, is that you can never make an irreversible decision. If there's a state where you make a decision after which you can never go back, then the states before that decision won't be visited infinitely often, which means that convergence, for those states at least, is off the table. This is an assumption that's often called ergodicity; there are similar other
named assumptions as well, but it basically means that your MDP is connected, in the sense that you can always eventually return to states you visited before.

This also converges when using nonlinear value function approximation, although then it may converge to a local optimum. What we mean here is that if you have a nonlinear value function, your loss landscape might have multiple hills and valleys. Stochastic gradient descent on that loss landscape is still guaranteed to go down, but it's a local algorithm, so it could go down into some valley and get stuck there, even though there's a way to have lower loss somewhere else. It's still nice to have that convergence, because it turns out we can't always guarantee convergence, and sometimes you might have parameters that diverge, that go to infinity for instance. We'll see an example of that later in this lecture.

Of course we can also consider TD learning. We know that the target here is a biased sample of the true value, but we can apply the same trick, where we consider this to be training data: we have states and the associated targets for those states, and we update towards those in a very similar way. So this is very similar to Monte Carlo; it's basically doing regression, but using the TD error rather than the Monte Carlo error. This is now a non-stationary regression problem, because we're using the value function itself to construct the targets. It's good to note that this is different from the Monte Carlo case. And the target depends specifically on our parameters, so it's not just any non-stationary regression problem: it's one where we're updating the parameters ourselves, and that is what makes the targets non-stationary. That's something to keep in mind, because this turns out to be important for the stability of these algorithms.

So now we're going to very briefly touch upon control with value
function approximation. This will be relatively straightforward, because we've talked about these things before. Just as a recap: we're going to do some version of policy iteration, and policy iteration depends on two steps, policy evaluation and policy improvement. To start with the latter, for policy improvement we consider something fairly simple. If we estimate action-value functions, we can just use epsilon-greedy policy improvement (you could use greedy policy improvement, but then you wouldn't explore sufficiently). The evaluation step is the difficult one if your state space is really large, and therefore we're going to plug in value function approximation for that step.

So we're going to approximate the value function for state-action pairs in a very similar way as before. If we use a linear function, we could for instance define our feature vectors as dependent not just on state but also on action. Then there is a shared parameter vector w, which is applied to this feature vector and gives you the value estimate. The update is as before; this is extremely similar to what we had before, except that everything now depends on states and actions rather than just states.

In the linear case it turns out there are actually multiple things we could be doing. A different way to approximate the action-value function is as follows: instead of defining a feature vector that depends on state and action, we could have a feature vector that only depends on state. The action-value function is then no longer a scalar function, which is why I bolded it here. So this is the difference with the previous slide. Let me go back to the previous slide first. Here we're doing the same thing as we were doing for state value functions, where we now have a feature vector that depends on both state and action, and then
the q value is just a number: our feature vector x is the same size as our parameter vector w, and the inner product gives you a number. Now we're going to do something different. We have a weight matrix W, and we multiply a feature vector that only depends on state with this weight matrix. The matrix is shaped in such a way that the output has the same number of elements as we have actions. A different way to write down the same thing is as follows, where we still have a scalar function of state and action, defined as indexing the vector that we get from this first operation with the action. So we have a matrix of size number of actions by number of features; when we multiply it with the feature vector we get a vector of size number of actions, and then we just index into it to get the action value for the action we're interested in. This can also be written as follows, where we have a separate parameter vector per action but a shared feature vector. So, to go back again: previously we had one shared parameter vector for all actions and separate features for each action; now we have one shared feature vector and separate weights per action. This is a slight difference, but it might be important in some cases.

The updates are very similar to before. We update only the parameters associated with the action that we're interested in, and we don't update the parameters of all the other actions, because they correspond to different actions. So b here is any action that's different from a, the one we're updating. Equivalently, we can write this down as follows, where i here is an indicator vector that indicates the action: it's a one-hot vector that has as many elements as there are actions, and only the element corresponding to the action that we've just selected has a
value of one, and all of the other elements have a value of zero. Then we take the outer product with the state features, and this gives us the update to our parameters.

Now, this might look a little bit complicated; feel free to step through it much more slowly than I'm doing right now. But before you do that, it might be good to recognize that a lot of these things can be automated. There are a lot of software packages these days that use automatic differentiation, which allow you to compute derivatives without doing them by hand. I want to make you aware of the differences between these methods, because these ways to represent an action-value function really are different. But in terms of computing the updates, we don't have to manually compute these outer products by hand; instead we can just call a gradient function from some auto-differentiation package and use that. Examples include TensorFlow and PyTorch, and I like JAX these days.

Okay, so this raises a question: should we use action-in, where the action is part of your feature representation, or action-out, where we have a state-dependent feature representation and compute the action values for all of the different actions? The action-in approach is when your feature vector depends on state and action and you have a shared weight vector w. The action-out approach is when you have a weight matrix W and features that only depend on state. One reuses the same weights across actions; the other reuses the same features across actions. It's unclear which one is better in general, but they do have different properties. For instance, if we have continuous actions, it does seem easier to have an action-in approach, because you can't have an infinitely large matrix, so the action-out approach seems much more troublesome there. A different way you could deal with that is
to parameterize your policy differently and not have literally a separate value for every action; if you have infinitely many actions, that might be hard. You can do that by having an action-in network, or there are other ways if you want to do something similar to an action-out network, but we won't go into that here. I just wanted you to be aware of these choices, which are in some sense representation choices. We're picking our function class, and even if we restrict ourselves to a linear function class, it turns out there are already design decisions that you have to make, and these might matter for the performance of these algorithms.

For small discrete action spaces, the most common approach these days is action-out. For instance, the DQN algorithm, which I'll come back to at the end of this lecture and which was used to play these Atari games, took this approach. There was a neural network, a ConvNet in this case, which would take the pixels as input and pass them through a couple of convolutional layers and then a couple of fully connected layers (if you don't know what those terms mean, that's okay). In the end it would output a vector with the same number of elements as there are actions. In Atari there are up to 18 different actions, so this is a fairly manageable size: we just output a vector of size 18, and each element of this vector corresponds to the value of that specific action. For continuous actions, it's much more common to have the action-in approach, where the action is given as an input to the network. You of course have to represent it somehow so that the network can easily use it, but then you can do that.

Okay, in the previous lecture we talked about Sarsa as TD applied to state-action pairs, so it inherits the same properties. What we're doing now is moving towards TD algorithms, but it's easier to do policy optimization with Sarsa than with TD
because we have these action values, and therefore we can do policy iteration. So, kind of obviously, we might want to use Sarsa to learn these action values with function approximation, and indeed you can do that. For instance, it was done for this small example, which is not really a game but more of a benchmark: mountain car.

How does this work? The car starts at the bottom, or somewhere random in this valley, depending on which version of the environment you look at, and the goal is to drive up the hill to its right. There are only two actions, or three depending on the version, and these correspond to going full force in one direction, full force in the other direction, or not doing anything at all. This is literally applying force: if you pick "right", that doesn't mean you'll drive with a fixed velocity to the right; it means you're applying a certain acceleration to your car. So the car has momentum, and of course if the car happens to be driving downhill it will speed up, and if it happens to be driving uphill it will slow down. It's a very simple physics simulation. The goal is to go up the mountain to your right, but it turns out the car's engine is not strong enough to do that. So the actual optimal policy is to go left first, then start driving down, and use the momentum from going down to drive up the other hill. This is sometimes used because it's a somewhat tricky exploration problem: if you don't know that there are any non-zero rewards to be attained, it could be quite hard for the car to figure out that it should leave this ravine at the bottom.

What we see here is a value function after a number of episodes, and we see that the shape of that value function changes over time as we add more
episodes. The interesting thing that I wanted you to look at here is not just the evolution of this value function but also its shape at the end, which is quite special. It's not a very simple value function. This is because it depends on where you are: if you're far enough up the hill to the left, then the optimal choice would be to go down and then up the other end, but if you happen to be slightly further down, not high enough up the hill yet, then it could be better to go up further. And this also depends on your velocity. Let's pick the bottom part of the valley: if you're here and you happen to have no velocity, then you should be going to the left, because you should be driving up this hill first and then down and up the other. But if you're here at the bottom and you do have a velocity, at some point you have enough velocity that your optimal policy switches from going to the left to going to the right, straight to the goal. So you can see these discontinuities showing up in the shape of this value function, where you can see these ridges. These are indeed the differences between "am I already going fast enough that I should be going right, and will I then reach the goal very quickly" and "am I too far away and should I be going to the other end, and eventually I'll get there, but my values will be lower because of discounting". So this is a discounted problem, and this is why you can see that the agent basically wants to go as quickly as possible. It's not just because of discounting, though: you could also, for instance, give a reward of minus one on every step and you would see similar discontinuities as well.

So what was the representation here for this mountain car? It's using tile coding, and I'll briefly explain what tile coding is without going into too much detail. Basically what was happening here is
there are these two inputs: there's the velocity of the agent (let me go back to this one) and there's its position, and this was basically cut up into small tiles by discretizing them. The simplest version of tile coding is just state aggregation: you cut up the whole space into little squares and you're done, this is your feature representation. Alternatively, you could cut it up into larger squares but then have multiple of those overlapping. This is similar to what we talked about for coarse coding, where now instead of being in one of these squares you could be in multiple squares at the same time. One such discretization of the state space we could call a tiling of the state space, and then tile coding applies multiple tilings that overlap, but they don't exactly overlap: they're offset from each other, so that you could be in one tile from one tiling but in a different tile from a different tiling, and if you move slightly you transition from one tile to another in one of the tilings but not necessarily in the other. This is very similar to the circles we saw before for coarse coding, so it's basically the same idea but with squares rather than circles. Then, once you have that, you have a feature mapping, you can apply linear function approximation to it, and this was applying linear SARSA to that.

Here's a plot showing the performance of the algorithm, where the y-axis is the steps per episode. So this is not your reward, but it's very related to the reward, except that lower is better: fewer steps per episode means you're doing better. This is averaged over the first 50 episodes and then also over 100 runs to get rid of some of the variance in the results. We can see here an n-step approach where there is an n of one or two or, well, three is not there, but four or eight, and we can see something similar to what we saw
before, where depending on the step size and the n in your n-step return you get a slightly different profile: larger n's tend to prefer smaller step sizes, but for intermediate n's you get the best performance. We saw this before for pure prediction; it turns out to also work for control that these intermediate values tend to work well, especially if you're interested in a limited amount of data. Note that the x-axis is actually our step size times the number of tilings, in this case eight. This is a common thing, because if we do something like tile coding, note that all of your features are binary: they're either one or zero, because you're either in a tile or you're not. So the number of features that are equal to one in your feature representation is equal to the number of tilings, and then it turns out it's convenient to pick your step size in a way that takes that into account, because the magnitude, in some sense, of your feature vector now depends on the number of tilings. This is more generally a useful thing to keep in mind: the magnitude of the features might matter, and if you're just doing the SGD algorithm you might actually want to adapt your step size to take it into account. For instance, if you know the average magnitude of your features, maybe you want to adapt your step size to be lower if that magnitude is higher, or vice versa. Alternatively, a similar but opposite approach would be to normalize the feature vectors. Maybe you don't want to do that on a case-by-case basis, because that might change the semantics of a feature vector, but you could have some sort of global normalization, which basically means that the average feature vector, for instance, has unit length, and that might make it easier to pick your step size. If you don't do this (maybe somewhat obvious in hindsight), if you just consider the same algorithm but you don't take the feature vector into account, then you could
consider the exact same algorithm but with different features, where the only difference between the features is some constant scaling, and then it turns out the performance is different if you don't take that into account when picking your step size. So that's a little subtlety that's good to be aware of. In practice what people often do is tune the step size to get good performance, for instance by just running it for a little bit of time, or running it on some similar problem if you're doing something where you need to do well immediately on the real problem. But often we do these things in a simulator, so you can run this multiple times and just find out what good values are. It's then very useful to make plots like we have here, which is called a parameter study: we're not just looking at the best performance that I could tune for, we're actually looking at how sensitive these algorithms are to different step sizes, n-step parameters, and so on. Okay, that concludes that example.

Now we're going to talk a little bit more generally about convergence and divergence of these algorithms. So when do these algorithms converge? What do we mean by convergence? We mean that they find a solution reliably, that they find some sort of minimum of something. So do they converge when you bootstrap, when you use function approximation, when you learn off policy? It turns out it's a little bit subtle: it's not always the case. Ideally we want algorithms that converge in all of those cases, or alternatively we want to understand when the algorithms converge and when they do not, so we can avoid the cases where they do not. We'll talk about this at some length now.

Let's start with the simplest case, which is the Monte Carlo setting. I talked about this already: I mentioned that this is a sound algorithm, it will converge. So for the Monte Carlo algorithm, if we're not considering the stochastic case, we're just going to consider
the loss first. You can see an equation there which has an argmin over something, but let's first focus on the thing inside. There's an expectation over the policy. This is similar to what we had before; instead of having a d there I now put pi there, to illustrate that the distribution of states, but also the return, depends on your policy in this case. Then we consider this squared loss, which is essentially our Monte Carlo loss, and we want to find the parameters that minimize the squared error. We're going to call that w_MC, w for Monte Carlo. And I'm basically just stating here that the solution to this will be this equation there on the right. This is basically the linear least squares solution, so if you've seen that before this will look very familiar. What we see is that it's essentially the expectation of the outer product of your features, and that expectation is then inverted (this is a matrix, so this is an inverse matrix), times the expectation of the return times your features. That turns out to be the solution that Monte Carlo finds.

We can verify this, so let's step through that. When are we done, when has the algorithm converged? Well, it's a gradient-based algorithm, so we have converged when the gradient is zero, or I should say when the expected gradient is zero, because we're doing stochastic gradient descent. If the expected gradient of our loss, essentially the thing that we're optimizing, is zero, that means that in expectation we're no longer changing the parameter vector. When this is the case, we have reached what is called a fixed point. So let's assume that we don't know what w_MC is yet; we're going to derive it here. We're just going to take the gradient with respect to those parameters and we're going to say that it's going to be equal to zero at the fixed point. So
then we can just write that thing out. The first thing to do is to write out the value estimate explicitly as the linear function that it is, and then we're going to move things around. First we're going to pull the feature vector, which was outside of the brackets, inside. We notice that this inner product times a feature vector can also be written by putting the feature vector in front of the inner product, because the inner product is just a number, so we can move it around. But then we can consider this to also be an outer product times your weight vector instead, because the order of the products here doesn't matter: we could put brackets around the inner product or we could put brackets around the outer product, that's the same thing. Otherwise, the x just gets moved next to the return. So we've just pulled that feature vector inside the brackets, that's all. Then we rearrange: we note that this expectation is just one expectation minus a different expectation, so let's move the minus part to the other side, and then we see that this should be equal to that when the gradient is zero. Now all that's left to do is to take the inverse of that matrix on the left-hand side, which is therefore assumed to exist, otherwise this step is not valid, and then we get our solution.

So when is that step valid? There's a distribution over states there, and there's an outer product of feature vectors, so when is this typically valid? It's typically valid if your state space is large and your feature vector is small; that's one case in which it's valid. Assuming at least that your features are non-degenerate, so you have different features for different states, then if there are at least as many different feature vectors that you can encounter as there are components in your feature vector, this inverse typically exists. So this is the common case.
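Written out, the derivation just described looks roughly as follows. This is a sketch under assumed notation, since the slide itself is not part of the transcript: x_t is the feature vector, G_t stands for the return (a symbol chosen here for illustration), and w_MC is the Monte Carlo fixed point.

```latex
\begin{align*}
0 &= \mathbb{E}_\pi\big[(G_t - \mathbf{x}_t^\top \mathbf{w}_{MC})\,\mathbf{x}_t\big] \\
  &= \mathbb{E}_\pi\big[G_t\,\mathbf{x}_t\big] - \mathbb{E}_\pi\big[\mathbf{x}_t \mathbf{x}_t^\top\big]\,\mathbf{w}_{MC} \\
\mathbb{E}_\pi\big[\mathbf{x}_t \mathbf{x}_t^\top\big]\,\mathbf{w}_{MC} &= \mathbb{E}_\pi\big[G_t\,\mathbf{x}_t\big] \\
\mathbf{w}_{MC} &= \mathbb{E}_\pi\big[\mathbf{x}_t \mathbf{x}_t^\top\big]^{-1}\,\mathbb{E}_\pi\big[G_t\,\mathbf{x}_t\big]
\end{align*}
```

The last line is the step that requires the expected outer product of the features to be invertible.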
Essentially, you will typically have many more states than features, and you will have features that do change when you go to different states, and then the expectation of this outer product will typically be invertible. But it is a technical assumption; otherwise you're not guaranteed to converge. It might actually be that the algorithm still converges, but it might not converge to a point: it might instead converge to a line or a plane if you can't invert that matrix. It depends a little bit on the technical details.

Note that the agent state here does not have to be Markov. This is fully defined in terms of the data, which is kind of interesting. We don't have to have a Markovian feature vector or anything like that. The fixed point here is defined fully in terms of the x_t's: it's not even defined in terms of the agent state, it's defined in terms of the features that you observe. So that's kind of interesting, we didn't need that property.

Now we can see if we can do the same for TD, so let's consider the TD fixed point. I'm going to argue here that TD converges to this different thing. Look at it for a second, and I'll go back to the previous slide: the Monte Carlo solution was this outer product, the expectation thereof, inverted, times the return times your feature. This looks quite a bit different. There's a different outer product here: it's still an outer product, but instead of being the outer product of the feature with itself, it's the outer product of the feature with itself minus the discounted next feature. That matrix is inverted, if the inverse exists; under similar assumptions as I just mentioned it typically exists, but it doesn't have to. And then that's multiplied by the reward times your feature vector: not the return but the reward. So this is a different fixed point. It doesn't just look different, it typically actually is different from the Monte Carlo fixed point. So now we're going to verify again, as we did for the Monte Carlo case, that this
is indeed the fixed point. We're going to consider the TD update, and we're going to assume that the expected update is zero, where the expected update is given as this. This is similar to what we were doing for Monte Carlo: we're just considering the update, asking when this update is zero, and then we're going to manipulate it in a similar way. Here we're again going to pull in this feature vector first. The step size here is actually not that important, I'll get rid of it in a moment. So we put the feature vector next to the reward on one end, and we also put it next to these two scalars, but we put it on the left-hand side, which we can do because these are just numbers. Having put that on the left-hand side, we notice that on the right-hand side we have this w_TD in both terms, so we can pull that outside of the brackets, which means we are left with this gamma times x_{t+1} minus x_t, both of which are transposed because they were multiplying with the weight vector. Okay, so now we can again rearrange things. Let's pull this term to the other side. There's a plus here, so to move it to the other side it must become a minus, but we actually fold that minus inside this little term over here. Note that the order has changed: that's because there used to be a minus in front, and we just pushed it inside and flipped the order instead. In addition, let's pull the transpose outside of the brackets, so we first subtract one vector from the other and then transpose the result to come up with this outer product. And now finally we move this part to the other side by multiplying with its inverse, again assuming that it exists. And we notice that, because the learning rate is inside the expectation and we're inverting (for simplicity consider here a constant step size, just for argument's sake, but you can get rid of it in other ways), we
can see that this step size will actually cancel out with that one, because moving this to the other side by multiplying with the inverse means we're multiplying with one over the step size, which cancels out with this step size. So the fixed point becomes this one. This is just verifying that the fixed point is indeed the one that is listed above, and we have substantial proof these days that this is indeed where TD converges, if it converges.

It's important to note that this differs from the Monte Carlo solution, and typically the Monte Carlo solution is actually the one that you want. Why? Because the Monte Carlo solution is defined as minimizing the thing that we actually want: it's minimizing the difference between the return and the value. This is sometimes called the value error or the return error. The gradient of this is unbiased with respect to using the true value instead of the return as well, but one could also argue that we don't even care about a true value, we just care about having a small error with respect to the random returns; that's also valid. So this seems to be exactly what we want in some sense, and if TD converges somewhere else, this might be undesirable. And I think in general it is quite true that Monte Carlo typically converges to a better value than temporal difference learning does when you start using function approximation. However, temporal difference learning often converges faster, and of course we can trade off again: we can pick an intermediate lambda or an intermediate n and kind of get the best of both worlds. You could even consider changing this over time: you could start with a small n and then slowly increase it over time, and in the limit maybe still come up with the Monte Carlo solution, if you'd like. So there's a bias-variance trade-off here again, similar to what we
had before, but now TD is biased asymptotically as well. In the tabular case this is not the case: in the tabular case both Monte Carlo and TD find the same solution. But if we add function approximation, even just linear function approximation, TD remains biased, which means that it actually finds a different solution in the limit. So if it converges, it converges to this fixed point, and we can characterize that a little bit further by considering the true value error, which is defined as the difference between the true value and our current estimate, weighted according to this distribution over states, where we use d_pi to denote that this distribution depends on our policy. Monte Carlo actually minimizes this value error completely, but TD doesn't. We can, however, say something about what TD reaches: the true value error of TD is bounded in the following way. It's smaller than or equal to one divided by one minus gamma, times the value error of Monte Carlo, which is equal to the minimum value error that you can get under any weights. So what is this saying? This is basically saying that TD is doing maybe a little bit worse asymptotically than Monte Carlo. It's doing something different, it's biased, but we can bound how bad it is. Now how bad is that? There's this term one divided by one minus gamma. How big is that? This depends on gamma, obviously, and there's this useful heuristic that we briefly talked about earlier in the course as well, where you can view this as a horizon: if gamma is 0.9, for instance, then 1 minus gamma is 0.1, and 1 divided by 0.1 is 10.
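For reference, the TD fixed point and the bound described above can be written out as below. This is a sketch under assumed notation, as the slides are not part of the transcript: R_{t+1} denotes the reward, alpha a constant step size, and VE the value error. The squared difference and the sum over discrete states follow the usual textbook definition and are an assumption here.

```latex
\begin{align*}
0 &= \mathbb{E}\big[\alpha\,(R_{t+1} + \gamma\,\mathbf{x}_{t+1}^\top \mathbf{w}_{TD} - \mathbf{x}_t^\top \mathbf{w}_{TD})\,\mathbf{x}_t\big] \\
\mathbf{w}_{TD} &= \mathbb{E}\big[\mathbf{x}_t(\mathbf{x}_t - \gamma\,\mathbf{x}_{t+1})^\top\big]^{-1}\,\mathbb{E}\big[R_{t+1}\,\mathbf{x}_t\big] \\
\mathrm{VE}(\mathbf{w}) &= \sum_s d_\pi(s)\,\big(v_\pi(s) - \mathbf{w}^\top \mathbf{x}(s)\big)^2 \\
\mathrm{VE}(\mathbf{w}_{TD}) &\le \frac{1}{1-\gamma}\,\mathrm{VE}(\mathbf{w}_{MC}) = \frac{1}{1-\gamma}\,\min_{\mathbf{w}} \mathrm{VE}(\mathbf{w}) \\
\gamma = 0.9 &\;\Rightarrow\; \frac{1}{1-\gamma} = 10
\end{align*}
```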
So a gamma of 0.9 roughly corresponds to a horizon of 10. That means that, if your gamma is 0.9, the Monte Carlo error can be 10 times smaller than the TD error. But note that it can be at most 10 times smaller, because this says that the error that TD reaches is smaller than or equal to this thing. In some cases you can have feature representations, for instance, for which TD actually happens to find the same solution as Monte Carlo. This is just saying that in the worst case, if your discount factor is 0.9, the TD error could be 10 times larger; it doesn't mean it necessarily is 10 times larger.

Now, you may have noticed that I already said a couple of times "if it converges". So there's a slight, well, potentially big problem here. There's a fundamental problem, and it's a really interesting one, and it's related to the following fact: the temporal difference learning update is not a true gradient update. This is due to the bootstrapping, which is highlighted lightly in the slide here, because the bootstrapping uses the same parameters as the ones that we're updating with this update, but we're ignoring that in the update. We basically just plug that in as a proxy for the real return and hope for the best in some sense. I'll show you an example where this can go wrong. Normally this is okay, and when I say this is okay I mean it's okay that it's not a gradient update. Sometimes people like things to be gradients, because we understand gradients quite well: we know stochastic gradient descent converges, so if we have something that's not a gradient it is sometimes of some concern. So it's good to appreciate that there's this broader class of algorithms called stochastic approximation algorithms, which includes stochastic gradient descent as a special case, but that's not the only special case that converges, and indeed temporal
difference learning can converge to its fixed point. So stochastic approximation algorithms are a broader class, and there are specific conditions under which these do converge. Stochastic gradient descent always converges under fairly light assumptions, such as that the noise is bounded, the step size decays, your data is stationary, and so on. But temporal difference learning does not always converge, and now I'm going to show you an example in which it doesn't.

This is an example of divergence of temporal difference learning, where we have two states and our feature vector has exactly one element. This element is equal to one in the first state and equal to two in the second state. What this means, as depicted above the states, is that our value estimate in the first state is just w, which is also a single number because our feature vector only has one element, and we're doing linear function approximation here. So our value in the first state is w, and the value in the second state is 2w, because it's the same number w times your feature, which is two. The reward along the way is zero. It's a simple problem, but it's a partial problem: I didn't say what happens after you reach the state where the feature is two. I'm just going to consider this one transition, and then let's see what happens if we apply TD.

Let's step through that. First let's write down the TD equation. This is our temporal difference update for the parameters of the value function, where the value function gradient in the linear case is just our feature. In this case the reward is zero, the value of the subsequent state s prime is the value of that state on the right, which is 2w for whatever w currently is, and the value that we're updating, the value of the state that we came from, is just x times w, where x is 1, so that
will just be w. So we've just filled in these numbers here: this 0 goes over here, this 2w goes over here, or maybe more precisely this 2 goes over there and then the w comes in because we want the value estimates, and then this one goes over here. So there's 2w minus 1w, with a discount factor as well, and we can slightly rewrite this by pulling out the w and getting rid of that 0, which we don't really need to consider. Then it turns out that this update looks as follows: we have our previous weight estimate, plus your learning rate, whatever that is, times two times the discount minus one, times whatever your weight vector is, which here is just a single number.

What does this mean? Let's consider that your weight is positive, let's say you initialize it at one, and let's consider a discount factor which is larger than a half. If the discount factor is larger than a half, this term within the brackets will be positive. If your weight is positive and the term within the brackets is positive, all of this will be positive, and this whole term here on the right will be added to the previous weight, which means our new weight will be larger than the previous weight. But the weight was already positive, so we're moving away from zero. Note that the true value function here would be zero, or rather the true value of that first state might be zero; at least it's consistent with this transition for it to be zero. You can expand this example to actually have a fixed point of zero, but in this case the weight will just continue to grow, and in the limit it will actually reach infinity. What's even worse, it tends to grow faster and faster, because your weight keeps going up and this term multiplies with your weight. So this is an example of problematic behavior of temporal difference learning, and now we can dig into why this has happened, when this happens, how can we
characterize this, and how can we maybe avoid this? What's happening here is that we're combining multiple different things at the same time. We're combining bootstrapping (we're bootstrapping on that state value on the right), we're combining off-policy learning, and I'll explain that a little bit more, and we're combining function approximation. When you combine these three things together you can get these divergent dynamics. This is sometimes called the deadly triad, because you need all three for this to become a problem.

I promised I would say something about the off-policy learning. Why did I say this is off policy? There are no policies here, there are no actions here, so what's happening? Well, there are certain dynamics, and your dynamics would get you from this one state to the other state, but what we're actually doing is resampling that transition over and over, and we're never going to sample any transitions that come from the state that we've just entered. That is off policy in the sense that, in this case, there are no policies that would sample the states over and over again in specifically that way. So what we have instead is an off-policy distribution of our updates. Why this is called off policy is that this can emerge if you're trying to predict the value of a certain policy while following a different policy. For instance, it could be that from this state there are actions, but we're just not interested in their values, and hence they get kind of zeroed out in your update in some sense.

Now, we could also consider on-policy updating by extending the example to be a full example, where we consider this second transition as well. Let's for instance assume that this transition just terminates with a reward of zero, and then maybe you get transitioned back to this first state, so maybe you just see this episode over and over and over again.
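The two-state example can be simulated in a few lines. The following is a minimal sketch, not code from the lecture: the function name `td_two_state`, the step size of 0.1, and the number of steps are illustrative assumptions. It applies linear TD(0) with features 1 and 2 and zero rewards, either updating only on the first transition (the resampled, off-policy case) or also on the terminating second transition from the extended example.

```python
def td_two_state(gamma=0.9, alpha=0.1, w=1.0, steps=50, update_second_state=False):
    """Linear TD(0) on the two-state example (hypothetical helper for illustration).

    Features are x(s1) = 1 and x(s2) = 2, so v(s1) = w and v(s2) = 2w.
    All rewards are zero. Returns the list of weights after each step.
    """
    x1, x2 = 1.0, 2.0
    history = [w]
    for _ in range(steps):
        # Transition s1 -> s2 with reward 0: bootstrap on v(s2) = 2w.
        delta1 = 0.0 + gamma * (x2 * w) - x1 * w
        dw = alpha * delta1 * x1
        if update_second_state:
            # Transition s2 -> terminal with reward 0: the terminal value is 0.
            delta2 = 0.0 + gamma * 0.0 - x2 * w
            dw += alpha * delta2 * x2
        w += dw
        history.append(w)
    return history


# Only the first transition is updated: the weight grows without bound.
off_policy = td_two_state(update_second_state=False)
# Both transitions are updated equally often: the weight shrinks towards zero.
on_policy = td_two_state(update_second_state=True)
print(off_policy[-1], on_policy[-1])
```

With these assumed settings the first run multiplies the weight by 1 + alpha (2 gamma - 1) on every step, so it keeps growing, while the second run multiplies it by a factor below one.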
But now we sample on policy, which means we actually sample the transition going from the second state as often as we sample the transition going to that state. If that is the case, it turns out the learning dynamics in general are convergent: TD learning will converge, also with linear function approximation, if you're updating on policy. I will show that here just for the example, where we can write out the combined update for simplicity. So let's consider this transition and immediately also that other transition, and let's write down both of those updates, where the second transition bootstraps on the terminal state with value zero and also has a reward of zero along the way, so this will basically only retain the value of that state. If we merge these together we get this update to our weights, and note that the term inside here is now negative for any discount factor smaller than or equal to one. That means that if your weight happens to be positive it will be pushed down, and if your weight happens to be negative it will be pushed up. That is exactly what we want in this case, because the optimal weight here is zero (just in this example, not generally). We can see that the dynamics are such that if you're above zero you'll get pushed towards zero, and if you're below zero you will also get pushed towards zero. So this seems to be working, and indeed it has been proven in general that on-policy temporal difference learning with linear function approximation will converge to the optimal values in terms of its fixed point. So the fixed point of TD is still different from Monte Carlo, but if we consider that fixed point to be the goal, TD will find it in the on-policy case. As I mentioned, the multiplier here is negative for any discount factor between zero and one.

Okay, so off policy is a problem and on policy fixes it. Let's dig into one of the other
aspects. I also mentioned function approximation as one aspect of this, so let's see what happens if we don't use function approximation. We can consider a tabular representation, where in each state we do have a feature vector (I've just denoted it as a feature vector), but it's one-hot, picking out exactly that state. If we do that, this is just regression, and this will also converge. The answer might be suboptimal, but no divergence occurs. So again, TD might have a different fixed point. In the tabular case it doesn't actually have a different fixed point; in general it can have a different fixed point. But by playing with the representation we can go from divergent dynamics to convergent dynamics, that's the point here, with tabular being a special case. Importantly, we can still update off policy: we don't need to update on policy. So we can go back to just updating the first state. If we do that, then the value of that state will converge to whatever the discounted value of the next state is, and we would never update that next value. So this is a different way in which you can have a different answer, because we're considering off-policy learning. Of course function approximation can also interact with this, but even in the tabular case it could be that, because you're learning off policy, you simply ignore certain states and therefore never update their values, which might have lingering effects on the tabular values of the other states. So that's something to be a little bit aware of: this can still affect where your solution is, but you won't diverge, you'll still converge to something. So that works too.

We had off-policy versus on-policy, we had function approximation or not, and now let's go to the third one: what if we don't bootstrap, or what if we bootstrap less? Because I said there were three: off-policy learning, function approximation, and bootstrapping. Now let's look at the bootstrapping. One way to bootstrap less is to use multi-step
returns. So let's now consider a lambda return. We're still considering off-policy updates, so we're only sampling this one transition over here in some sense. But instead of calling it sampling that one transition, let's say it a bit more precisely: we're only updating this first state, and we're never actually updating the second state. But now let's consider taking a lambda return from that state instead of a one-step return. This is again something that can happen in the wild in off-policy learning. Then we can write that out. We have the lambda return here, which we know bootstraps slightly on the next state and also slightly continues, but it continues to v of s double prime, which in this case is the terminal state. So s prime here is the second state and s double prime is the terminal state. Now we can fill in all the numbers that we know: we know that the reward on every step is zero, so r and r prime are zero, and also the value of the state after two steps will be zero. We fill in all the numbers that we know and we get this thing, and now we can analyze it. It's very similar to what we had before, but before lambda was basically zero. So before, when we were talking about the divergent example, we had two gamma minus one; now we have two gamma times one minus lambda, minus one. So we have a parameter that we can play with: this lambda parameter. It turns out we can always make sure that this multiplier is negative, which we need here for convergence, if we pick the lambda parameter in a specific way. What we want is essentially that this whole term is negative, which means that we want this term to be smaller than one. Then we rearrange, and it turns out the condition for that is that your lambda is larger than one minus one over two gamma. Just to be clear, this is not a generic finding: this relationship is specific to this example, and we're just
going through the example. For instance, we could fix the discount at 0.9. Then it turns out that, to ensure convergence, you need to pick $\lambda$ larger than $4/9$. You can figure that out by plugging in 0.9 here and doing the math to work out what the trace parameter $\lambda$ should be. This is not a hugely large trace parameter. Typically people pick things like 0.9 or 0.8, so roughly 0.45 is not an unreasonable value to have. If you pick it larger than that, you're good to go and you won't diverge.

These are important things to keep in mind, and one thing I want to highlight by going through all of this is that it's not always binary. The "deadly triad" might make it sound as if combining bootstrapping, function approximation, and off-policy learning means you're always in trouble. That's not actually true. You can combine bootstrapping, function approximation, and off-policy learning as long as you're not bootstrapping too much, you're not too off-policy, and your function approximation is tabular enough, in some sense.

What turns out to be the case is that people run these algorithms in practice quite a lot, and a lot of deep RL algorithms actually combine all three of these things. They combine bootstrapping, off-policy learning, and function approximation, where the function approximation is, for instance, a deep neural network. You are off-policy, for instance, because we try to predict many different policies, or because we're trying to learn from past data, or because we're just doing Q-learning at scale, which is also an off-policy algorithm. And the bootstrapping is there because, for instance, you're doing Q-learning or something similar, but also because of the bias-variance trade-off: we typically find that it's better not to bootstrap too heavily, but also not to use full Monte
Carlo returns. So we're combining all of these, but it turns out many of these algorithms are actually stable and they do learn well, because we're still in the triad but not in the deadly portion of the triad.

An alternative that you could consider is to pick a different loss, so let me briefly go into that. Temporal difference learning has a specific update, and I mentioned that temporal difference learning can diverge if you're doing all of these things at the same time. But why do temporal difference learning? Can we find something else?

The problem with temporal difference learning, as mentioned, is essentially that it ignores that we're bootstrapping on a value that itself depends on the weights. What was happening in the example we had just now, going from a state with value $w$ to a state with value $2w$, was that we're trying to update this state closer to the value of the next state, but in doing so we're actually moving the value of the next state further away from us. So you're chasing your own tail, except your tail is running away from you faster than you're running towards it. That's why TD diverges in that case.

So an alternative would be to define something that is a true gradient. One example of this is the Bellman residual gradient algorithm, where we define a loss which is just the squared TD error. This is a different loss: we don't have our normal value error here. Instead we say, let's just consider the temporal difference error and see if we can minimize that. If you push the gradient through that and calculate it, you find something that looks very similar to TD but has an additional term.

Unfortunately, this tends to work worse in practice, and one reason is that it smooths your value estimates, whereas temporal difference learning in some sense predicts. What do I mean by that? Intuitively, temporal difference learning updates
the value of the state towards something that is in the future: towards this one-step estimate of your value in the future, the reward and the value of the next state. This algorithm instead, and it's a fine algorithm, it can be used, and it's convergent inasmuch as stochastic gradient descent is convergent, does something that is maybe a little bit weird, in the sense that it updates not just the value of the state that we came from but also the value of the state that we're bootstrapping on. In that sense it's a smoother, because it might, for instance, pull down the value of the state you're bootstrapping on just to make it easier to predict. But that doesn't mean that's a valid value for that state.

So it turns out that if you run these algorithms in practice, oftentimes it just works worse. That's maybe not a necessity; maybe it's not necessarily worse. But it seems that although we can define a gradient of this loss, which looks somewhat similar to the TD algorithm, the loss is maybe just not the right thing to be minimizing.

Smooth values are not just bad because your value accuracy might be slightly off; they might also lead to suboptimal decisions. This is especially true if your rewards are, for instance, sparse. There might be one rewarding event, and immediately after getting the reward your value function should maybe drop, because you've already received that reward. But if you do something like Bellman residual minimization, it tends to smooth over that discontinuity in the value function. Hence the states immediately after taking the reward might be updated to look a little bit as if they might still be able to collect some of the reward, which might lead to wrong decisions.

So let's consider a different alternative. Let's, instead of squaring the TD error
inside the expectation, let's just take the expected TD error and try to minimize that. If we minimize the expected TD error, maybe we're still good to go from a prediction perspective. Let me go back a slide to show the one difference, which is where the square is. In the Bellman residual algorithm, the square is inside the expectation. Here, when we consider the Bellman error, the square is outside of the expectation.

Again, we can take the gradient of this, but there's a catch. The gradient looks very similar to the previous algorithm, but there's this weird $S'_{t+1}$ there. $S'_{t+1}$ is a second, independent sample of $S_{t+1}$, so it needs to be sampled separately. If you don't sample it separately, you're actually doing the other algorithm that we talked about before, where $S_{t+1}$ is the actual next state that you've seen. So this algorithm can only be applied if you have a simulator that you can reset: you should be able to sample the transition twice, essentially, and then you could apply it. But that's an unrealistic assumption in many cases. If you're a robot in the real world and you're just interacting with the world, you can't sample the same transition twice, so this algorithm becomes less feasible.

Here's a brief summary of what we were just talking about. There are a couple of different categories. Your algorithm could be Monte Carlo or it could be TD; you could be on-policy or off-policy; and you could pick a certain function class. In terms of summarizing the convergence properties of these algorithms: if you're on-policy, you're mostly good to go. The only thing that's not 100% guaranteed is non-linear convergence, but even for that one we can say some things. In some cases you can guarantee that TD converges in the non-linear case, but it requires more assumptions, so we can't say in general that it always converges. If we go off-policy, we see that
linear function approximation also becomes problematic for TD. Again, I want to stress that this doesn't mean it's always a problem. It just means there are cases in which we can't guarantee convergence. It doesn't mean the algorithm doesn't work in general; it just means that there are edge cases in which things could go wrong. This has been addressed in more recent work that is too detailed to go into in this lecture, where people have devised new algorithms that are slightly different from TD, but might be inspired by TD, and that are actually guaranteed to converge, also with function approximation and also under off-policy updating. But those algorithms are beyond the scope of this lecture and this course.

Okay, so now let's move towards control, now that I've expressed some uncertainty about convergence but also some evidence that it typically does converge. We can extend all of these control algorithms, obviously, to function approximation. The theory of control with function approximation is not that well developed, unfortunately. This is because it's substantially harder than the prediction case: now the policy is under direct control, it might change over time, and that can be quite hard to reason through. Maybe the one exception will be in a subsequent lecture when we talk about policy gradients, which are again stochastic gradient algorithms and therefore in some sense fine. But the theory of control with function approximation when you use value estimates is much harder.

In addition to that, we might not even want to converge during learning, especially when considering value learning, because you might actually want to continue tracking. For instance, if you're doing something like value-based control, your policy will be changing. So your predictions shouldn't converge, because they want to converge to whatever the current value of the policy is, but if the policy keeps
changing, you actually want to track that rather than converge. More generally, even if we're doing direct updates to the policy, it might be preferable not to converge but just to continue to adapt. That doesn't mean that convergence guarantees are not interesting or not important, because if we have a guarantee of convergence, that means the algorithm typically updates stably throughout its lifetime, whereas if you can diverge, you only need to diverge once for everything to fall apart. So understanding when things converge and diverge is still very important, even if you can't fully characterize it.

Okay, now moving on. We're going to consider updating these functions in batch settings. We've talked about batch methods previously, but there's a different reason we're going to talk about them now. Previously, when we talked about batch reinforcement learning, this was basically to highlight differences between temporal difference learning and Monte Carlo learning, but we weren't proposing it as an approach that you should necessarily follow, although we did mention that you could do that. Now we're going to take the other view: what if we really want to learn with batches of data?

A reason why you might want to do that is that gradient descent is a simple and appealing algorithm, but it's not very sample efficient. Intuitively, every time you see a transition, you consume that transition, make maybe an incremental update to your value function, and then toss the transition away. But there might be a lot that you could have learned from that transition that you can't immediately extract, because it doesn't immediately play a role in the update that you're doing, and then that would be a waste, in a sense. So instead we could consider a batch approach, where we try to find the best-fitting value function for a set of data that we've
seen so far. So instead of just doing one gradient update on every sample, we could try to extract as much information as we possibly can from the training data. There are several ways to do that.

I just want to mention that this may seem at odds with our main goal at the beginning, where the world is large and your lifetime is long. Consider a robot walking through the world: maybe you're thinking that this robot can't store all the data. This is true, and I agree with that viewpoint. Full batch settings are most typically employed when the data is not too huge. But it turns out you can also employ them in practice by, for instance, storing only a limited amount of data. This is also related to model-based approaches, where you can consider storing some of the data to be quite similar to storing an empirical model. Then it's just a choice of what the structure of your model is: is it a parametric model, or is it a non-parametric model that consists only of data that you've seen? So there are approaches that feel quite similar to these batch approaches and that do scale to really large problems, where maybe you couldn't store all the data. But for now, for simplicity and clarity, we're going to focus on the case where we really are going to consider all the data. We will also see that there are some methods that don't actually need to store all the data to get the same answer.

We're going to start by considering again the linear case. We have linear function approximation, and we are considering finding the best possible fit of our parameters. We talked about this before, in the limit, for TD. So let's consider that TD update again. What we said is: if you consider the updates to have converged, so that the expected update is zero, then your fixed point will be this weight vector, which we called $w_{TD}$, as defined on the slide. This is the same as we saw earlier. Now, one idea that
you could have is: what if, instead of taking the expectation over all possible states and all possible transitions that could happen, we just take it over the things that did happen? So here we are at some time step $t$, and instead of taking the full expectation, which we don't know because we would need knowledge of the Markov decision process, we're going to take the average of the things that did actually happen. This is very similar, and in the limit, if time grows to infinity and your distribution is stationary (which needs to be the case for this expectation to even exist), this summation will converge to the equation above by the law of large numbers, assuming of course that the variance of your reward is finite, and so on.

So the idea is that, instead of just using this for analysis, we are going to do exactly the same thing on the empirical loss. It turns out this looks as follows, where we see something very similar to what we had for the TD fixed point. Let me walk you through the TD fixed point first. We had the expectation of an outer product of two vectors. The first vector is just a feature vector, and the second vector is your feature vector minus the discounted next feature vector. This is an outer product, so it is a matrix, and the expectation of that will typically be a full-rank matrix under some mild assumptions, so we can invert it. Then we multiply that with a vector. So we have a matrix, number of features by number of features, times a vector of length number of features, and that gives us a weight vector, which also has length number of features, which fits with the linear function approximation.

Now here we're going to do exactly the same thing, but instead of the expectation we have summations. Note that we're not taking averages, because if we
would put a $1/t$ here and a $1/t$ over here, they would just cancel out. So we can ignore that; we can just consider the sum of these outer products and the sum of these vectors over here, and that will be equivalent to taking the average. The summation is over exactly what you would expect, which is this outer product of feature vectors, but it's not for the expected feature vectors; it's for the ones that you've actually seen. And the same thing happens here. This is called least squares temporal difference learning (LSTD), and the solution, this weight vector, is the least squares solution given some data.

We can put that equation at the top of the slide, where I made one change. On the previous slide I called it $w_{LSTD}$, and now I'm just going to call it $w_t$, because in fact it can change as we get more data. $w_t$ is just a data-dependent weight vector, and indeed we could be interested in these predictions at any time step during learning. There might be reasons why you want to extract this prediction. For instance, for control you might want to pick actions, although that's a bit more subtle to combine with least squares temporal difference learning, because in some sense you don't want to consider all the data then, since it was collected with other, past policies. So for simplicity, for now consider the prediction case. But it could be that you still want to know these value functions as you go. So you want to have this weight vector, you want to compute it, and when you get some new data you want to recompute it.

Now, fortunately, we can do that online. Before we do that, let's give these things names. We call the matrix inside the brackets $A$, so that this whole thing will be $A$ inverted. So $A_t$ is just this summation of outer products, and $b_t$ will be the summation of the feature vector times the reward. Those are the names we give them, and then the weight vector at
every time step is defined as the inverse of $A_t$ times $b_t$, that is, $w_t = A_t^{-1} b_t$. Now, we can update both of these online. So we're considering storing both of these quantities separately, updating $A_t$ separately from $b_t$, and then recomputing $w_t$ whenever we need it.

One way to do that, the naive approach and maybe the most obvious one, would be to update the $A$ matrix by just adding the current outer product, the one you've just observed, and to add the reward times the feature vector to the $b$ vector. The reason I call this the naive approach is that it's fairly expensive. It's not expensive because of the $b$ update; that one is very cheap, because it has the same number of operations as you have features. $x$ here is just a feature vector, $b$ is the same size as a feature vector, and we're just adding two vectors together to get our new vector. The update to $A$ is more expensive, because we're adding an outer product of feature vectors, which has size number of features by number of features, to the $A$ matrix, which also has size number of features squared. So adding these together to give us our new $A$ matrix is $n^2$, where $n$ is the number of features.

But that is also not the most expensive operation here. The most expensive operation is that when we want to compute $w$, we need to invert $A$, and inverting $A$ naively gives us a cubic cost, because $A$ will be some dense matrix, and inverting a square $n \times n$ matrix is cubic in $n$. That's expensive and we want to avoid it, because ultimately we want to consider large feature vectors. If you only have 10 features, you could apply this approach: the algorithm would then cost on the order of a thousand computations to compute your weight vector. Updating $b$ would only be order 10 operations, updating $A$ would be order 100 operations, but inverting $A$ would
be order a thousand operations. That's still okay, but if you have a million features this gets out of hand very quickly, and it's not a very scalable approach. So we'd like to do something a little bit cheaper.

It turns out you can do exactly the same thing more cheaply by updating the inverse of the matrix $A$ instead of the matrix $A$ itself. This is called a Sherman-Morrison update. Sherman-Morrison is a more generic result: we can take any matrix, add any outer product, and consider the matrix that results from that operation. It then turns out that the inverse of that matrix can be computed in only a quadratic number of operations if you know the inverse of the matrix you had before. We always have this, because we keep updating it incrementally. You can start, for instance, at something simple like the identity, and then just keep updating incrementally like this.

The operations here are, as you can see, a matrix times an outer product times another matrix. Of course, you could also first compute the product of the vector with the matrix, turning that into a vector, and then you would still have an outer product left. The thing below the division bar is just a scalar, because it's a vector times a matrix times a vector, plus one. It looks maybe a little bit complicated. There are many proofs online of why Sherman-Morrison works; feel free to dig into that if you haven't seen this before.

The update to $b$ remains the same. We don't need to invert $b$; it's just a vector, and you can't invert that. This way we are able to compute this weight vector $w$ at the top of the slide by incrementally updating the inverse of $A$ rather than $A$ itself. This is still quadratic in the number of features, and it applies only to linear function approximation, but it is an approach that scales at least to
some larger numbers than the cubic approach. It is more compute than TD, so in large applications this is still avoided, for two reasons. One is that it's limited to the linear case, as I mentioned. The other is that you might still want very large feature vectors even in the linear case, and if your feature vectors have size a million, then squaring that is still fairly large, so it might still take quite a bit of time on every step. However, it could be that this is feasible for your application, and then it's good to keep in mind. The other reason we talk about this is that it can be an inspiration for other approaches.

In the limit, it's good to appreciate that LSTD and TD converge to the same fixed point. For LSTD this is almost immediate. Let me just go back a couple of slides to show that: as I said, by the law of large numbers this thing will just become that thing, which means that the solutions will also become the same. So LSTD almost immediately converges to the same fixed point as TD, under mild assumptions.

We can extend LSTD to multi-step returns as well: we could consider LSTD($\lambda$) instead of just the one-step LSTD we discussed just now. It can also be extended to action values, maybe quite naturally. You can almost always extend anything that's defined for state values to state-action values by considering SARSA-like approaches. And then, of course, we can interlace it with policy improvement. If we do that, we get least squares policy iteration, where we use LSTDQ to estimate action values in the policy evaluation phase and then we greedify the policy. Of course, when we do that, when we change the policy, we need to be a little bit careful: we shouldn't use all the data collected so far to compute our value estimates. One simple approach could be, for instance, that you just restart the policy evaluation phase whenever
you've made a policy improvement step, throwing away the past values, not trusting them anymore, and just recomputing them for the new policy.

Okay, now we're going to move on to what is maybe a more generic approach, but one that is similar to LSTD in the sense that we're going to consider the data collected so far. So we consider that we have some experience: a data set consisting of all of our experience up to some time step $t$. Now, instead of considering all of these transitions at once, as we were doing with LSTD, we can just sample transitions from this data set repeatedly. We sample a transition, indexed by a time step $n$ that is smaller than or equal to the current time step $t$, and then we do a gradient update with that. If this data was collected under a certain policy and we're interested in evaluating that policy, this seems like a fine approach. You can also combine it with Q-learning, to learn off-policy, and I'll show that in a moment.

The benefit of this is that you can reuse old data. You could store all the data you've seen so far and keep on updating, which means you can take multiple incremental gradient steps on the same transitions, over and over, and thereby extract more knowledge than you could from a single gradient step. This is also a form of batch learning. It doesn't consider the full batch at once; instead it treats the batch as something to sample from. Of course, you have to be a little bit careful if your policy changes. This might be important, because if you're just trying to do prediction, you don't want to use transitions from a policy that has since changed. Alternatively, in a future lecture I'll talk about how to deal with that: how to still use such data but re-weight it appropriately.

Okay, this brings us to the exciting topic of deep reinforcement learning. I will touch upon that only
briefly now; there will be much more about this in later lectures. I'm going to start with a very, very brief recap on neural networks in general. This might be very familiar to many of you already, but it seemed useful to go into it briefly. So let's go to the note-taking app that we've been calling a blackboard, and let's consider what deep neural networks essentially are, as far as we're concerned. There's a lot to say about this, and of course I won't be able to do more than scratch the surface, but let me give you the very bare minimum that might be useful for the purposes of this course.

What have we been considering so far? Let's first start with the linear case. We were considering a value function that is defined as an inner product of some weight vector $w$ with some features of the state. This is a linear approach, which has benefits and downsides, as discussed before. What would be a non-linear approach? Well, for instance, we could have some function, let's say $f$, that takes some number as input and outputs a real number. Let me make it clear that these are both just numbers. For any such function, let's assume that if we apply it to a vector, it is applied element-wise.

So the idea is that we have some function which is non-linear. Popular examples include squashing functions; let me just draw a couple. If these are the zero lines of a graph, then a squashing function might look like this. Another popular choice these days is what they call a rectified linear unit, which has a kink and looks like this. This one is just defined as $f(y) = \max(0, y)$. The benefit is that it's linear in parts, but it has a non-linearity, so you can use it to add non-linearity to your function if you want, and I'll show
you in a minute how. This one is often called a sigmoid, although that's actually the more general term; the one I drew here, or tried to draw here, is a tanh. Those are just specific functions, and the idea is that this is some non-linear function. One way to make the value function non-linear is then just to apply that function. So we could consider something like this, where we have the function applied to the linear value. This is not particularly useful yet, but it's already non-linear, and if $f$ is differentiable then we can still compute its gradient with respect to the weights.

Now, this is not very useful immediately. Why would we necessarily squash the function between $-1$ and $1$, as we do with the tanh? Or why would we say negative values are not allowed, as we do with a rectifier (this one is called a rectified linear unit)? So this is not typically what we do; let me immediately correct that. Instead, what we might do is first apply some matrix to our features, or inputs (sorry, we don't need a transpose there), so we start with a linear transform, apply $f$, and then we multiply again. Let me call this weight matrix $W_1$, and then we have, maybe, a vector $w_2$, and this could be our value estimate: $v_w(s) = w_2^\top f(W_1 x(s))$.

Now note what happens if we want to take the gradient. The gradient with respect to one of the components of $w_2$ will be quite simple. If we consider the partial derivative of $v_w(s)$ with respect to $w_{2,i}$, this will simply be the function $f$ applied to the weight matrix times our feature vector, taken at $i$. Note that the feature vector is a vector and the matrix is a matrix, so the output of this product is a vector. $f$ is applied element-wise, and we subscript the result with an $i$ to denote that we're taking the $i$-th output. There are different ways to denote that; I could also have written this with index notation. So that's
very similar to before, essentially. As far as we're concerned now, in terms of the derivative with respect to the weights in $w_2$, these are just features. Our features are now defined as some function $f$ applied to some weight matrix times the feature vector, and as far as we're concerned those are just our features.

The difference now is that we can also consider, as part of our update, the partial derivative with respect to some element of the matrix $W_1$. Let's consider, for instance, element $i, j$; it's a matrix, so we need to index twice to get one element. This can be computed by the chain rule. What we get is, first, the gradient of $v_w$ with respect to the intermediate thing, this one, taken at $i$. Then the gradient of that thing with respect to the weight, but let's do an intermediate step and first take it just with respect to the input of the function. And then finally we have the gradient of that input with respect to the weight.

Let's unpack this a little bit. The first partial derivative is just the gradient of our value function with respect to the $i$-th feature, in some sense; this will just be $w_2$ at $i$. The next part is just the derivative of our function $f$, whatever that is. The functions above always have a well-defined derivative. The ReLU doesn't actually have a well-defined derivative at exactly zero, but that's okay: we just define it to be zero there, so that the derivative is at least defined everywhere. Sorry, let me move down again. So this will just be the derivative taken at whatever those values were. And then the last bit, the derivative of the weight matrix times the feature vector with respect to that one element, will simply be the feature vector indexed at that one element, $x_j$. So these are all just numbers now. Sorry, this was also taken at $i$. So note what we're actually considering here. Let me use a different color to denote that, here
I've indexed with $i$, which I did here as well. I could also have written that with the weight matrix itself indexed at $i$, so let me write the row as $W_{1,i\cdot}$, which is now a vector. That's a different way to write the same thing. So let me pull that index inside: let me get rid of that one over there and put it over here instead, to make it clearer that we're actually indexing the weight matrix. Now note that the things we're left with at the bottom are all just numbers: this is a number, that is a number, that is a number, and we multiply them together to get the one number that we needed for our gradient. Our gradient is then just one big vector, which merges the contributions for the second weight vector and for the weight matrix. And we can continue stacking this.

It's useful to go through these steps and apply the chain rule a little bit, but you don't have to worry too much about that in practice these days, because whenever we need a gradient we instead apply a software package which will compute that gradient automatically. So you can make these functions arbitrarily complex. For instance, you could have a value function that is some weight vector, let's say at layer $l$ (we can superscript it with $l$, where this is a layer index), times the inputs at layer $l$; let's call those $y_l$. And then we could say the inputs at layer $l$ are defined through some weight matrix (the first one was a vector, so now we have a matrix) at layer $l - 1$, times the features at the previous layer. So we can stack this fairly deeply.

Let me expand it for one specific example and write it down explicitly, because it doesn't hurt to have that in mind. For instance, we could have something like three layers, let's say, and the more layers you add, in the terminology of
deep learning, the deeper we say your network is. This is because you can denote this function as there being an input x, and then there's a weight matrix here, and then we get... let's give this a name (sorry, I ran out of space there at the end), let's call this one i1. Then this could be i2, and then you could do that again for i3. We could denote this as well by showing that there are lots of connections here, since this could be a dense matrix operation, and then in the end maybe it goes into a single number, where this is weight three. So the idea is that you can stack these. People use all sorts of notations to denote these neural networks, sometimes as simple as something like this, and sometimes people draw more complicated figures which show the dimensions a little bit better.

As you can imagine, you can make these things fairly complex and weird-looking as well. You could have all sorts of structures which merge in strange ways, because maybe you want to put some inductive bias into the structure of your value function. This can be useful in some cases: it is a way to put, for instance, some more prior knowledge into your function class, if that is helpful. It could also be that you have multiple inputs that come in at multiple places. For instance, you could have vision inputs and auditory inputs, where the vision input needs a certain preprocessing and the auditory input needs a different preprocessing. So you could have all sorts of inputs: maybe this is some part of your observation, this is a different part of your observation, maybe this is yet another part of your observation, and they finally all come together to give you your value estimate. And then all the parameters here in your network... actually, let me do that in a different color, to make it clear that this is not part of the network now. Oh, sorry, that didn't actually change my color.
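As a rough sketch of the stacked layers described above, the following builds a three-layer value function and computes its gradient with respect to every weight by applying the chain rule backwards through the layers. Several things here are assumptions made only for illustration: the nonlinearity is taken to be tanh, the layer sizes and the names `forward` and `value_gradient` are made up, and the gradient is written by hand only to keep the sketch dependency-free. In practice an auto-differentiation package would produce the same numbers.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def forward(params, x):
    """Return the hidden activations and the value v = w3 . tanh(W2 tanh(W1 x))."""
    W1, W2, w3 = params
    h1 = [math.tanh(a) for a in matvec(W1, x)]   # first hidden layer
    h2 = [math.tanh(a) for a in matvec(W2, h1)]  # second hidden layer
    v = sum(w * h for w, h in zip(w3, h2))       # final layer: a single number
    return h1, h2, v

def value_gradient(params, x):
    """Gradient of v with respect to all weights, via the chain rule."""
    W1, W2, w3 = params
    h1, h2, _ = forward(params, x)
    grad_w3 = list(h2)
    # d2[i] = dv / d(pre-activation i of layer 2); tanh'(a) = 1 - tanh(a)^2.
    d2 = [w * (1.0 - h * h) for w, h in zip(w3, h2)]
    grad_W2 = [[d * h for h in h1] for d in d2]
    # Push the signal one layer further back through W2.
    back = [sum(W2[i][j] * d2[i] for i in range(len(d2))) for j in range(len(h1))]
    d1 = [b * (1.0 - h * h) for b, h in zip(back, h1)]
    grad_W1 = [[d * xj for xj in x] for d in d1]
    return grad_W1, grad_W2, grad_w3

# Made-up sizes: 3 inputs -> 4 hidden units -> 2 hidden units -> 1 value.
rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

params = (random_matrix(4, 3), random_matrix(2, 4),
          [rng.uniform(-1.0, 1.0) for _ in range(2)])
x = [0.5, -1.0, 2.0]
grad_W1, grad_W2, grad_w3 = value_gradient(params, x)
print("value:", forward(params, x)[2])
```

The backward pass visits each layer once and does roughly the same amount of work as the forward pass, which is why computing the gradient costs about as much as evaluating the network.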
Yes, it doesn't seem to want to change my color, so let's go like that. All of those lines there are weights, so we just compute the gradient with respect to those weights, and typically these days, as I mentioned, we use an auto-differentiation package to do that. So you can basically construct these fairly complicated networks without having to worry about computing all of these gradients by hand; instead the software will do that for us. And it turns out these gradients can always be computed, if you're using differentiable functions of course, with an amount of compute that is comparable to the amount of compute it takes to do your forward pass. So if you can do the forward pass of this whole network in some limited amount of time, then computing the gradient will take a time that is proportional to that, so it doesn't take much longer to compute the gradient. Okay, that's deep learning in a nutshell.

I want to say one more thing, which is about time. Going from time step to time step, you could consider having a state here, this is our agent state now, and let's go a time step further to the next agent state. These depend somehow on our observations, and maybe the agent state is then used in some complicated way (maybe it also results from your observation in some complicated way) to get our value, and maybe to get our policy, and maybe other things as well. Now this step, this agent update function... let me actually put it somewhere else, let me put that here, because the agent update function in some sense merges that arrow and this part: it merges the observation and the previous state into a new state. That's the definition of the agent update function, and here we see how we could structure that in our implementations as a neural network, where there is now a gradient that might flow (let me use blue to denote the gradient flow). So if we want to update the
prediction over here, the gradient might flow into the parameters here, into the parameters looking at the observation, and also maybe through time, maybe even further back. So there could have been an earlier state (let's use a primed S to denote the previous one), and the gradient could flow all the way back through time. Now, this is nice and potentially beneficial, but it's also potentially problematic, because it means that if you actually want to have the gradient flowing all the way through time, you need to do something smart about how that gradient flows. That's beyond the scope of this course; that's more for advanced deep learning considerations. There are mechanisms that allow you to do that, for instance by just truncating the gradient flow at some time step in the past and saying: well, I'm just going to do a couple of steps. That's called backpropagation through time, or truncated backpropagation through time when you cut it off like this. Backpropagation in general is just the term that refers to computing those gradients going backwards through these nonlinear structures. Okay, so that's a very brief explanation of deep learning. Some of that might be completely known to you already if you've previously taken a deep learning course. Here I just wanted to highlight that this is one way to build these structures, and that you can then use them to compute your value functions.

I want to mention one more little thing, which is kind of the same as what I said before, which is what convnets are. The structure that we've seen before (let me highlight that again) is a nonlinear network. A special case of this is when you have vision inputs, which are structured, and sometimes drawn like this, where these are kind of the pixels on the screen. So maybe you have some sort of a Pac-Man thing here, maybe trying to eat the ghosts. Of course, normally you'd see much more of the Pac-Man screen, but let's just do a simple example; maybe there's a dot here that it's eating as well. So this could be the pixels of the screen,
and in fact, typically you could have multiple layers of these, because for instance you could have an RGB input where one of these layers contains the red, one of them contains the green, and so on. What we can then do is apply a linear layer to this and pass it through a nonlinearity, exactly as before. You could take all these pixels, put them into some sort of large flat vector, and then just do the thing that we did before, but it turns out that doesn't work that well, so we typically don't do that. Instead, what we typically do is apply a special linear layer which looks at a little patch, and typically looks at that patch across the channels (these are channels). So it takes this little patch that goes through the channels, in a sense turns it into a vector, and then applies a weight matrix to give us the next patch, which then gets turned into a spatial patch again, and potentially again you could have multiple layers of those. In fact, what we typically do is we don't actually turn it into a patch (so I'm going to backtrack on that slightly), we just turn it into a number, a single pixel, and then we have stacked layers of those: this is just a vector, but I'm stacking it, and then we compose those again into some larger image. The number of channels here (sorry, not layers, channels) does not have to match the previous one, so we could have more channels here, and the new image could in some sense be smaller. This is called a convnet. What happens in a convnet is that the same weights get applied to different patches, so also over here. That's called weight sharing in general, and convnets use weight sharing. So essentially what's happening here is that you could consider this to be kind of a filter, and this filter gets moved across the image, and basically on every patch in the image
the same filter gets applied, and it extracts some feature information which it passes into a new feature layer. Then we get something that's sometimes drawn a little bit like this, where as mentioned we could have multiple planes here, and then this goes into other planes, where the number of planes could be different (let me just draw three this time, or four, I'm just making it up as I go along). This one patch here would show up as a single pixel in this new thing, but then there's an overlapping patch, for instance here, that would show up as the pixel just next to it, and so on and so on. Then we can just do the same thing again, and there could be a new patch here that goes into a layer above that, so you could apply this multiple times. Similar to before, these are just linear layers, and then we put nonlinearities basically just before computing this pixel over here. So this would still have a nonlinearity on top of it, because otherwise it's a linear network and everything kind of collapses to a linear function approximator, whereas nonlinear function approximation can be a richer class. So instead of necessarily being linear in the observations, this could now be nonlinear in the observations. You can apply this many, many times, and then eventually you could flatten it into a vector and maybe compute your value function on top of that.

So that is all that a convnet is. It's exactly the same, except that the linear layer has a certain structure. One way to think about it is that this matrix here at the beginning could be a conv layer, and that just means that the matrix has a very specific structure where the same weights show up in multiple places in the matrix (there's weight sharing), and in addition there is a specific spatial structure where we're only considering these patches. Okay, so that's just a brief primer. There's much more to be said about this; this is kind of a poor explanation because we're only taking
a little bit of time to talk about these aspects, so I just wanted to give you a flavor of it. If you don't fully understand it, there's loads of introductory material on deep learning out in the wild: there are courses on this and lots of material on the internet, so if you want to know more, I encourage you to dig into that. If you first want to continue with the rest of this course without fully understanding all of it, that's also perfectly okay.

Okay, so using these networks in reinforcement learning is called deep reinforcement learning. Fortunately, many of the ideas immediately transfer when using deep neural networks. We can do temporal difference learning, we can do Monte Carlo; we just replace the feature vector with the gradient in our updates. We can apply double learning if we need it, for instance double DQN or double Q-learning, we can use experience replay, and all sorts of other intuitions and results do transfer. Some things don't. For instance, UCB is hard to transfer. As a reminder, UCB is an exploration method that we discussed in the second lecture, and it requires counts: you need to know how often you've seen a certain state or state-action pair if you want to use UCB. It turns out it's very hard to count with deep neural networks, or at least in the way that we need for UCB, because of generalization. Deep neural networks tend to generalize really well, and this makes them learn really quickly, and people are still trying to understand the exact properties of these networks, because it also depends on which optimization mechanism you use: these days people typically don't use vanilla stochastic gradient descent, but apply gradient transformations on top of that. Whether you do that or not, it turns out it's just very hard to count states with these neural networks in reinforcement learning, and therefore it's hard to apply some sort of bound like UCB does, although people have considered
extensions that do seem to be somewhat reasonable, but it's still active research. Least squares methods also don't really transfer that well, because they're specifically made for the linear case, where you can analytically compute the solutions. In the generic case of nonlinear functions it's really hard to compute those solutions, because there are nonlinearities involved. That means we can't just immediately write down the optimal weight vector that fits a specific data set, and instead what people typically do is use gradient methods to find it incrementally. So instead of doing least squares, people do experience replay.

Let me now walk you through an example. An online neural Q-learning agent that combines these approaches might include a neural network which takes your observations and outputs, say, action values. So the idea is that this is just a nonlinear function that looks at the pixels of the screen, let's say of Pac-Man, has maybe some conv layers to first do the spatial reasoning, then merges that into a vector, maybe has a couple of other layers, and then at the end outputs the same number of elements as you have actions. We then just index that with the action we're interested in if we want a specific action value. Note that I have ignored agent state here, so the agent state is basically just your observation in this case. This agent wouldn't have memory, and maybe that's okay for some games like Pac-Man. Then we need an exploratory policy. For instance, we just do epsilon-greedy on top of the action values that we have, so that the action we actually select will be greedy with respect to these action values with probability one minus epsilon, and fully random with probability epsilon. And then we have a weight update for the parameters of this network, where we use, in this case, gradient Q-learning. You might recall the Q-learning update; let me just
briefly mention it again. It's going to be your immediate reward, plus the discounted value of the maximum-value action in the next state, minus our current estimate, times the gradient of that current estimate. That would be a gradient extension of Q-learning. I did mention a couple of downsides, like overestimation; you could plug in solutions for those here as well, this is just to give you a simple example. Then we could use an optimizer. Typically what people do is take this update according to SGD, this gradient (in some sense it's a semi-gradient, actually, because the weights also appear in the bootstrap target here, but that is being ignored), and then transform it, for instance by applying momentum, so that you take a little bit of the previous update and merge that in, or by using fancier optimizers like RMSProp and the Adam optimizer, which are quite popular these days. It turns out this often works better when using deep neural networks. The details are not too important for us right now.

As I said, we often use these auto-differentiation packages these days, so what you'll quite often see in code is that people implement this weight update via a loss. I'm mentioning this for awareness, because it makes it easier to understand other people's code, but in some sense it's a little bit of a strange thing, because there's this really strange operator here, which is called a stop gradient in most software frameworks. Basically what people do is say: let's just write it down as if it's a loss, but let's ignore the gradient flowing into this part. If you take the gradient of this thing but ignore the gradient going into this part, you get the update that we had above. This kind of makes it explicit that this is not a real gradient algorithm: we actually have to manually specify, yes, take the gradient, but don't push the gradient into that part. And if you do push the
gradient into that part, you're doing one of these Bellman residual minimization algorithms instead, and as I mentioned, that tends to work slightly worse in practice. You could play with this: try it out and see if it works worse for you as well. So people implement this as a loss, so that they can say: just minimize that, just apply stochastic gradient descent to that. Alternatively, of course, you could just compute this temporal difference error, this Q-learning temporal difference error here, and compute the gradient with respect to the action values. Then you don't need the stop-gradient operation, and you could also implement it like that. What we're basically doing there is akin to doing one step of the chain rule by hand, a very simple first step, to construct the update directly instead of first formulating it as some sort of loss. Because of the stop gradient, it's not a real loss. People do call these things losses even if there are stop gradients; I'm just hesitant to do that, because we're not actually following the gradient with respect to the parameters if there are stop gradients involved. It just happens to have the right gradient if you compute it with an auto-differentiation package. That's just for awareness, essentially, if you want to implement this.

Now we can finally move to the DQN algorithm. I showed examples of this all the way at the beginning of the course, where we saw this agent playing Atari games, and now we've finally accumulated enough knowledge that we could actually replicate that. So I'm going to explain what is in that algorithm, and it's very similar to what we saw just now. There's a neural network, exactly as I said just now. The observation here is a little bit special: the observation in DQN is not just the pixels on the
screen. It actually takes a couple of frames from a couple of steps into the past, and also does some operations on them. The original DQN algorithm would downscale the screen slightly for computational reasons, and it would grayscale it instead of keeping the color information. This actually throws away a little bit of information, and in some recent approaches people don't do that anymore. In addition to that, it would stack a couple of frames, so it has some slight history. There are two reasons for that. One is that, for instance in a game like Pong, where you're hitting a ball from one paddle to the other, it was thought to be important to be able to capture the direction of the ball, because if you just know where the ball is, you don't necessarily know whether it's going left or right. In the game of Pong that might not actually be that consequential, it might not be necessary to know that for picking the right action, but in other games you could imagine that this information is sometimes quite useful. But there's a different reason as well, which is that in these Atari games, in the simulator at least, and maybe in the original games as well (I don't actually know), the screen would sometimes flicker, so certain enemies might be there on one frame but not on the next. Therefore it was important to consider a couple of frames rather than just one, because the network otherwise doesn't have any memory. A different way to say this is that the inputs here are not the raw observations; there's a very simple handcrafted agent update function which looks at the observations, which could be considered raw frames, and then stacks and processes them appropriately to give you an agent state. But there's still no long-term memory. Of course, there's a need for an exploration policy, and in DQN this was literally taken to be epsilon-greedy with a
small epsilon like 0.05 or 0.01. And there was a replay buffer where you could store and sample past transitions. This replay buffer didn't store all of the transitions; instead it stored something like the last 1 million transitions, so when you go beyond 1 million transitions it would start throwing the oldest ones away. In addition there's a subtle other thing, which is called a target network, and this was one of the innovations that made DQN work more effectively. We're going to define a new weight vector that contains all of the weights of a network of exactly the same shape and size as the neural network that defines our q-values, and we're going to use that to bootstrap. So the weight update is exactly the same as in the online Q-learning algorithm that we saw on the previous slide, except for this one change, where there's a w minus here. W minus is just a name: the minus doesn't mean it's negative or anything, it's just a name to say this is a different weight vector from the one that we're updating, w. So if we're not updating w minus, how does that one update? Well, it is just updated by occasionally copying in the online weight vector. What this means is that during learning you're basically regressing on a different value function for a while, keeping it fixed for, say, 10,000 steps, and then you copy in the online weights, so that it does occasionally get better over time. This makes the learning target a little bit more stable, and it was helpful for learning in DQN. This might sound a little bit familiar from when we talked about double learning, but this is not quite double Q-learning yet, because it is still using the same parameters, this w minus, both to pick the action and to evaluate it. Maybe something for you to think about: how would you then implement double learning in this framework? What would that look like? And then there
was an optimizer applied to actually minimize the loss. In the original DQN this was RMSProp, so this update would be passed into the RMSProp optimizer, which also stores some statistics of past gradients, and would then compute the actual update by taking all of that into account. These days people still typically use RMSProp or Adam, or variations of these.

The replay and the target networks are designed to make the reinforcement learning look a little bit more like supervised learning. The reasoning behind that was essentially that if you make it look more like supervised learning, it's more likely to work, because we already know that deep learning works quite well for supervised learning, for regression and classification on supervised data sets. The replay kind of scrambles the data, so that the sampling seems more i.i.d. than if you would just feed in all of the samples sequentially, and the target network makes the regression target less non-stationary from one time step to the next, and this might help the learning dynamics. Neither of those is strictly necessary for good learning, but in the DQN setup they did help performance, and they are still sometimes used in modern algorithms as well, sometimes to much effect.

Okay, so that brings us basically to the end of this lecture, where these last two things give an intriguing insight into what makes deep reinforcement learning an interesting field. Using replay and target networks to make the reinforcement learning look more like supervised learning was done, of course, because we're combining it with deep learning. So in some sense you could say this is deep-learning-aware reinforcement learning: we're changing the reinforcement learning update a little bit with these target networks, and we're using experience replay deliberately here because this makes better use of the deep networks that we're using.
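The ingredients described above (a bounded replay buffer, epsilon-greedy exploration, and a separate weight vector w minus that is only occasionally synced) can be put together in a minimal sketch. This is not the original DQN implementation. To stay self-contained it assumes a linear action-value function as a stand-in for the neural network, plain stochastic gradient steps instead of RMSProp, and made-up toy transitions; all function and class names are hypothetical. The update is built directly from the temporal difference error times the gradient of the current estimate, so the bootstrap target is simply a constant and no stop-gradient operator is needed.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores only the most recent `capacity` transitions."""
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # the oldest entries fall out automatically

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, rng):
        return rng.choice(self.storage)

def q_values(w, state):
    """Linear stand-in for the network: one weight row per action."""
    return [sum(wi * si for wi, si in zip(row, state)) for row in w]

def epsilon_greedy(w, state, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(w))
    q = q_values(w, state)
    return max(range(len(q)), key=lambda a: q[a])

def q_update(w, w_minus, transition, step_size, discount):
    """One semi-gradient Q-learning step that bootstraps from w_minus."""
    s, a, r, s_next = transition
    target = r + discount * max(q_values(w_minus, s_next))  # treated as a constant
    td_error = target - q_values(w, s)[a]
    # For this linear q, the gradient w.r.t. row `a` is the state vector itself.
    w[a] = [wi + step_size * td_error * si for wi, si in zip(w[a], s)]
    return td_error

def copy_weights(w):
    """Snapshot of the online weights, used as the target weights w_minus."""
    return [list(row) for row in w]

# Toy usage on made-up transitions (2 actions, 2 state features).
rng = random.Random(0)
w = [[0.0, 0.0], [0.0, 0.0]]
w_minus = copy_weights(w)
buffer = ReplayBuffer(capacity=100)
for step in range(1, 201):
    s = [1.0, 0.0] if step % 2 else [0.0, 1.0]
    a = epsilon_greedy(w, s, epsilon=0.05, rng=rng)
    r = 1.0 if a == 1 else 0.0
    s_next = [0.0, 1.0] if step % 2 else [1.0, 0.0]
    buffer.add((s, a, r, s_next))
    q_update(w, w_minus, buffer.sample(rng), step_size=0.1, discount=0.9)
    if step % 50 == 0:          # occasionally copy in the online weights
        w_minus = copy_weights(w)
```

Between syncs the regression target only changes through the sampled transition, not through the weights being updated, which is the stabilizing effect attributed to the target network. The capacity of 100 and the sync period of 50 are arbitrary toy values, far smaller than the figures mentioned in the lecture.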
The converse also exists: there are also things that could be called reinforcement-learning-aware deep learning, where people sometimes change, for instance, the network architecture or the optimizer in certain ways to play better with the fact that we're doing reinforcement learning. The fact that there are important design considerations that could be called deep-learning-aware reinforcement learning or reinforcement-learning-aware deep learning is exactly what makes deep reinforcement learning in some sense a subfield of its own, with really interesting research questions that are not pure reinforcement learning questions and not pure deep learning questions, but that sit at the intersection of these two really exciting fields. Okay, that brings us to the end of this lecture. As always, please ask questions on Moodle if you have any, and in our live Q&A. Thanks for your attention.

Hi, and welcome to this seventh lecture in this course on reinforcement learning. My name is Hado van Hasselt, and today I will be talking to you about function approximation in reinforcement learning. In terms of background material, chapters 9 and 10 from Sutton and Barto are highly recommended, and we'll cover some material that's captured in chapter 11 as well. In terms of context, we are always considering this interaction loop where there's an agent interacting with the environment by picking actions and observing the environment. Reinforcement learning can then be considered the science of learning how to make those decisions: how can these agents deal with this interaction stream and somehow pick actions that give them larger and larger rewards? The rewards, not depicted in the figure, could be inside the agent's head or they could be part of the observation, depending on how you want to model your problem. Inside the agent, in any case, there will be some sort of policy that determines the actions, and in addition the agent might
have value functions or potentially models as well. The general problem involves taking into account time and consequences, because the action influences the immediate reward, but also potentially the agent state and the environment state, which means that certain actions might have very long-term consequences. In addition, these consequences might not be immediately obvious. For instance, an action might change some internal state variable in the environment, and it might not be immediately observable what that means or why it matters, but it might influence later observations or rewards.

Now, in this lecture we're going to talk about function approximation, and first we're going to talk about why we need it. First, we're simply pointing out that the policy, the value function, the model, and the agent state update, all of these things that are inside the agent, can be viewed as functions. For instance, a policy maps the agent's state to an action, or to a distribution over actions (a stochastic policy) from which you can sample an action. A value function maps a state to a value, which is an estimate of the cumulative reward into the future. The agent state update itself maps the previous agent state and the observation into some new agent state. So these are all functions, and we want to learn them from experience. Why do we want to do that? Multiple reasons, one of which is flexibility. As we discussed in the very first lecture, where I quoted a paragraph from a paper by Alan Turing, he argues that maybe it's easier to program something akin to the mind of a child than something akin to the mind of an adult, because it may be easier, and also more flexible, to write something that can learn from experience than to try to list all of the rules you may have learned from experience. In addition, if there are too many states, we will need some sort of approximation. This is a separate point, because this is no longer
just about learning; it is really about function approximation being important. When we use neural networks to represent these functions, the subfield is often called deep reinforcement learning these days. That term is actually relatively novel, it's like seven or eight years old, but the combination of reinforcement learning with neural networks is in some sense quite a natural one, and it is actually fairly old: it has been suggested since at least the 70s. In this lecture we'll concretely consider predictions, so value functions, including some value-based control, so using these predictions to come up with policies. In upcoming lectures we'll talk more about off-policy learning and about approximate dynamic programming, which is the term we use to refer to the more generic field of study where we're not necessarily sampling, but just considering what happens if you have approximations in your dynamic programming algorithms. This could include sampling, it could include function approximation, but the approximations could also come from other sources. This will allow us to talk about these updates in a theoretical sense and analyze what they are doing; this will also be an upcoming lecture. We'll also talk about learning policies directly. As I mentioned, the policy itself can be considered a function, so you could consider updating that function with policy gradients. We briefly talked about those in lecture two, and we'll come back to them in a later lecture. And we'll talk about model-based RL as well. In this lecture we're focusing on predictions, because it's nice to be able to focus on something when we talk about function approximation, but a lot of the insights do transfer to these other cases, as we will see.

In terms of motivation, why do we want to do this? Because we're ambitious: we want to solve real problems, we want to solve large problems. And actually even small problems are large problems, because the game of backgammon, one
could argue, is not particularly large compared to the real world (it's just a game), but it turns out that backgammon already has 10 to the 20th different possible states. So if you try to enumerate all of those in a large table in memory, you quickly run out of space for these types of problems. If we then go to a different game, like the game of Go, we notice that this has 10 to the 170 states. That's an enormous number: you can't feasibly store that in the memory of a computer and then try to learn a value function for each of these separate states. It can get even worse. For instance, consider helicopter flying, in which the state space could arguably be taken to be continuous. The state could, for instance, include the position of the helicopter in space, so there's a three-dimensional location, but in addition there might be other inputs as well, such as a wind sensor that tells you from which direction the wind is blowing and how fast. These inputs could all be real-valued numbers, which means that the state space is inherently continuous, which means that if you wanted to enumerate the states, there would be infinitely many. And as maybe the ultimate example, you could think of a robot in the real world, where the environment is as big as the real world itself. In that case it's obviously impossible to enumerate all of the states, because any computer that would have to do that would have to live in the real world, and hence would necessarily be much, much smaller than the world is. This is a common thing that we have in mind when we think about reinforcement learning: we want algorithms that can be applied to settings where the environment is just humongously large and the agent must necessarily be a lot smaller. So the mind of the agent, including its memory capacity, must necessarily be smaller than the environment. There are of course small
problems, small-ish problems, in which you could still enumerate all of the states, and in that case maybe you could just try tabular approaches, but in general we want these algorithms to also apply to the larger problems. So this is our motivation for considering function approximation, and the intent is of course to apply the methods that we've already been talking about, for prediction and for control.

So let's talk a little bit about value function approximation now. So far we've mostly considered lookup tables, although I've already mentioned linear function approximation and other things in earlier lectures. In the tabular case, each state has a separate entry, or each state-action pair has a separate entry, and we just update that. The problem with that is that there are too many states, as I mentioned before, so you cannot fit the table in memory. It turns out that's not the only problem. A different problem is that it's also too slow to learn the value of each state individually. Think of it this way: if you have a problem that is not too large, let's say there are a million different states, we can store a million numbers in modern computers, that's not a problem at all. But if you update each of these state values individually and separately, it would still take quite a bit of experience to learn reasonable value estimates for each of these states. And indeed this feels inefficient if these states have some commonalities, which we would then be completely ignoring. If these states can instead be viewed as being similar to each other, then maybe you would expect, or want, the value of a certain state to be updated when the value of a very similar state is updated. This is called generalization. In addition to those two points, individual environment states are often not fully observable. So the environment might not just be big, it might also not be fully observable from your observation stream, so we need to somehow infer what
is going on in some sense, or at least up to the point where we can make meaningful predictions or pick meaningful actions. So our solution for all of these problems essentially will be to introduce function approximation. For now we're just considering the predictions: as mentioned at the beginning, we're going to consider value prediction in this lecture. What that means is we're going to introduce some parameter vector w, as before, where now the value function is parametrized with this w. It's a function of state somehow, where the state is the agent state, and we're either approximating the value of a certain policy or maybe the optimal value function. The algorithms we'll get to in a moment, but the idea is that we're going to approximate these and then update the parameters, for instance using Monte Carlo algorithms or TD learning or any other similar algorithm that you might want to use. And then hopefully, if we pick our function class correctly, and we'll talk about that more, we will be able to generalize to unseen states. In any case, if you have such a function, the values of these unseen states will be well defined. You could of course also have this for a table: if you know all the states that you could possibly end up in, you could at least initialize the values to be, for instance, zero, arbitrarily. But if you have a function class that automatically maps every possible input to a valid output, then you automatically have this mapping and you have a valid value estimate for every state.

In addition, I just want to briefly talk about the agent state, but I'll not go into a lot of depth in this lecture. I'm just going to mention it because it's very, very important, but we'll get back to that in a subsequent lecture. So I mentioned that one of the problems might be that the environment state is not fully observable, so essentially the environment state is not equal to the observation. Then, as always, we're going to rely
actually on the agent state instead. And here, briefly, is the agent state update function, which we could also consider to be a parametric function of its inputs, which are the previous agent state, the action, and the observation, and potentially the reward as well if that's not part of the observation. It has some parameters, here denoted omega. Now, we already discussed that you could pick this just to be, for instance, the observation as the agent state, at which point omega is maybe not relevant, maybe there are no parameters for that function. But in general that's not particularly useful, right? If you look at just the observation, you might not see everything that you need to see. So this agent state, which we here denote with a vector notation, though we will continue to use the random variable notation as well, should contain everything that's useful for you to make value predictions or to pick your policy. That means that the agent state update function is actually quite important and might need to capture things like memory. For instance, it might be that you have to remember what you've seen before. We talked about an example in the first lecture where a robot has a camera and can only look in front of itself, it can't see what's behind it, but maybe it needs to remember what is behind it to pick suitable actions. This would be captured in the agent state, and maybe we should think about learning that. In this lecture we won't go into that, but I just wanted to remark on it, and also to be clear that in the subsequent slides, whenever I have a state, it's always going to be the agent state, and this is basically just going to be some vector, or the observation.

Okay, now we're going to talk about a couple of different function classes in a little bit more detail, just to talk about the differences between them. We started by talking about the tabular case, where we just have a table, so our function class here is
tabular, and then one step up from that would be state aggregation. In a table we have an entry for every single state. For instance, let's consider for simplicity that the environment is fully observable, so you literally observe the environment state, and you just have an entry for every environment state. In state aggregation you would instead have an entry that merges the values of a couple of states. So basically we just partition the state space into some discrete, potentially small, set, so that we can learn the values of these partitions instead of learning them for every individual state. And then indeed we would get some generalization, because all of the states in the same partition would be updated with the same algorithm and hence would obtain the same value. So if you reach a new state in a certain partition that you haven't seen before, it would automatically already have a value, and it might be a suitable value.

Now, one step up from this will be linear function approximation. I want to for now consider a fixed agent state update function and then a fixed feature map on top of that. This is why things will be linear: we're going to assume that the agent state update function doesn't have any parameters, for instance because you just pick the observation or apply some other fixed function, and the feature map doesn't have any parameters, and instead all that we're going to learn are the parameters of the value function, which uses those features as input. So there is still an agent state update function, it might be a simple one, but we're just going to ignore that in terms of the learning process. Now note that state aggregation and tabular are actually special cases of this more general class. In the tabular case we could just consider the feature vector to have as many entries as there are states, and then the entry of the current state would be one whenever you're in that state and all of the other entries would be zero. So it's a one-hot
representation, as they sometimes call this, where the state is kind of picked out by the feature vector. Now, state aggregation is very similar, except that the size of the feature vector is no longer necessarily the same as the number of states. So n, as we call it here on the slide, the number of outputs of this feature mapping, so the number of features, would not be equal to the number of states; it would typically be much, much smaller, and you could indeed even apply this to continuous state spaces. But it would still be a one-hot representation when we do state aggregation. So there's some mapping that says: if you're in this state, then this entry must be one and everything else must be zero. If you're in a similar state, maybe the same entry will be one and everything else will be zero, but if you're in a different state, one further away for instance, then a different entry would be one. So it's still a one-hot representation, but it's a smaller feature vector than if you would do the fully tabular approach. Both of those are special cases. You don't need to have a one-hot feature representation; you could consider much richer function classes here.

Now, one step up from that, if we're going to be even more flexible, would be to consider differentiable function approximation. In this case our value function will be a differentiable function of our parameters w, and it could be nonlinear. For instance, it could be a convolutional neural network that takes pixels as input. You don't need to know right now what a convolutional neural network is; if you don't know what that is, it's just some nonlinear function. One way to interpret this is that you can consider the feature mapping now no longer to be fixed. So we take whatever the agent state is as input, and for instance the input could just be the observations, it could be the pixels on the screen when we're, say, playing an Atari game or something like that, and then instead of
having a fixed feature mapping, you could consider this to be a learned feature mapping, where something maps the input into features and then there are additional parameters on top of that which map the features into values. In the notation we're going to merge all of those parameters into one big vector w, but it won't be a linear function; it will be some nonlinear function where these parameters are used in different ways. What we're requiring here, and why we call this differentiable function approximation, is that we can still compute gradients, because we're going to consider gradient-based algorithms. That's not a strict requirement, but it is one that we're going to use here, and it allows for very rich function classes such as deep neural networks.

Okay, in principle any of these can be used, but it's good to acknowledge that reinforcement learning does have, or maybe often has, certain specific properties which might interact with the function class, so it's good to be aware of those. The first one that I'm going to mention is that the experience is not independent and identically distributed, by which I mean that if you're going to learn online from a stream of experience that's just coming at you, then successive time steps are actually correlated. This is not necessarily a problem. I'm not saying it because it's necessarily a problem, but because it's slightly different from the setting in which many of these function classes were developed. For instance, deep neural networks were originally often developed and tuned for classification, for instance MNIST digit classification, and some choices that were made there might not be completely appropriate for reinforcement learning, because we're breaking some of the assumptions, such as that we can sample each of these learning examples independently. In addition to this, the agent's policy affects the data it receives. What I mean with this is that reinforcement learning is an active problem, and this has multiple
consequences, including a benefit: you can maybe actively sample data in a way that is useful for learning your function. You could literally have an exploration technique that seeks out certain parts of the state space in order to improve your value estimates specifically over there. It's not quite active learning, though, because you can't just sample any sample. Within the field of active learning the typical assumption is that you have a database and you want to actively pick which samples to use, but you can pick from all of them. That's not typically the case in reinforcement learning because of the way we're interacting with the world: if you want to sample a certain situation, you actually first have to go there. But it is an active process, and that's interesting, but it also causes some issues. For instance, the regression targets are often non-stationary. This is for instance because we're doing control: we might be changing our policy as we go, and this might not just change the targets for our regression, our value targets, for instance if we bootstrap, but it might also change the data. So it's actually non-stationary in more than one way. In addition, like I said, bootstrapping is important. Even if the policy is fixed, so even if your policy is not changing, the targets might change because the value estimates that you're using to construct your targets could be changing. In TD we're using our value estimates, and those initially are not very good, but later they might become more accurate. In addition to this, the world itself might be non-stationary. For instance, there might be other learning agents in the world, and maybe we also want to deal with those cases. And similarly, but differently, the world might just be very, very large, so this might mean that you're never actually quite in the same state. These last two points are actually quite closely related, because if we allow for the world to be partially
observable, then if the world is sufficiently large it might appear non-stationary to you, because you might think: I'm going to that room, I'm going somewhere else, and going back to that room things have slightly changed. Maybe the sun is now setting whereas it wasn't setting before, or maybe all of a sudden there's a different agent in the same room. This is maybe not literally non-stationary, maybe the actual underlying physics of the world, say, are not changing, but you don't know all the latent variables in the environment state that make these things change. So in some sense non-stationary and very large are similar, but of course they have different connotations. You could have a very tiny non-stationary problem as well.

So which function approximation should you choose? We covered a couple of function classes. Well, it kind of depends on your goals. One distinction to make is that for the tabular case we have really good theory. We understand these algorithms really well: we know when they converge, we know where they converge, they're quite stable, and the assumptions for convergence are not very strict, so you can fairly easily apply them and be pretty certain that they will learn something. In addition, I didn't put this on the slide, but they also tend to learn relatively well and fast if the state space is small enough. But they don't generalize and they don't scale: you can't apply them to very large problems, as discussed. Now, in the linear case we're already a little bit more flexible in our function class. We still have reasonably good theory, so we understand these methods quite well, and we know when they converge and where they converge in many cases. But of course we are dependent on this fixed feature mapping, so we do require very good features. Now, this is a bit of a subtle point, and people are still debating this in some sense, because there are multiple aspects to this. One is that of course you're going
to be limited by the features that you have, and maybe that would be a good reason to learn these features. I think most people these days are in that camp. But you could also argue that maybe it's okay to have somewhat poor features as long as you have sufficiently many of them. So the alternative would be to have linear function approximation but with a very, very rich set of features, and then maybe you still have a good enough set of features that you can pick out the ones that you can really use in the context in which you require them. And then of course there's nonlinear function approximation, which is less well understood, but it scales really well and it tends to work quite well in practice, until it doesn't, and then sometimes we don't really understand what's going on in some edge cases. But of course our experience with this and our understanding of this has been growing quite a lot over recent years. It's really flexible and, importantly, it's much less reliant on picking good features by hand. I personally find this to be a very convincing point, because one thing that this means is that you can apply these methods to problems that you don't really understand that well. As long as you have a well-formulated reward function that you are certain is the right reward function, you could apply these reinforcement learning methods and still find good policies for these domains without really needing to understand what appropriate features would be. And in fact we see over and over these days, not just in reinforcement learning but also in deep learning and related fields, that algorithms that don't try to hand-engineer features but instead try to learn them from data tend to outperform methods in which these features are hand-engineered. So that's interesting, and it relates again back to that point that insuring was also trying to make. So deep neural networks are typically used in this last category of non-linear
function approximation, and in some sense, depending on how you define these things, some people would argue that any nonlinear function is some sort of a neural network. It might be a weird one, or a very strangely structured one, but in some sense these are somewhat similar, so sometimes neural networks are used almost synonymously with nonlinear function approximation. These tend to perform very, very well in practice and therefore remain a very popular choice, and there's also lots of research happening, of course, trying to understand these methods better.

Okay, now this brings us to our next topic, where we're going to talk about gradient-based algorithms, first a little bit in general. So we're going to have just a very brief primer on gradient descent and stochastic gradient descent, and we're going to talk about this because of course we're going to use that for value learning in a moment. For now, let's consider some arbitrary function J. This is just some function only of the parameters: there's no state input, there's just parameters as input. And we're going to consider its gradient. Just as a reminder, a gradient is a vector, and it contains as its elements the partial derivatives of this function J with respect to each parameter in w. So J here is a scalar function, by which I mean that the parameter vector can be large but the output is just a number, and we're considering the gradient of that function J with respect to the parameters, which we of course should evaluate at certain parameters: the gradient depends on where w is at this moment. The goal could then be to minimize J. One way to do that is to move our parameters w in the direction of the negative gradient, and one way to do that would be the algorithm on the slide. There are actually different ways to do that as well that are similar but slightly different, with differences in how they pick
the step size or exactly how they move or the direction they move in, but a lot of the algorithms derive essentially from this main algorithm, which is the gradient descent algorithm, where we have some step size parameter and we just move slightly in the direction of the negative gradient. The one half here is optional, maybe I should get rid of that. This is kind of assuming that there's a squared loss, that J is some sort of a square, and that therefore the one half will cancel out with that. The step size parameter should be in some sense small, because if we take a small enough step, we're in some sense certain that we're going to descend, as long as this function J is smooth enough. What I mean with smooth is that the function can't have a very big discontinuity. So if you know your function is somewhat smooth, then you know that if your step is small enough, the gradient will have pointed in the right direction and indeed you will go down. Now of course in practice there's a bit of a trade-off happening here, because you don't want to make alpha too small, because then learning is very slow. So instead you typically tune this to get good performance.

Now if we plug in value functions, here we have the squared loss that I promised. For instance, we could consider this one, where we are simply interested in this one number. Again, J doesn't take states as input; it's just a number that depends on your parameters w. The way it depends on them is that we define it to be the expectation over some distribution of states, for instance due to your policy and of course the MDP dynamics, of the squared difference between the actual true value of your policy, so we're doing prediction in this slide, and our current estimate. That's something that we might want to minimize, right? This will be a predictive loss. And then we can just consider this gradient descent algorithm. Here the one half comes in handy because it cancels out with the two that comes
from taking the derivative of the square. We note that the distribution in this case does not actually depend on our parameters, so we can just push this gradient into the expectation, and then we get this algorithm, this update if you will, where the update to the weights, delta w, is a small step size alpha times the expectation, under that distribution over states, of the target, which is our true value function for that state, minus our current prediction, times the gradient of that current prediction. This is an expectation, so we can't typically use that unless we really have the distribution over states. But if we want to use this in practice we can sample this, and that means that we're going to sample the state we're in and we're also going to sample this true value of the policy. One way to do that would be to plug in the Monte Carlo return, which is indeed an unbiased estimate of the true value of the policy. So this stochastic gradient descent update on the slide here is an unbiased estimate of the gradient descent algorithm one line above it, and that means that if we take our step size to be small enough and we take enough of these steps, we will on average move in the direction of the negative gradient and we will basically reach the same solutions as the full gradient algorithm would. So in that sense stochastic gradient descent is fine, and it's just another reason to pick a small step size: not just because then the gradient is valid, but now also to average out the noise in our estimate of the gradient.

We are often a little bit sloppy with notation, and this is just kind of a warning for that. Whenever we drop terms, let's say we for instance write the gradient of v and we don't have w's anywhere and we don't specify where we're taking this gradient, then typically when we have a gradient we mean with respect to the parameters. So the gradient of v just means with respect to whatever the
parameters of v are. So if those are w, then this would mean the gradient with respect to w, and then we're also typically taking that at whatever the current value of these parameters is. This is something that you'd actually have to be explicit about: the gradient itself outputs in some sense a function, and you still have to plug in the parameters where you're evaluating that gradient. This is typically done in notation with this bar, where the subscript tells you where we're evaluating the gradient. But people are often very sloppy with the notation in this, and there are lots of shorthands. For instance, people might instead write the gradient with respect to w_t of the value function of w_t instead of having this bar notation, and that's fine as long as it is clear from context what we're talking about. So I just wanted to highlight that here so that you have a feel for how to interpret it when something gets dropped from the notation: it should be clear from context.

Okay, and now we're going to talk a little bit more in depth about linear function approximation, which is going to be useful to understand the algorithms and also some of the limitations of the algorithms. We talked about this previously, but I'm just going to reiterate a couple of things that we said before. We're going to represent the state by some feature vector, where this feature mapping x is going to be considered fixed. We have n elements here, n different features. These are just numbers, and they're functions of state: the mapping might for instance take the observations as input, or more generally your agent state, and it outputs some feature vector of a certain size. We also introduced some shorthand, where if we apply this function to the state S_t at time t, we can also just write x_t, and we can kind of pretend that these are just our observations. As far as the algorithm is concerned, it only gets access to x_t; it doesn't actually get access
to S_t anymore, it just gets access to these features instead. So for example, these features could include the distance of a robot from some landmarks, or trends in the stock market, or piece and pawn configurations in chess, or whatever you can come up with that might be useful for the problem that you're trying to solve. I'll also show you an example where you can find ways to kind of automatically pick reasonable features for some problems.

So this is that example. This is called coarse coding, and it's one way to come up with a feature vector. It's a little bit related to the state aggregation that we talked about before, but in this case we're not going to subdivide the space into disjoint chunks. What you see here in the picture is that we've subdivided a two-dimensional space, which you could think of as a location if you want to make it concrete, I didn't make it very concrete here, into overlapping regions. We're going to have a feature associated with each of these circles: whenever you're in a circle, say we're here at this x, the feature will be one, and whenever you're outside of the circle it will be zero. If you do that with this representation, in this case three features will be equal to one and all the other features will be equal to zero. So this is no longer a one-hot representation, as we discussed for the state aggregation or the tabular cases; in this case it would be a few-hot representation in some sense. The features are still binary, they are either zero or one, but it is in some sense already a richer representation, because as we can see from the figure, if these three regions light up, we know we must be in the intersection of all three of them, and that can be useful information. But in addition to that, we will still have some generalization.
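
As a rough illustration of the binary coarse coding just described, here is a minimal Python sketch. The circle layout, the three query points, the step size and the target value are all made up for this example and are not taken from the lecture figure. The helper names `coarse_features`, `value` and `sgd_update` are hypothetical. The update is the standard gradient step on a squared error for a linear value function, where the gradient of the prediction is simply the feature vector.

```python
import math

# Hypothetical circle layout: (center_x, center_y, radius). Not the lecture's figure.
CIRCLES = [
    (0.0, 0.0, 1.0),
    (1.0, 0.0, 1.0),
    (0.5, 0.8, 1.0),
    (3.0, 3.0, 1.0),
]

def coarse_features(point, circles=CIRCLES):
    """One binary feature per circle: 1.0 if the point is inside it, else 0.0."""
    px, py = point
    return [1.0 if math.hypot(px - cx, py - cy) <= r else 0.0
            for cx, cy, r in circles]

def value(weights, features):
    """Linear value estimate: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))

def sgd_update(weights, features, target, step_size=0.1):
    """One gradient step on the squared error of a linear value function.

    For a linear function the gradient of the prediction with respect to
    the weights is the feature vector itself.
    """
    error = target - value(weights, features)
    return [w + step_size * error * f for w, f in zip(weights, features)]

x_point = (0.5, 0.3)      # lies in the intersection of the first three circles
near_point = (-0.8, 0.0)  # lies in the first circle only
far_point = (3.0, 3.2)    # lies in the fourth circle only

weights = [0.0] * len(CIRCLES)
# Update the value at x_point towards an (assumed) target of 1.0.
weights = sgd_update(weights, coarse_features(x_point), target=1.0)

print("features at x_point:", coarse_features(x_point))
print("value at near_point:", value(weights, coarse_features(near_point)))
print("value at far_point: ", value(weights, coarse_features(far_point)))
```

In this sketch three features are active at `x_point`, so the representation is few-hot rather than one-hot. After a single update at `x_point`, the estimate at `near_point` has moved as well, because the two points share one circle, while the estimate at `far_point`, which shares none, is unchanged.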
If we're going to update the value associated with a state in this very darkest region over here, that means that a different state, say at y, will also be updated a little bit, because it shares one of the features with the state that is going to be updated. If we think of the linear function that is going to be applied to this, it'll have a weight associated with each of the features, so if we change the weight of a feature, any position that shares that feature, where that feature is one, will also experience a slight change in its value. Now, I described the binary case for simplicity, where you're either in a region or you're out and the associated feature is either one or zero, but of course that's not the only option. You could instead do something like over here, where you have some sort of a distance function that might be, for instance, Gaussian. Instead of saying you're in a circle or you're out, we have a bunch of center points of these circles and we just measure how far you are from each and every one of them. In this case, for instance, we might have this feature light up quite a bit because we're a little bit closer to its center, and this one light up a little bit less because we're further away from its center. Maybe even the feature over here still lights up very, very lightly, but much less so than the features much closer. Now of course, similar to the circles, you would have to consider how far away you want your features to be sensitive, but it would provide a way to be a little bit more continuous with your value functions. Whatever you do there, basically the idea is that you have your representation, you expand this into a whole bunch of different features, and then you compress that with your weights, which are here denoted theta, but you can think of this as being the same as the w from the previous slide, into a single
number. Why do we say expand here? Well, note that in this case we had a two-dimensional representation. We could have had, say, an x-y location, which would constitute two numbers, and we could have fed that to the agent. But maybe it's really hard for the agent to turn those two numbers with a linear function into a valid value estimate, right? The only thing you can do then is some sort of a linear function over the state space. You can't have any nonlinearity in how the value reacts to your location if you only feed in the actual location numbers, the x-y coordinates. So what we're doing here instead is turning those two numbers of this two-dimensional space into very many numbers, as many as we've defined circles. In some sense we're blowing up the representation in order to then be able to apply a linear function to it. This is a common technique. It's also used in, for instance, support vector machines. It doesn't matter if you don't know what those are, but if you do, then you might recognize this principle, where we're basically just blowing up the feature space into something large in the hope that the feature representation will then be rich enough that a linear function suffices, so that if you then apply a linear function to these features you still have a good enough approximation.

Now, it might immediately be obvious that there are some limitations to this approach, or if not, let me point at some. If you only have two dimensions, maybe you can define these circles and you're good to go. But what if you have many, many dimensions? What if you have 10 dimensions, or maybe even 100 dimensions? This is not far-fetched, because of course if we think of locations in the world, maybe we'll think two dimensions, three dimensions, maybe that's enough. But as I mentioned for the helicopter example, a helicopter might not just be in a location in three-dimensional
space, but it might also have other sensors. It might have some audio sensors, it might have wind sensors, it might be measuring, I don't know, air pressure or humidity. These would all be dimensions, and it might be very hard to figure out a good subdivision of that space without making the feature representation extremely large. This is sometimes called the curse of dimensionality, a term due to Richard Bellman, also known from the Bellman equations.

Okay, so this is just meant to be an example of how you could somewhat automatically construct these features. You could imagine that if you do have a fairly low-dimensional space, like two dimensions, you could just sprinkle a couple of these circles in there. You maybe don't have to think very carefully about where they are, and maybe you don't need to understand the domain that well to get a reasonable feature representation. But of course there is a choice here, and here's a couple of examples. On the left-hand side, in (a), we see narrow generalization. We've picked these circles to be fairly small, and what this does is it allows you to be quite specific. The value function is updated only in the region that lights up, and mostly in the smallest region, which means that you have narrow generalization: if you update this state, states fairly near to it get updated as well, but states further away are not. And that's both good and bad, this is a double-edged sword. The narrower the generalization becomes, the more specific and the more precise your value function can be. If you need very high resolution in some sense in your value function, because the actual value function might for instance be quite bumpy or weird, then it might be useful to have very narrow generalization. But it does mean that learning might progress quite slowly
because there's very little leaking of information from states further away. So instead you could consider having broad generalization, where the overlap is much larger, the circles are much larger. Then updating just a state in the middle would actually update a fairly large region of your state space. This can be useful because it might speed up learning, but you lose out in terms of your resolution: in the limit, when you've learned your value function, there's a limit to how precise you can make it in every situation. Of course, depending on your dimensions here, we don't know what x and y are here, it could be that you want to pick something more custom, that the best representation would be asymmetric. So this already alludes to the fact that it can be quite tricky to pick these things in general.

The other thing I want to note is that we're actually aggregating multiple states, and this means that the resulting feature vector, and hence what the agent observes in some sense, will be non-Markovian. What I mean with that is this. Say you're in this little region here in the middle and the agent moves right, but it only moves a little bit. Then you would end up in the same region and the same features would light up. So from time step to time step, while you remain in a small little region in the middle, the feature representation would not change while you're moving within that region, and then at some point you actually step out of the region into the next one. Now, this is why that's non-Markovian: the agent can't tell at which time step that happens. In the underlying state there is of course a certain moment at which you actually transition. Let's say the effects of the actions are deterministic, so when you move right you actually move right, so maybe there's a true transition
which is completely deterministic, but the agent doesn't know when it happens. As far as the agent is concerned, at some random point in time you'll transition from one region to the other. This non-Markovian case is the common case when using function approximation. It's not specific to this representation; it's just very easy to visualize here. Whenever you use function approximation, including deep neural networks and so on, you should count on the fact that, as far as the algorithm is concerned, there might be some partial observability, because a good function approximator will in some sense generalize across states. For the linear case specifically, we do want to consider whether good solutions even exist, so it's good to do this mental check. If you're given a certain feature representation, or if you're picking one yourself, it's good to imagine: if I had the best possible weights, what would it look like? Is it good enough? Is my generalization narrow enough that I can have good enough resolution, or is it broad enough that I can have fast enough learning? It's good to just think that through for your feature representation, and then sometimes you might catch things like: oh, there's no way you could learn this value function, because it simply cannot distinguish between these two different important states, for instance. Neural networks tend to be much more flexible in that sense. If you just give pixels to the neural network, it could itself figure out how to subdivide the space in some sense and come up with some sort of internal feature mapping that is maybe more suitable to the actual problem that you're trying to solve. You could imagine that if you could make these feature mappings flexible, this could be
a lot stronger, where maybe you have asymmetric generalization in some part of the state space, broader generalization in another part, and very narrow generalization in some parts where it's really important to get the values right. A neural network can in some sense automatically find that.

Okay, now we're going to move on to linear model-free prediction. We're going to talk about how to approximate the value function with a linear function approximation. We talked about this before, but just to be very explicit: we're going to approximate the value function by having this parameter vector w multiply this feature vector x, which we can write out explicitly as a summation over all of the components of both of these vectors; that's just a dot product. Then we could define our objective to be, as before, a quadratic objective where we compare to the true value v_pi. For now let's just assume that we have that true value; we'll plug in real algorithms in a moment, but for now we just consider that we have those regression targets. There's going to be some distribution over states. This distribution is called d here, and for instance it could be due to your policy, or maybe you have some different way to sample the states. Then this algorithm can converge to the global optimum if we do stochastic gradient descent and regress towards this true value function, even if we use a stochastic update, where the stochasticity here comes from which state we happen to sample. You see the update here, and the update rule in this case is fairly simple, because in the linear case the gradient of our value function is simply our feature vector. That means that our update to the weights, our stochastic gradient descent update, will be the step size times the prediction error (the true value minus our current prediction, where the true value is v_pi and our current prediction is v_w) times the feature vector. The feature vector here is just the gradient of your
value function; in the linear case those are the same. Now, we obviously can't update towards the true value function if we don't have it yet, so instead we're going to substitute the targets. This is similar to what we talked about before when we talked about prediction. For instance, for Monte Carlo we could plug in the Monte Carlo return, but of course we could also do temporal difference learning, and then we could plug in the temporal difference target, which in the one-step case is just the immediate reward plus the discounted value at the next state. So this is now a placeholder for the true value of this state, and we can use that to have data-dependent updates where we don't have any hidden information anymore; we don't have any privileged information about the true value of the state. Of course we can also go in between these with TD(λ), as discussed before, where G^λ is the λ-return, which bootstraps a little bit on every step and continues a little bit on every step, and this λ trades off between these two algorithms, TD and Monte Carlo. As mentioned earlier, and we're not going to go into it now, in an earlier lecture we then turned this into eligibility traces, where we still have a causal algorithm that can update on every step rather than having to wait all the way until the end of the episode. So this is just a reminder that TD(λ) can be implemented without having to wait until the end of the episode, and in fact so can Monte Carlo. Now, the return is an unbiased sample of the true value, and therefore we can view this as supervised learning, where we just have a data set of states and returns. The linear Monte Carlo policy evaluation algorithm, which uses that return and has the feature vector in place of the gradient, will converge to the global optimum. This is known because we're basically trying to find the minimum of a quadratic function, which is convex, and then stochastic gradient descent
with an appropriately decaying step size will find the optimal solution. So that's nice, under some mild assumptions. One of the assumptions is that we sample i.i.d., but it turns out we can relax that assumption, and it will also find the right solution if you don't sample fully i.i.d., as long, again, as your step size decays sufficiently. Of course there are some other conditions as well. Just to mention one of those: you can only find the global solution if your data supports that. One thing that you need, for instance, is that you can never make an irreversible decision. If there's just one state where you make a decision from which you can never go back, that means that the states before going into that state won't have been visited infinitely often, which means that convergence, for those states at least, is off the table. This is an assumption that's often called ergodicity. There are other, similarly named assumptions as well, but it basically means that your MDP is connected, in the sense that you can always eventually return to states you visited before. This actually also converges when using nonlinear value function approximation, although then it can converge to a local optimum. What we mean here is that if you have a nonlinear value function, your loss landscape might have multiple hills and valleys. Stochastic gradient descent on that loss landscape is still guaranteed to go down, but it's a local algorithm, so it could go down into some valley and get stuck there, even though there's a way to have lower loss somewhere else. But it's still nice to have that convergence, because it turns out we can't always guarantee convergence, and sometimes you might have parameters that diverge, that go to infinity for instance. We'll see an example of that later in this lecture. Of course we can also consider TD learning. We know that the target here is a biased sample of the true value, but we can apply the same trick, where we consider this to kind of be
training data, where we have states and the associated targets for those states, and we could update towards those in a very similar way. So this is very similar to Monte Carlo; this is basically just doing a regression in some sense, but using the TD error rather than the Monte Carlo error. This is now a non-stationary regression problem, because we're using the value function itself to construct the targets. It's good to note that this is different from the Monte Carlo case. The target depends specifically on our parameters, so it's not just any non-stationary regression problem: it's one where we're updating the parameters ourselves, and that makes the targets non-stationary. That's something to keep in mind, because this turns out to be important for the stability of these algorithms.

So now we're going to very briefly touch upon control with value function approximation. This will be relatively straightforward because we've talked about these things before. Just as a recap, we're going to do some version of policy iteration, and policy iteration depends on two steps: one is policy evaluation and the other is policy improvement. To start with the latter: for policy improvement we consider something fairly simple. If we estimate action-value functions, we can just use epsilon-greedy policy improvement (or greedy policy improvement, but then you wouldn't explore sufficiently). The approximation step is the difficult one if your state space is really large, and therefore we're going to plug in value function approximation for that step. So we're basically just going to approximate the value function for the state-action pairs in a very similar way as before. If we use a linear function, we could for instance define our feature vectors as dependent not just on the state but also on the action, and then there is the shared
parameter vector w, which is applied to this feature vector and then gives you the suitable value estimate. The update is as before; this is extremely similar to what we had before, except that everything now depends on states and actions rather than just states. In the linear case it turns out there are actually multiple things we could be doing. A different way to approximate the action-value function is as follows: instead of defining a feature vector that depends on state and action, we could have a feature vector that only depends on state. The action-value function is then no longer a scalar function, which is why I bolded it here. So this is the difference with the previous slide. Let me go back to the previous slide first: here we're doing the same thing as we were doing for state-value functions, where we now have a feature vector that depends on both state and action, and the q value is just a number. We have our feature vector x, which is the same size as our parameters w, and the inner product will just give you a number. Now we're going to do something different, where we have a weight matrix W and we're going to multiply a feature vector that only depends on state with this weight matrix. The weight matrix is shaped in such a way that the output will have the same number of elements as we have actions. A different way to write down the same thing is as follows, where we still have this scalar function of state and action, and it's defined by indexing, with the action, the vector that we get from this first operation. So we have a matrix of size number of actions by number of features; when we multiply that with the feature vector, we get a vector of size number of actions, and then we just index into it to get the action value for the action that we're interested in. This can also be written as follows, where we basically have a separate parameter vector per action, but
we have a shared feature vector. So let me again go back: previously we had one shared parameter vector for all actions and separate features for each action; now we have one shared feature vector and separate weights per action. This is a slight difference, but it might be important in some cases. The updates are very similar to before: we can just update the parameters associated with the action that we're interested in, and not update the parameters of all the other actions, because they correspond to different actions. So b here is just any action that's different from the action a that we're updating. Equivalently, we can write this down as follows, where i here is an indicator vector that indicates the action: it's a one-hot vector that has as many elements as there are actions, and only the element corresponding to the action that we've just selected has a value of one, and all of the other elements have a value of zero. Then we take the outer product with the state features, and this gives us the update to our parameters. Now, this might look a little bit complicated; feel free to step through this much more slowly than I'm doing right now. But before you do that, it might be good to recognize that a lot of these things can be automated. There are a lot of software packages these days that use automatic differentiation, which allow you to compute derivatives without doing them by hand. I want to make you aware of the differences between these methods, because they are different, these different ways to represent an action-value function, but in terms of computing these updates we don't actually have to manually compute these outer products by hand. Instead we can just call a gradient function from an automatic differentiation package and use that. Examples of those include TensorFlow and PyTorch, and I like JAX these days to do that. Okay, so this raises a question.
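Before turning to that question, the action-out update described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the weight values, the feature vector, the regression target, and the function names `q_values`, `update_row`, and `update_outer` are all made up for the example and do not come from the lecture slides. The sketch checks that updating only the row of the selected action gives the same result as adding the step size times the error times the outer product of a one-hot action indicator with the state features.

```python
def q_values(W, x):
    """Action-out linear Q: one weight row per action, shared state features x."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]


def update_row(W, x, a, target, alpha):
    """Update only the weights of action a; rows of other actions are unchanged."""
    delta = target - q_values(W, x)[a]
    new_W = [row[:] for row in W]
    new_W[a] = [w + alpha * delta * x_i for w, x_i in zip(W[a], x)]
    return new_W


def update_outer(W, x, a, target, alpha):
    """Same update written as alpha * delta * outer(one_hot(a), x)."""
    delta = target - q_values(W, x)[a]
    one_hot = [1.0 if b == a else 0.0 for b in range(len(W))]
    return [[w + alpha * delta * i_b * x_i for w, x_i in zip(row, x)]
            for row, i_b in zip(W, one_hot)]


# Illustrative numbers (assumptions): 2 actions, 3 state features.
W = [[0.0, 0.0, 0.0], [0.5, -0.5, 1.0]]
x = [1.0, 0.0, 2.0]
W_row = update_row(W, x, a=1, target=1.0, alpha=0.1)
W_outer = update_outer(W, x, a=1, target=1.0, alpha=0.1)
```

Here `target` stands in for whatever regression target is used (a return or a bootstrapped target). Both functions leave the row of action 0 untouched and move the row of action 1 along the feature vector, which is what an automatic differentiation package would compute as the gradient of the selected action value with respect to the weight matrix.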
Should we use action-in, where the action is part of your feature representation, or action-out, where we have a state-dependent feature representation and compute the action values for all of the different actions? The action-in approach is when your feature vector depends on state and action and you have the shared weight vector w. The action-out approach is when you have a weight matrix W and features that only depend on state. One reuses the same weights across actions; the other reuses the same features across actions. It's unclear which one is better in general, but they do have different properties. For instance, if we have continuous actions, it does seem easier to have an action-in approach, because you can't have an infinitely large matrix, so the action-out approach seems much more troublesome with continuous actions. A different way you could deal with that is to parameterize your policy differently, and not have literally a separate value for every action; if you have infinitely many actions, that might be hard. You can do that by having an action-in network, or there are other ways you could do that if you want to do something similar to an action-out network, but we won't go into that here. I just wanted you to be aware of these choices, which are in some sense representation choices. We're picking our function class, in a sense, and even if we restrict ourselves to a linear function class, it turns out there are already design decisions that you have to make, and these might matter for the performance of these algorithms. For small discrete action spaces, the most common approach these days is action-out. For instance, the DQN algorithm, which I'll come back to at the end of this lecture and which was used to play these Atari games, took this approach: there was a neural network, a ConvNet in this case, which would take the pixels as input, then go through a couple of layers of convolutions and then a couple of
fully connected layers. If you don't know what those terms mean, that's okay, but in the end it would output a vector with the same number of elements as there are actions. In Atari there are up to 18 different actions, so this is a fairly manageable size: we just output a vector of size 18, and the semantics of this vector would be that each element corresponds to the value of that specific action. For continuous actions it's much more common to have the action-in approach, where the action is somehow given as an input to the network. You have to represent it somehow, of course, so that the network can easily use it, but then you can do that.

Okay, now, we talked in the previous lecture about SARSA as TD applied to state-action pairs, so it inherits the same properties. What we're doing now is moving towards TD algorithms, but it's easier to do policy optimization with SARSA than with TD, because we have these action values and therefore we can do policy iteration. So, kind of obviously, we might want to use SARSA to learn these action values with function approximation, and indeed you can do that. For instance, it was done for this small example here, which is, well, it's not really a game, more like a benchmark: mountain car. So how does this work? Let me explain. The car here starts at the bottom, or somewhere random in this valley, depending on which version of this environment you look at, and the goal is to drive up the hill to its right. There are only two actions, or three, depending on the version of this setting, and these actions correspond to going full force in one direction, full force in the other direction, or not doing anything at all. This is literally applying force: if you pick to go right, that doesn't mean that you'll drive with a fixed velocity to the right; it means you're applying a certain acceleration to your car. So there's a momentum here
to the car, and of course if the car happens to be driving downhill it will speed up, and if it happens to be driving uphill it will slow down. It's a very simple physics simulation in some sense. The goal is to go up that mountain to your right, but it turns out the car's engine is not strong enough to do that. So the actual optimal policy is to go left first, then start driving down, and use the momentum from going down to then drive up the other hill. This is sometimes used because it's a somewhat tricky exploration problem: if you don't know there are any goals to be had, or any non-zero rewards to be attained, it could be quite hard for the car to figure out that it should leave this ravine at the bottom. What we see here is a value function after a number of episodes, and we see that the shape of that value function changes over time as we add more episodes. The interesting thing that I wanted you to look at here is not just the evolution of this value function but also its shape at the end, which is quite special; it's not a very simple value function. This is because it depends on where you are. If you're far enough up the hill to the left, then the optimal choice would be to go down and then up the other end, but if you happen to be slightly further down, not high enough up the hill yet, then it could be better to go up further. This also depends on your velocity. Let's pick the bottom part of the valley: if you're here and you happen to have no velocity, then you should be going to the left, because you should be driving up this hill first and then down and up the other. But if you're here at the bottom and you do have a velocity, then at some point you have enough velocity that your optimal policy switches from going to the left to going to the right, straight to the goal. So you
can see these discontinuities showing up in the shape of this value function here, where you can see these ridges. These are indeed the differences between: am I already going fast enough that I should be going right, and will I then reach the goal very quickly? Or am I too far away, and should I be going to the other end, so that eventually I'll get there but my values will be lower because of discounting? So this is a discounted problem, and this is why you can see that the agent basically wants to reach the goal as quickly as possible. It's not just because of discounting: you could also, for instance, give a reward of minus one on every step, and you would see similar discontinuities as well. So what was the representation here for this mountain car? It's using tile coding, and I'll briefly explain what tile coding is without going into too much detail. Basically, there are these two inputs: there's the velocity of the agent (let me go back to this one) and there's its position, and this space was cut up into small tiles by discretizing them. The simplest version of tile coding is just state aggregation: you just cut up the whole space into little squares and you're done; this is your feature representation. Alternatively, you could cut it up into larger squares but then have multiple of those overlapping. This is similar to what we talked about for coarse coding, where now, instead of being in one of these squares, you could be in multiple squares at the same time. One such discretization of the state space we could call a tiling of the state space, and then tile coding could apply multiple tilings that overlap, but not exactly; they're offset from each other, so that you could be in one tile from one tiling but in a different tile from a different tiling. If you move slightly, you transition from one tile to another in one of the tilings, but not
necessarily in the others. This is very similar to the circles we saw before for coarse coding, so it's basically the same idea but with squares rather than circles. Once you have that, you have a feature mapping and you can apply linear function approximation to it; this was applying linear SARSA to that. Here's a plot showing the performance of the algorithm, where the y-axis is the steps per episode. So this is not your reward, but it's very related to the reward, except that lower is better: fewer steps per episode means you're doing better. This is averaged over the first 50 episodes, and then also over 100 runs to get rid of some of the variance in the results. We can see here an n-step approach, where there is an n of one or two or, well, three is not there, but four or eight, and we can see something similar to what we saw before: depending on the step size and the n in your n-step return, you get a slightly different profile, where larger n tend to prefer smaller step sizes, but for intermediate n you get the best performance. We saw this before for pure prediction; it turns out to also hold for control that these intermediate values tend to work well, especially if you're interested in a limited amount of data. Note that the x-axis is actually our step size times the number of tilings, in this case eight. This is a common thing to do with something like tile coding. Note that all of your features are binary: they're either one or zero, because you're either in a tile or you're not. So the number of features that are equal to one in your feature representation is equal to the number of tilings, and it turns out it's convenient to pick your step size in a way that takes this into account, because the magnitude of your feature vector now depends on the number of tilings. This is more
generally a useful thing to keep in mind: the magnitude of the features might matter, and if you're just doing the SGD algorithm you might actually want to adapt your step size to take it into account. For instance, if you know the average magnitude of your features, maybe you want to adapt your step size to be lower if that magnitude is higher, or vice versa. Alternatively, a similar but opposite approach would be to normalize the feature vectors. Maybe you don't want to do that on a case-by-case basis, because that might change the semantics of a feature vector, but you could have some sort of global normalization, which means that, for instance, the average feature vector has unit length, and that might make it easier to pick your step size. If you don't do this, which is maybe somewhat obvious in hindsight, you could consider the exact same algorithm with different features, where the only difference between the features is some constant scaling, and it turns out the performance is different unless you take the scaling into account when picking your step size. So that's a little subtlety that's good to be aware of. In practice, what people often do is tune the step size to get good performance, for instance by just running for a little bit of time, or by running on some similar problem if you need to do well immediately on the real problem. But often we do these things in simulation, so you can run this multiple times and just find out what good values are. Then it's very useful to make plots like the one we have here, which is called a parameter study: we're not just looking at the best performance that we could tune for, but at how sensitive these algorithms are to different step sizes, n-step parameters, and so on.

Okay, that concludes that example. Now we're going to talk
more generally, a little bit, about convergence and divergence of these algorithms. When do these algorithms converge? What do we mean by convergence? We mean that they find a solution reliably, that they find some sort of minimum of something. So do they converge when you bootstrap, when you use function approximation, when you learn off-policy? It turns out it's a little bit subtle; it's not always the case. Ideally we want algorithms that converge in all of those cases, or alternatively we want to understand when the algorithms converge and when they do not, so we can avoid the cases where they do not. We'll talk about this at some length now. Let's start with the simplest case, which is the Monte Carlo setting. I talked about this already; I mentioned that this is a sound algorithm, it will converge. For the Monte Carlo algorithm, we're not considering the stochastic case first; we're just going to consider the loss. You can see an equation there which has an argmin over something, but let's first focus on the thing inside. There's an expectation over the policy. This is similar to what we had before, but instead of having a d there I now put pi there, to illustrate that the distribution of states, but also the return, depends on your policy in this case. Then we consider this squared loss, which is our Monte Carlo loss essentially, and we want to find the parameters that minimize the squared error. We're going to call that w_MC, for Monte Carlo, and I'm basically just stating here that the solution to this will be the equation there on the right. This is basically the linear least-squares solution, so if you've seen that before, this will look very familiar. What we see here is essentially the expectation of the outer product of your features, and that expectation is then inverted (this is a matrix, so this is a matrix inverse), times the expectation of the return times your features, and that
turns out to be the solution that Monte Carlo finds. We can verify this, so let's just step through that. When are we done; when has the algorithm converged? Well, it's a gradient-based algorithm, so we have converged when the gradient is zero, or I should say when the expected gradient is zero, because we're doing stochastic gradient descent. If the expected gradient of our loss, the thing that we're optimizing, is zero, that means that in expectation we're no longer changing the parameter vector. When this is the case, we have reached what is called a fixed point. So we can just state: okay, we're going to be at some sort of fixed point. Let's assume that we don't know what w_MC is yet; we're going to derive that here. We're just going to take the gradient with respect to those parameters, and we're going to say that it's equal to zero at the fixed point. Then we can write that out. The first thing to do is to write out the value estimate explicitly as the linear function that it is, and then we're going to move things around. First we're going to pull the feature vector, which is outside of the brackets, inside. We notice that this inner product times a feature vector can also be written by putting the feature vector in front of the inner product, because the inner product is just a number, so we can move it around. But then we can consider this to be an outer product times your weight vector instead, because the order of the products here doesn't matter: we could put brackets around the inner product or we could put brackets around the outer product, and that's the same thing. Otherwise, the x just gets moved next to the return. So we've just pulled that feature vector inside the brackets, that's all. Then we rearrange: we note that this expectation is just one expectation minus a different expectation, so let's move the minus part to the other side, and
then we see that this should be equal to that when the gradient is zero. Now all that's left to do is to take the inverse of that matrix on the left-hand side, which is therefore assumed to exist, otherwise this step is not valid, and then we get our solution. So when is that step valid? There's a distribution over states there, and there's an outer product of feature vectors. It's typically valid if your state space is large and your feature vector is small; that's one case in which it's valid. Assuming at least that your features are non-degenerate, so you have different features for different states, then if there are at least as many different feature vectors that you can encounter as there are components in your feature vector, this inverse typically exists. This is the common case: typically you will have many more states than features, and you will have features that do change when you go to different states, and then the expectation of this outer product will typically be invertible. But it is a technical assumption; otherwise you're not guaranteed to converge. It might be that the algorithm still converges, but not to a point: it might instead converge to a line or a plane if you can't invert that matrix. It depends a little bit on the technical details. Note that the agent state here does not have to be Markov. This is fully defined in terms of the data, which is kind of interesting. We don't have to have a Markovian feature vector or anything like that; the fixed point here is defined fully in terms of the x_t's. It's not even defined in terms of the agent state; it's defined in terms of the features that you observe. So that's kind of interesting: we didn't need that property.

Now we can maybe see if we can do the same for TD, so let's consider the TD fixed point. I'm going
to argue here that this converges to this different thing. Look at it for a second, and I'll go back to the previous slide. The Monte Carlo solution was this outer product, the expectation thereof, inverted, times the expectation of the return times your features. This looks quite a bit different. There's a different outer product here: it's still an outer product, but instead of being the outer product of the feature vector with itself, it's the outer product of the feature vector with itself minus the discounted next feature vector. That matrix is inverted, if the inverse exists; under similar assumptions as I just mentioned it typically exists, but it doesn't have to. And then this is multiplied by the expectation of the reward times your feature vector: not the return, but the reward. So this is a different fixed point. It doesn't just look different; it typically actually is different from the Monte Carlo fixed point. Now we're going to verify again, as we did for the Monte Carlo case, that this is indeed the fixed point. We're going to consider the TD update, and we're going to assume that the expected update is zero, where the expected update is given as this. This is similar to what we were doing for Monte Carlo: we're just considering the update, asking when this update is zero, and then manipulating it in a similar way. Here we're again going to pull in this feature vector first, and we put it next to the reward on one end. The step size here is actually not that important; I'll get rid of it in a moment. So we're going to put it next to the reward, and we're also going to put it next to these two values, but we're going to put it on the left-hand side, which we can do because these are just numbers. Having put that on the left-hand side, we notice that on the right-hand side we have w_TD in both terms, so we can also pull that outside of the brackets, which means we are left with gamma times x_{t+1} minus x_t, both of which are transposed because they were multiplying this weight vector.
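As a numerical sanity check of the two fixed-point expressions just described, here is a minimal sketch in plain Python. Everything concrete in it is an assumption made for illustration: a two-state episodic chain, a single scalar feature per state (so the matrix inverse reduces to a division), an undiscounted setting, and the names `TRANSITIONS`, `mc_update`, `td_update`, and `iterate_expected_update`. It computes the Monte Carlo and TD fixed points in closed form and then confirms that repeatedly applying the expected update settles at the same values.

```python
GAMMA = 1.0  # undiscounted and episodic (an assumption for this toy example)

# Hypothetical chain s0 -> s1 -> terminal, each state visited once per episode.
# Each tuple is (x_t, R_{t+1}, x_{t+1}, G_t); the terminal state has feature 0.
TRANSITIONS = [
    (1.0, 0.0, 2.0, 1.0),  # s0: reward 0, next state s1, return 1
    (2.0, 1.0, 0.0, 1.0),  # s1: reward 1, next state terminal, return 1
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Closed-form fixed points. With a scalar feature, "inverse" is just division.
# Monte Carlo: E[x x]^-1 E[G x]
w_mc = (mean(g * x for x, r, x_next, g in TRANSITIONS)
        / mean(x * x for x, r, x_next, g in TRANSITIONS))
# TD: E[x (x - gamma x')]^-1 E[R x]
w_td = (mean(r * x for x, r, x_next, g in TRANSITIONS)
        / mean(x * (x - GAMMA * x_next) for x, r, x_next, g in TRANSITIONS))


def mc_update(w, x, r, x_next, g):
    """Monte Carlo update direction: (return - prediction) * feature."""
    return (g - w * x) * x


def td_update(w, x, r, x_next, g):
    """TD(0) update direction: TD error * feature."""
    return (r + GAMMA * w * x_next - w * x) * x


def iterate_expected_update(update, alpha=0.1, steps=2000):
    """Apply the expected update repeatedly, starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        w += alpha * mean(update(w, *t) for t in TRANSITIONS)
    return w


w_mc_iter = iterate_expected_update(mc_update)
w_td_iter = iterate_expected_update(td_update)
```

With these particular numbers the Monte Carlo fixed point is 0.6 and the TD fixed point is 2/3, so the two solutions differ even in this tiny example, because a single weight cannot represent the true value of 1 in both states.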
Okay, so now we can again rearrange things. Let's pull this term to the other side. There's a plus here, so to put it on the other side it must become a minus, but we fold that minus inside this little term over here. Note that the order has changed: that's because there used to be a minus in front and we just pushed it inside, flipping the order instead. In addition, let's pull the transpose outside of the brackets, so we first subtract one vector from the other and then transpose, which gives this outer product. And now finally we move this part to the other side by multiplying with its inverse, again assuming that it exists. The learning rate is inside the expectation that we're inverting. For simplicity, consider a constant step size, just for argument's sake (you can get rid of it in other ways). Then this step size will cancel out with that one, because moving this one to the other side by multiplying with the inverse means we're multiplying with one over the step size, which cancels out with this step size. So the fixed point becomes this one. This is just verifying that the fixed point is indeed the one listed above, and we have substantial proof these days that this is indeed where TD converges, if it converges.

It's important to note that this differs from the Monte Carlo solution, and typically the Monte Carlo solution is the one that you want. Why? Because the Monte Carlo solution is defined as minimizing the thing that we actually want: the difference between the return and the value. This is sometimes called the value error or the return error. The gradient of this is unbiased with respect to using the true value instead of the return as well. But one could also argue that we don't even care about a true
value; we just care about having a small error with respect to the random returns. That's also valid. So this seems to be exactly what we want in some sense, and if TD converges somewhere else, this might be undesirable. I think in general it is quite true that Monte Carlo typically converges to a better value than temporal difference learning does when you start using function approximation. However, temporal difference learning often converges faster, and of course we can trade off again: we can pick an intermediate lambda or an intermediate n and get the best of both worlds. You could even consider changing this over time: start with a small n and slowly increase it, and in the limit maybe still come up with the Monte Carlo solution, if you'd like.

So there's a bias-variance trade-off here again, similar to what we had before, but now TD is biased asymptotically as well. In the tabular case this is not so: both Monte Carlo and TD find the same solution. But if we add function approximation, even just linear function approximation, TD remains biased, which means that it finds a different solution in the limit. So if it converges, it converges to this fixed point, and we can characterize it a little further by considering the true value error, which is defined as this difference between the true value and our current estimate, weighted according to this distribution over states, where we use d_pi to denote that this distribution depends on our policy. Monte Carlo minimizes this value error completely, but TD doesn't. We can, however, say something about what TD reaches: the true value error of TD is bounded in the following way. It is smaller than or equal to one divided by one minus gamma, times the value error of Monte Carlo, which is equal to the minimum value error that you can get under any
weights.

So what is this saying? This is basically saying that TD may be doing a little bit worse asymptotically than Monte Carlo. It's doing something different, it's biased, but we can bound how bad it is. Now how bad is that? There's this term one divided by one minus gamma. How big is that? This depends on gamma, obviously, and there's a useful heuristic that we briefly talked about earlier in the course, where you can view this as a horizon. If gamma is 0.9, for instance, then one minus gamma is 0.1, and one divided by 0.1 is 10. So a gamma of 0.9 roughly corresponds to a horizon of 10. That means that if your gamma is 0.9, the Monte Carlo error can be 10 times smaller than the TD error, but it can be at most 10 times smaller, because the bound says that the error TD reaches is smaller than or equal to this thing. In some cases you can have feature representations for which TD happens to find the same solution as Monte Carlo. This is just saying that in the worst case, if for instance your discount factor is 0.9, the TD error could be 10 times larger. That doesn't mean it necessarily is 10 times larger.

Now you may have noticed that I already said a couple of times "if it converges". So there's a slight, but potentially big, problem here. It's a fundamental problem, and a really interesting one, and it's related to the following fact: the temporal difference learning update is not a true gradient update. This is due to the bootstrapping, which is highlighted lightly on the slide here. The bootstrapping uses the same parameters as the ones we're updating, but we ignore that in the update. We basically just plug it in as a proxy for the real return and hope for the best in some sense. I'll show you an example where this can go wrong. Normally this is okay. When I
say this is okay, I mean it's okay that it's not a gradient update. Sometimes people like things to be gradients, because we understand gradients quite well and we know stochastic gradient descent converges, so if we have something that's not a gradient it is sometimes of some concern. It's good to appreciate that there's a broader class of algorithms called stochastic approximation algorithms, which include stochastic gradient descent as a special case, but that's not the only special case that converges. Indeed, temporal difference learning can converge to its fixed point. There are specific conditions under which stochastic approximation algorithms converge. Stochastic gradient descent always converges under fairly light assumptions, such as that the noise is bounded, the step size decays, your data is stationary, and so on. But temporal difference learning does not, and now I'm going to show you an example in which it doesn't.

This is an example of divergence of temporal difference learning. We have two states, and our feature vector has exactly one element. This element is equal to one in the first state and equal to two in the second state. What this means, as depicted above the states, is that our value estimate in the first state is just w, which is a single number because our feature vector only has one element and we're doing linear function approximation. So the value in the first state is w, and the value in the second state is 2w, because it's the same number w times your feature, which is two. The reward along the way is zero. It's a simple problem, but note that it's a partial problem: I didn't say what happens after you reach the state where the feature is two. I'm just going to consider this one transition, and then let's see what
happens if we apply TD. Let's step through that. First we write down the TD equation. This is our temporal difference update for the parameters of the value function, where the value function gradient in the linear case is just our feature. In this case the reward is zero, the value of the subsequent state s' is the value of the state on the right, which is 2w for whatever w currently is, and the value that we're updating, the value of the state that we came from, is x times w where x is 1, so that will just be w. So we fill in these numbers: this zero goes over here, this two goes over there, and the w comes in because we want the value estimates, and then this one goes over here. So there's gamma times 2w minus 1w. We can slightly rewrite this by pulling out the w and getting rid of the zero, which we don't need to consider. Then the update looks as follows: we have our previous weight, plus the learning rate, whatever that is, times two times the discount minus one, times your weight, which is a single number here.

But what does this mean? Let's say your weight is positive, say you initialize it at one, and consider a discount factor that is larger than a half. If the discount factor is larger than a half, this term within the brackets will be positive. If your weight is positive and the term within the brackets is positive, all of this will be positive, and this whole term on the right will be added to the previous weight, which means our new weight will be larger than the previous weight. But the weight was already positive, so we're moving away from zero. Note that the true value function here would be zero, or at least the true value of that first state might
be zero; it's consistent with this transition for it to be zero. You can expand this example to actually have a fixed point of zero. But in this case the weight will just continue to grow, and in the limit it will reach infinity. What's even worse, it tends to grow faster and faster, because your weight keeps going up and this term multiplies with your weight. So this is an example of problematic behavior of temporal difference learning, and now we can dig into why this happens, when it happens, how we can characterize it, and how we can maybe avoid it.

What's happening here is that we're combining multiple things at the same time. We're combining bootstrapping (we're bootstrapping on that state value on the right), off-policy learning, which I'll explain a little bit more, and function approximation. When you combine these three things together you can get these divergent dynamics. This is sometimes called the deadly triad, because you need all three for this to become a problem.

I promised I would say something about the off-policy learning. Why did I say this is off-policy? There are no policies here, there are no actions here. So what's happening? There are certain dynamics, and your dynamics get you from this one state to the other state. But what we're actually doing is resampling that one transition over and over, and we never sample any transitions that come from the state we've just entered. That is off-policy in the sense that, in this case, there are no policies that would sample the states over and over again in specifically that way. So what we have instead is an off-policy distribution of our updates. Why this is called off-policy is that this can emerge if you're trying to predict the value of a certain policy while following a
different policy. For instance, it could be that from this state there are actions, but we're just not interested in their values, and hence they get zeroed out in your update in some sense.

Now we could also consider on-policy updating by extending the example to be a full example, so now we consider this second transition as well. Let's for instance assume that this transition just terminates with a reward of zero, and then maybe you get transitioned back to this first state, so you just see this episode over and over again. But now we sample on-policy, which means we sample the transition going from the second state as often as we sample the transition going to that state. If that is the case, it turns out the learning dynamics in general are convergent: TD learning will converge, also with linear function approximation, if you're updating on-policy. I will show that here just for the example, where we can write out the combined update for simplicity. So let's consider this transition together with that other transition and write down both of those updates, where the second transition bootstraps on the terminal state with value zero and also has a reward of zero along the way, so it basically only retains the value of that state. If we merge these together we get this update to our weights, and note that the term inside here is now negative for any discount factor smaller than or equal to one. That means that if your weight happens to be positive, it gets pushed down, and if your weight happens to be negative, it gets pushed up. That is exactly what we want in this case, because the optimal weight here is zero. That's just in this example, it's not generally the case. We can see that the dynamics are such that if you're above zero you get pushed towards zero, and if you're below zero you also get pushed towards zero. So this seems to be
working, and indeed it has been proven in general that on-policy temporal difference learning with linear function approximation will converge to the optimal values in terms of its fixed point. The fixed point of TD is still different from Monte Carlo, but if we consider that fixed point to be the goal, it will find that fixed point in the on-policy case. As I mentioned, the multiplier here is negative for any discount factor between zero and one.

Okay, so off-policy is a problem, and on-policy fixes it. Let's dig into one of the other aspects. I also mentioned function approximation, so let's see what happens if we don't use function approximation. We can consider a tabular representation, where in each state we do have a feature vector, but it's one-hot, picking out exactly that state. If we do that, this is just regression, and it will also converge. The answer might be suboptimal, but no divergence occurs. In the tabular case TD doesn't actually have a different fixed point; in general it can have one. The point here is that by playing with the representation we can go from divergent dynamics to convergent dynamics, tabular being a special case. And importantly, we can still update off-policy; we don't need to update on-policy. So we can go back to only updating the first state. If we do that, the value of that state will converge to whatever the discounted value of the next state is, and we never update that next value. So this is a different way in which you can get a different answer, because we're considering off-policy learning. Of course this can also interact with function approximation, but even in the tabular case it could be that, because you're learning off-policy, you simply ignore certain states and therefore never update their values, which might have lingering effects in the
tabular values of the other states. That's just something to be a little bit aware of: this can still affect where your solution is, but you won't diverge. You'll still converge to something. So that works too.

We've had off-policy versus on-policy, and we've had function approximation or not. Now let's go to the third one: what if we don't bootstrap, or bootstrap less? I said there were three ingredients: off-policy learning, function approximation, and bootstrapping. One way to bootstrap less is to use multi-step returns, so let's consider a lambda return. We're still considering off-policy updates, so we're only sampling this one transition in some sense. But to say it a bit more precisely: we're only updating this first state, and we never update the second state. Now let's consider taking a lambda return from that state instead of a one-step return. This is again something that can happen in the wild in off-policy learning. We can write that out. We have the lambda return here, which bootstraps slightly on the next state s' and also slightly continues, and it continues to v(s''), which in this case is the terminal state. So s' here is the second state and s'' is the terminal state. Now we can fill in all the numbers that we know. The reward on every step is zero, so r and r' are zero, and the value of the state after two steps will also be zero. We fill in all these numbers and get this expression, which we can then analyze. It's very similar to what we had before, but before lambda was basically zero. In the divergent example we had two gamma minus one. Now we have two gamma times one minus lambda, minus one. So we have a parameter that we can play with: we can play with
this lambda parameter. It turns out we can always make sure that this multiplier is negative, which we need here for convergence, if we pick the lambda parameter in a specific way. What we want is that this whole term is negative, which means that we want this term to be smaller than one. Then we just rearrange, and it turns out the condition is that your lambda is larger than one minus one over two gamma. Just to be clear, this is not a generic finding; this relationship is specific to this example. For instance, we could fix the discount at 0.9. Then it turns out that to ensure convergence you need to pick lambda larger than four over nine. You can figure that out by plugging in 0.9 here and doing the math to find what the trace parameter lambda should be. This is not a hugely large trace parameter. Typically people pick things like 0.9 or 0.8, so 0.45 is not an unreasonable value to have. If you pick it larger than that, you're good to go and you won't diverge.

These are important things to keep in mind, and one thing I want to highlight by going through all of this is that it's not always binary. The deadly triad might make it sound as if combining bootstrapping, function approximation, and off-policy learning means you're always in trouble. That's not actually true. You can combine bootstrapping, function approximation, and off-policy learning as long as you're not bootstrapping too much, you're not too off-policy, and your function approximation is tabular enough in some sense. What turns out to be the case is that people run these algorithms in practice quite a lot, and a lot of deep RL algorithms combine all three of these things. They combine bootstrapping, off-policy learning, and function
approximation, where the function approximation is for instance deep neural networks. You're off-policy for instance because we try to predict many different policies, or because we're trying to learn from past data, or because we're just doing Q-learning at scale, which is also an off-policy algorithm. And the bootstrapping is there for instance because we're doing Q-learning or something similar, but also because of the bias-variance trade-off: we typically find that it's better not to bootstrap too heavily, but also not to use full Monte Carlo returns. So we're combining all of these, but it turns out many of these algorithms are stable and do learn well, because we're still in the triad but not in the deadly portion of it.

An alternative you could consider is to pick a different loss, so let me briefly go into that. Temporal difference learning has a specific update, and I mentioned that temporal difference learning can diverge if you're doing all of these things at the same time. But why do temporal difference learning? Can we find something else? The problem with temporal difference learning, as mentioned, is that it ignores that we're bootstrapping on a value that itself depends on the weights. What was happening in the example just now, going from a state with value w to a state with value 2w, was that we're trying to update this state closer to the value of the next state, but in doing so we're updating the value of the next state further away from us. So you're chasing your own tail, except your tail is running away from you faster than you're running towards it. That's why TD diverges in that case.

So an alternative would be to define something that is a true gradient. One example of this is the Bellman residual gradient algorithm, where we define a loss which is just our squared TD error. This is a different loss. We don't have our
normal value loss here, we don't have a normal value error. We instead say: let's consider the temporal difference error and see if we can minimize that. If you push the gradient through that and calculate it, you find something that looks very similar to TD but has an additional term. Unfortunately this tends to work worse in practice, and one reason for that is that it smooths your value estimates, whereas temporal difference learning in some sense predicts. What do I mean by that? Intuitively, temporal difference learning updates the value of the state towards something that is in the future, towards this one-step estimate consisting of the reward and the value of the next state. This algorithm instead is a fine algorithm, it can be used, and it's convergent inasmuch as stochastic gradient descent is convergent, but it does something that is maybe a little bit weird, in the sense that it updates not just the value of the state that we came from but also the value of the state that we're bootstrapping on. In that sense it's a smoother: it might for instance pull down the value of the state you're bootstrapping on just to make it easier to predict, but that doesn't mean that's a valid value for that state. So if you run these algorithms in practice, oftentimes it just works worse. That's maybe not a necessity, maybe it's not necessarily worse, but it seems that this loss, although we can define a gradient of it that looks somewhat similar to the TD algorithm, is just not the right thing to be minimizing.

Smooth values are not just bad because your value accuracy might be slightly off; they might also lead to suboptimal decisions. This is especially true if your rewards are sparse, because then there might be this one
rewarding event, and immediately after getting the reward your value function should maybe drop, because you've already received that reward. But if you do something like Bellman residual minimization, it tends to smooth over that discontinuity in the value function, and hence the states immediately after taking the reward might be updated to look a little bit as if they could still take some of the reward, which might lead to wrong decisions.

So let's consider a different alternative. Instead of squaring the TD error inside the expectation, let's take the expected TD error and try to minimize that. If we minimize the expected TD error, maybe we're still good to go from a prediction perspective. Let me go back a slide to show the one difference, which is where the square is. In the Bellman residual algorithm the square is inside the expectation. Here, when we consider the Bellman error, the square is outside of the expectation. Again we can take the gradient of this, but there's a catch. The gradient looks very similar to the previous algorithm, but there's this weird s' there: s'_{t+1} is a second, independent sample of s_{t+1}, so it needs to be sampled separately. If you don't sample it separately, you're actually doing the other algorithm that we talked about before, where s_{t+1} is the actual next state that you've seen. So this algorithm can only be applied if you have a simulator that you can reset. You should be able to sample the transition twice, essentially, and then you can apply it. But that's an unrealistic assumption in many cases. If you're a robot in the real world and you're just interacting with the world, you can't sample the same transition twice, so then this algorithm becomes less feasible.

Here's a brief summary of what we were just talking about. There are a couple of different categories: you have your algorithm, which
could be Monte Carlo or TD, you could be on-policy or off-policy, and you could pick a certain function class. In terms of summarizing the convergence properties of these algorithms: if you're on-policy, you're mostly good to go. The only thing that's not 100% guaranteed is non-linear convergence, but even for that one we can say some things, and in some cases you can guarantee that TD converges for the non-linear case. It requires more assumptions, though, so we can't say in general that it always converges. If we go off-policy, we see that linear function approximation also becomes problematic for TD, although again I want to stress that this doesn't mean it's always a problem. It just means there are cases in which we can't guarantee convergence. It doesn't mean the algorithm doesn't work in general, just that there are edge cases in which things can go wrong. This has been addressed in more recent work that is too detailed to go into in this lecture, where people have devised new algorithms that are slightly different from TD, but may be inspired by TD, and that are guaranteed to converge also with function approximation and under off-policy updating. But those algorithms are beyond the scope of this lecture and this course.

Okay, so now let's move towards control, now that I've expressed some uncertainty about convergence but also some evidence that TD typically does converge. We can extend all of these control algorithms to function approximation. The theory of control with function approximation is not that well developed, unfortunately. This is because it's substantially harder than the prediction case, since now the policy is under direct control, it might change over time, and that can be quite hard to reason through. Maybe the one exception will be in a subsequent lecture, when we talk about policy gradients, which are again stochastic gradient algorithms and therefore are in some
sense fine. But the theory of control with function approximation when you use value estimates is much harder. In addition to that, we might not even want to converge during learning, especially when considering value learning, because you might want to continue tracking. For instance, if you're doing value-based control, your policy will be changing, so your predictions shouldn't converge: they want to converge to whatever the current value of the policy is, but if the policy keeps changing you want to track that rather than converge. More generally, even if we're doing direct updates to the policy, it might be preferable to not converge but to continue to adapt. That doesn't mean that convergence guarantees are not interesting or important, because a guarantee of convergence typically means that the algorithm updates stably throughout its lifetime, whereas if you can diverge, you only need to diverge once for everything to fall apart. So understanding when things converge and diverge is still very important, even if you can't fully characterize it.

Okay, now moving on, we're going to consider updating these functions in batch settings. We've talked about batch methods previously, but there's a different reason we're talking about them now. Previously, when we talked about batch reinforcement learning, this was basically to highlight differences between temporal difference learning and Monte Carlo learning, and we weren't proposing it as an approach that you should necessarily follow, although we didn't mention that you could do that. Now we're going to take the other view: what if we really want to learn with batches of data? A reason why you might want to do that is that, while gradient descent is a simple and appealing algorithm, it's not very sample efficient. The
reason why, intuitively, is that every time you see a transition you consume it, make maybe an incremental update to your value function, and then toss the transition away. But there might be a lot that you could have learned from that transition that you can't immediately extract, that doesn't play a role in the update you're doing right now, and that would be a waste in a sense. So instead we could consider a batch approach, where we try to find the best-fitting value function for the set of data that we've seen so far. Instead of doing just one gradient update on every sample, we could try to extract as much information as we possibly can from the training data. There are several ways to do that.

I just want to mention: this may seem at odds with our main goal at the beginning, where the world is large and your lifetime is long. Consider a robot walking through the world; maybe you're thinking that the robot can't store all the data. This is true, and I agree with that viewpoint. Full batch settings are most typically employed when the data is not too huge. But it turns out you can also employ them in practice by, for instance, storing only a limited amount of data. This is also related to model-based approaches, where you can consider storing some of the data to be quite similar to storing an empirical model. Then it's just a choice of what the structure of your model is: is it a parametric model, or a non-parametric model that consists only of data that you've seen? So there are approaches that feel quite similar to these batch approaches and that do scale to really large problems where you couldn't store all the data. But for now, for simplicity and clarity, we're going to focus on the case where we really do consider all the data, and we will see that there are some methods that don't actually need to store all
the data to get the same answer.

We're going to start by considering the linear case again. So we have linear function approximation, and we consider finding the best possible fit of our parameters. We talked about this before in the limit for TD, so let's consider that TD update again. What we said is that if you consider the updates to have converged, so that the expected update is zero, then your fixed point will be this weight vector, which we called w_TD, as defined on the slide. This is the same as we saw earlier.

Now one idea is: what if, instead of taking that expectation over all possible states and all possible transitions that could happen, we just take it over the things that did happen? So at some time step t, instead of taking the full expectation, which we don't know (for this we would need knowledge of the Markov decision process), we take the average of the things that did actually happen. This is very similar, and in the limit, if time grows to infinity and your distribution is stationary, which needs to be the case for this expectation to even exist, then this equation will converge to the equation above by the law of large numbers, assuming of course that the variance of your reward is finite and so on.

So the idea is that, instead of only using this for analysis, we do exactly the same thing on the empirical loss. It turns out this looks as follows, where we see something very similar to the TD fixed point. Let me walk you through the TD fixed point first. We had the expectation over an outer product of two vectors: the first vector is just the feature vector, and the second vector is your feature vector minus the discounted next feature vector. This is an outer product.
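As a minimal sketch of this empirical version of the fixed point, the snippet below replaces the two expectations with sums over observed transitions and solves the resulting linear system. The function name `lstd_weights`, the restriction to two features, the one-hot features, the all-zero features for the terminal state, and the toy episode are all assumptions made for illustration; they are not from the lecture.

```python
def lstd_weights(transitions, gamma):
    """Empirical TD fixed point for two features: solve A w = b."""
    a = [[0.0, 0.0], [0.0, 0.0]]  # sum of x_t (x_t - gamma * x_{t+1})^T
    b = [0.0, 0.0]                # sum of R_{t+1} * x_t
    for x, reward, x_next in transitions:
        diff = [x[j] - gamma * x_next[j] for j in range(2)]
        for i in range(2):
            b[i] += reward * x[i]
            for j in range(2):
                a[i][j] += x[i] * diff[j]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-12:
        raise ValueError("A is not invertible for this data")
    # Closed-form 2x2 solve of A w = b.
    return [
        (a[1][1] * b[0] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]


# Hypothetical episode: state 1 -> state 2 (reward 1), state 2 -> terminal (reward 2).
episode = [
    ([1.0, 0.0], 1.0, [0.0, 1.0]),
    ([0.0, 1.0], 2.0, [0.0, 0.0]),  # assumed: terminal state has all-zero features
]
weights = lstd_weights(episode, gamma=0.5)
print(weights)
```

With one-hot features this is the tabular special case, so the solution should match the values you would compute by hand: the second state is worth 2, and the first state is worth 1 + 0.5 * 2 = 2. The explicit invertibility check mirrors the technical assumption discussed for the fixed point.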
So this is a matrix, and the expectation over it will typically be a full-rank matrix under some mild assumptions, so we can invert it. Then we multiply that with a vector. So we have a matrix, number of features by number of features, times a vector of size number of features, and that gives us a weight vector that also has as many elements as there are features, which fits with linear function approximation of course.

Now here we're going to do exactly the same thing, but instead of expectations we have summations. Note that we're not taking averages, because if we put a one over t here and a one over t over there, they would just cancel out. So we can ignore that and just consider the sum of these outer products and the sum of these vectors, and that is equivalent to taking the average. The summation is over exactly what you would expect, this outer product of feature vectors, but not for the expected feature vectors: for the ones you've actually seen. And then the same thing happens here again. This is called least-squares temporal difference learning (LSTD), and the solution, this weight vector, is the least-squares solution given some data.

So we can put that equation at the top of the slide, where I made one change. On the previous slide I called it w_LSTD, and now I'm just going to call it w_t, because we can change it as we get more data: it's just a data-dependent weight vector. And indeed we could be interested in these predictions at any time step during learning. There might be reasons why you want to extract this prediction, for instance for control, where you want to pick actions, although that's a bit more subtle to combine with least-squares temporal difference learning, because in some sense you don't want to consider all the data then, since it was collected with other, past policies. So for simplicity, now consider the
prediction case. But it could be that you still want to know these value functions as you go: you compute this weight vector, you get some new data, and you want to recompute it. Fortunately, we can do that online. Before we do, let's give these things names. We call the matrix inside the brackets A, so that this whole thing is A inverted. So A_t is just this summation of outer products, and b_t is the summation of the feature vector times the reward. Then the weight vector at every time step is defined as the inverse of A_t times b_t. We can update both of these online. So we're actually storing these two quantities separately, we update A_t separately from b_t, and then recompute w_t whenever we need it.

One way to do that, the naive approach and maybe the most obvious one, is to update the A matrix by adding the outer product you've just observed, and to add the reward times the feature vector to the b vector. The reason I call this the naive approach is that it's fairly expensive. It's not expensive because of the b update. That one is very cheap, because it has the same number of operations as you have features: x is just a feature vector, b is the same size, and we're just adding two vectors together. The update to A is more expensive, because we're adding an outer product of feature vectors, which has size number of features by number of features, to the A matrix, which is the same size. So computing the new A matrix is n squared, where n is the number of features. But that is also not the most expensive operation here. The most expensive operation is that when we want to compute w we need to invert A, and inverting A naively
will give us a cubic cost, because A is some dense matrix, and inverting an n-by-n matrix is cubic in n. That's expensive and we want to avoid it, because ultimately we want to consider large feature vectors. If you only have 10 features, you could apply this approach: updating b would be on the order of 10 operations, updating A on the order of 100, and inverting A on the order of 1,000. That's still okay. But if you have a million features, this gets very wild very quickly, and it's not a very scalable approach.

So we'd like to do something a little bit cheaper. It turns out you can do exactly the same thing more cheaply by updating the inverse of the matrix A instead of the matrix A itself. This is called a Sherman-Morrison update. Sherman-Morrison is a more generic result: take any matrix, add any outer product, and consider the matrix that results. It turns out the inverse of that matrix can be computed in only a quadratic number of operations if you know the inverse of the matrix you had before. We always have that, because we keep updating it incrementally. You can start, for instance, at something simple like the identity, and then just keep on updating like this.

The operations here are, as you can see, a matrix times an outer product times another matrix. Of course you could also first compute the product of the vector with the matrix, turning that into a vector, and the same on the other side, so then you're left with just an outer product. The thing below the division bar is just a scalar, because it's a vector times a matrix times a vector, plus one.
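As a minimal sketch of the two variants just described, the NumPy code below compares the naive update (accumulate A, then solve) with the Sherman-Morrison update of the inverse. The function names, the one-hot toy features and the random trajectory are assumptions made for illustration only; both versions start A at the identity, as suggested in the lecture.

```python
import numpy as np

def lstd_naive(feats, rewards, next_feats, gamma):
    """Accumulate A and b, then solve A w = b (cubic in the number of features)."""
    n = feats.shape[1]
    A = np.eye(n)                      # start at the identity
    b = np.zeros(n)
    for x, r, x_next in zip(feats, rewards, next_feats):
        A += np.outer(x, x - gamma * x_next)   # O(n^2) outer-product update
        b += r * x                             # O(n) vector update
    return np.linalg.solve(A, b)               # O(n^3) solve

def lstd_sherman_morrison(feats, rewards, next_feats, gamma):
    """Maintain the inverse of A directly (quadratic per step)."""
    n = feats.shape[1]
    A_inv = np.eye(n)                  # inverse of the identity
    b = np.zeros(n)
    for x, r, x_next in zip(feats, rewards, next_feats):
        v = x - gamma * x_next
        A_inv_x = A_inv @ x            # matrix times vector, O(n^2)
        v_A_inv = v @ A_inv            # vector times matrix, O(n^2)
        # (A + x v^T)^{-1} = A^{-1} - A^{-1} x v^T A^{-1} / (1 + v^T A^{-1} x)
        A_inv -= np.outer(A_inv_x, v_A_inv) / (1.0 + v @ A_inv_x)
        b += r * x
    return A_inv @ b

# Toy data (hypothetical): a random trajectory over 5 states with one-hot features.
rng = np.random.default_rng(0)
states = rng.integers(0, 5, size=201)
X = np.eye(5)[states]
feats, next_feats = X[:-1], X[1:]
rewards = rng.normal(size=200)

w_naive = lstd_naive(feats, rewards, next_feats, gamma=0.9)
w_sm = lstd_sherman_morrison(feats, rewards, next_feats, gamma=0.9)
```

Both functions should return the same weight vector up to floating-point error; the difference is only in the per-step cost.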
This maybe looks a little bit complicated. There are many proofs online of why Sherman-Morrison works, so feel free to dig into that if you haven't seen it before. The update to b remains the same. We don't need to invert b: it's just a vector, you can't invert that. This way we can compute w, the weight vector at the top of the slide, by incrementally updating the inverse of A rather than A itself.

This is still quadratic in the number of features, and it applies only to linear function approximation, but it is an approach that scales at least to somewhat larger numbers than the cubic approach. It is more compute-heavy than TD, so in large applications it is still avoided, for two reasons. One is that it's limited to the linear case, as I mentioned. The other is that you might want very large feature vectors even in the linear case, and if your feature vectors have size a million, then squaring that is still fairly large, so it might take quite a bit of time on every step. However, it could be that this is feasible for your application, and then it's good to keep in mind. The other reason we talk about it is that it can be an inspiration for other approaches.

In the limit, it's good to appreciate that LSTD and TD converge to the same fixed point. For LSTD this is almost immediate. Let me go back a couple of slides to show that: as I said, by the law of large numbers this thing just becomes that thing, which means the solutions also become the same. So LSTD kind of immediately converges to the same fixed point as TD, under mild assumptions.

We can extend LSTD to multi-step returns as well: we could consider LSTD(λ) instead of just the one-step LSTD we discussed just now. It can also be extended to action values, maybe quite naturally: you can almost always extend anything that's defined for state values to state-action values by
considering Sarsa-like approaches. And then of course we can interlace it with policy improvement. If we do that, we get least-squares policy iteration, where we use LSTDQ to estimate action values in the policy evaluation phase and then greedify. Of course, when we change the policy we need to be a little bit careful: we shouldn't use all the data collected so far to compute our value estimates. One simple approach could be to restart the policy evaluation phase whenever you've made a policy improvement step, throwing away the past values, not trusting them anymore, and recomputing them for the new policy.

Okay, now we're going to move on to what is maybe a more generic approach, but one that is similar to LSTD in the sense that we consider the data collected so far. So we have some experience, a data set consisting of all of our experience up to some time step t. Now, instead of considering all of these transitions at once, as we did with LSTD, we can sample transitions from this data set repeatedly. We sample a transition, enumerated with a time step n that is smaller than or equal to the current time step t, and then we do a gradient update with it. If this data was collected under a certain policy and we're interested in evaluating that policy, this seems like a fine approach. You can also combine it with Q-learning to learn off-policy, and I'll show that in a moment.

The benefit is that you can reuse old data. You could store all the data you've seen so far and keep on updating, which means you can take multiple incremental gradient steps on the same transitions, over and over, and thereby extract more knowledge than you could from a single gradient step. This is also a form of batch learning. It doesn't consider all the data at
once, it doesn't consider the full batch at once, but instead it treats the batch as something to sample from. Of course you have to be a little bit careful if your policy changes. This might be important, because if you're just trying to do prediction, you don't want to use transitions from a policy that has since changed. Alternatively, in a future lecture I'll talk about how to deal with that, how to still use such data but re-weight it appropriately.

Okay, this brings us to the exciting topic of deep reinforcement learning. I will touch upon it only briefly now; there will be much more about this in later lectures. I'm going to start with a very brief recap of neural networks in general. This might be very familiar to many of you already, but it seemed useful to go into it a little bit. So let's go to the note-taking app that we've been calling a blackboard, and let's consider what deep neural networks essentially are, as far as we're concerned. There's a lot to say about this and I won't even be able to scratch the surface, but let me give you the bare minimum that is useful for the purposes of this course.

So what have we been considering so far? Let's start with the linear case, where the value function is defined as an inner product of some weight vector w with some features of the state. This is a linear approach, which has benefits and downsides, as discussed before. What would be a non-linear approach? Well, for instance, we could have some function, let's say f, that takes a number as input and outputs a real number. Let me make it clear that these are both just numbers. And let's assume, without bothering to write it, that if we apply such a function to a vector, it is applied element-wise. So the idea is that we have
some function which is non-linear. Popular examples include squashing functions. Let me draw a couple: if these are the zero lines of a graph, then a squashing function might look like this. Another popular choice these days is what they call a rectified linear unit (ReLU), which has a kink (a discontinuity in its derivative) and looks like this. This one is just defined as f(y) = max(0, y), and the benefit is that it's linear in parts but still has a non-linearity, so you can use it to add non-linearity to your function, and I'll show you how in a minute. The first one is often called a sigmoid, although that's actually the more general term; the one I tried to draw here is a tanh. Those are just specific functions, and the idea is simply that f is some non-linear function.

One way to make the value function non-linear is then just to apply that function, so we could consider something like this, where f is applied to the inner product. Now, this is not particularly useful yet, but it's already non-linear, and if f is differentiable then we can still compute its gradient with respect to the weights. Why is it not very useful immediately? Why would we squash the value between -1 and 1, as we do with the tanh, or say that negative values are not allowed, as we do with a rectifier? So this is not typically what we do. Let me immediately correct that. Instead, we might first apply some matrix to our features or inputs (sorry, we don't need a transpose there), then apply the non-linear transform, and then multiply again. Let me call this weight matrix W_1, and then we have a vector w_2, and this could be our value estimate.

Now note what happens if we want to take the gradient. The gradient with respect to one of the components of w_2 is quite simple. If we consider the partial derivative of
v_w of some state s with respect to element i of w_2, this will simply be the function f applied to the weight matrix times our feature vector, taken at i. Note that the weight matrix multiplied by the feature vector is a vector, f is applied element-wise, and we subscript the result with an i to denote that we're taking the i-th output. There are different ways to denote that; I could also have written it with index notation. So that's very similar to before. As far as the derivative with respect to the weights in w_2 is concerned, these are just features: our features are now defined as some function f applied to some weight matrix W_1 times the feature vector.

The difference is that we can now also consider, as part of our update, the partial derivative with respect to some element of the matrix W_1. Let's consider for instance element i, j; it's a matrix, so we need to index twice to get one element. This can be computed with the chain rule. First we take the derivative of v_w with respect to the intermediate thing, that is, the i-th output of f. Then we take the derivative of that output, but as an intermediate step let's take it first with respect to the input of the function. And finally we take the derivative of that input with respect to the weight.

Let's unpack this a little. The first partial derivative is just the gradient of our value function with respect to the i-th feature, in some sense, so it is just w_2 at i. The next part is just the derivative of our function f, whatever that is. The functions above always have a well-defined derivative. The ReLU doesn't actually have a well-defined derivative at exactly zero, but that's okay, we just define it to
be zero there, so we extend it to have a well-defined derivative at zero. Let me move down again. So this is just the derivative of f evaluated at whatever those input values were. And the last bit, the derivative of the weight matrix times the feature vector with respect to that one element, is simply the feature vector indexed at the matching element, j. These are all just numbers now. Sorry, this was also taken at i. Note what we're considering here; let me use a different color. Here I've indexed with i, which I did here as well. I could also have written this as f applied to row i of the weight matrix times the features, so let me write "i dot" for that row; now it's a vector. That's a different way to write the same thing. So let me pull that index inside, get rid of it over there and put it over here instead, to make it clearer that we're actually indexing the weight matrix.

Note that the things we're left with at the bottom are all just numbers: this is a number, that is a number, that is a number, and we multiply them together to get the one number we needed for our gradient. Then our gradient is just one big vector that merges the contributions for the second weight vector and for the weight matrix. And we can continue stacking this.

I mentioned before that it's useful to go through these steps and apply the chain rule a little bit, but you don't have to worry too much about it in practice these days. Whenever we need a gradient, we use a software package that computes it automatically. So you can make these functions arbitrarily complex. One thing you could do, for instance, is have a value function that is some weight, let's say at layer l, times the features. Sorry, so this is just some weight, which we could also
superscript with l, where l is a layer index, and then this is multiplied by the inputs, let's call those y, at layer l. Then we could say that the inputs at layer l are defined through some weight matrix (the last one was a vector, now we have a matrix) at layer l minus 1, times the features at the previous layer. So we can stack this fairly deeply. Let me just expand it for one specific example and write it down explicitly, because it doesn't hurt to have that in mind. For instance, we could have something like three layers. The more layers you add, the deeper we say your network is, in deep learning terminology. That's because you can draw this function as follows: there's an input x, and then there's a weight matrix here, and then we get, let's give this a name (sorry, I ran out of space there at the end), y_1. Then this could be y_2, and you could do that again for y_3. We can also denote this by showing that there are lots of connections here, since this could be a dense matrix operation. And in the end maybe it goes into a single number, where this is weight three.

So the idea is that you can stack these. People use all sorts of notations for these neural networks, sometimes as simple as something like this, and sometimes people draw more complicated figures that show the dimensions a little better. As you can imagine, you can make these things fairly complex and weird-looking as well. You could have all sorts of structures that merge in strange ways, because maybe you want to put some inductive bias into the structure of your value function. This can be useful in some cases: it is a way to put some more prior knowledge into your function class, if that is helpful. It could also be that you have multiple inputs that come in at multiple places. For instance, you could have vision inputs and auditory inputs, and maybe the vision input has a
certain pre-processing that is necessary, and the auditory inputs need a different pre-processing. So you could have all sorts of inputs: maybe this is some part of your observation, this is a different part of your observation, maybe this is yet another part, and they finally all come together to give you your value estimate. And then all the parameters in your network... actually, let me do that in a different color to make it clear that this is not part of the network. Oh sorry, that didn't actually change my color. It doesn't seem to want to change my color, so let's go like that. So all of those lines there are weights. We just compute the gradient with respect to those weights, and typically these days, as I mentioned, we use an auto-differentiation package to do that. So you can construct these fairly complicated networks without having to compute all of these gradients by hand; the software will do that for us. And it turns out that, if you're using differentiable functions of course, these gradients can always be computed with an amount of compute comparable to that of your forward pass. So if you can do a forward pass through this whole network in some limited amount of time, then computing the gradient takes a time proportional to that. It doesn't take much longer to compute the gradient.

Okay, that's deep learning in a nutshell. I want to say one more thing. If we consider this time step by time step, you could have a state here, which is our agent state now, and then go a time step further to the next agent state. These depend somehow on our observations. Maybe the agent state results from your observation in some complicated way, and maybe these states are used in some complicated way to get our value and maybe to get our
policy, and maybe other things as well. Now, this step, the agent update function... actually, let me put that somewhere else, let me put it here, because the agent update function in some sense merges that arrow and this part. It merges the observation and the previous state into a new state; that's the definition of the agent update function. Here we see how we could structure that in our implementations as a neural network, where there is now a gradient that might flow. Let me use blue to denote the gradient flow. If we want to update the prediction over here, the gradient might flow into the parameters here, into the parameters looking at the observation, and also maybe through time, maybe even further back. There could have been an earlier state, using a prime on s to denote the previous one, and the gradient could flow all the way back through time.

This is nice and potentially beneficial, but it's also potentially problematic, because if you actually want the gradient to flow all the way through time, you need to do something smart about how that gradient flows. That's beyond the scope of this course; it belongs to more advanced deep learning considerations. There are mechanisms that allow you to do it, for instance by truncating the gradient flow at some time step in the past and saying, well, I'm just going to do a couple of steps. That's called backpropagation through time. Backpropagation in general is just the term for computing those gradients going backwards through these non-linear structures.

Okay, so that's a very brief explanation of deep learning. Some of it might be completely known to you already if you've previously taken a deep learning course. Here I just wanted to highlight that this is one way to build these structures and that you can then use them to compute your value functions. I want to mention one more little thing, which is kind of the same as what I said before,
which is what convnets are. The structure we've seen before (let me highlight that again) is a non-linear network. A special case of this is when you have vision inputs, which are structured, sometimes drawn like this, where these are the pixels on the screen. So maybe you have some sort of Pac-Man thing here, maybe trying to eat the ghosts. Of course normally you'd see much more of the Pac-Man screen, but let's do a simple example; maybe there's a dot here that it's eating as well. So these could be the pixels of the screen, and typically you have multiple layers of these, because you could have an RGB input where one layer contains the red, one contains the green, and so on.

What we can do is apply a linear layer to this and then pass it through a non-linearity, exactly as before. You could take all these pixels, put them into one large flat vector, and do the thing we did before, but it turns out that doesn't work that well, so we typically don't do that. Instead, we apply a special linear layer that looks at a little patch, and typically looks at that patch across the channels. These layers are called channels. So it takes this little patch that goes through the channels, turns it into a vector, and applies a weight matrix to give us the next patch, which then gets turned into a spatial patch again, and you could have multiple layers of those. In fact, I'm going to backtrack slightly: what we typically do is not turn it into a patch but into a single number, a little one-by-one pixel. Having stacked layers of those, this is just a vector of channels, and then we compose those again into some larger image.
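The patch operation just described can be sketched in a few lines of NumPy. This is an illustrative toy rather than how a deep learning library implements it: the function name, the 3-by-3 patch size, the ReLU non-linearity and the all-ones inputs are assumptions. Each patch across all input channels is flattened into a vector and multiplied by one weight matrix, which yields one output pixel with several output channels.

```python
import numpy as np

def conv_layer(image, weights, patch=3):
    """Apply one weight matrix to every patch of an image.

    image:   array of shape (height, width, in_channels)
    weights: array of shape (patch * patch * in_channels, out_channels)
    returns: array of shape (height - patch + 1, width - patch + 1, out_channels)
    """
    height, width, _ = image.shape
    out_channels = weights.shape[1]
    out = np.zeros((height - patch + 1, width - patch + 1, out_channels))
    for i in range(height - patch + 1):
        for j in range(width - patch + 1):
            # Flatten the patch across all channels into a single vector.
            patch_vec = image[i:i + patch, j:j + patch, :].reshape(-1)
            # Linear map: one output pixel, with out_channels values.
            out[i, j] = patch_vec @ weights
    # Element-wise non-linearity (ReLU chosen here as an example).
    return np.maximum(0.0, out)

# Hypothetical 6x6 RGB input and a layer with 4 output channels.
image = np.ones((6, 6, 3))
weights = np.ones((3 * 3 * 3, 4))
features = conv_layer(image, weights)
```

With these inputs the output has shape (4, 4, 4): the spatial size shrinks while the number of channels grows from 3 to 4.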
The number of channels here does not have to match the previous one, so we could have more channels here, and the image, in some sense, could be smaller. This is called a convnet. What happens in a convnet is that the same weights get applied to different patches, so also over here. That's called weight sharing in general, and convnets use weight sharing. Essentially, you could consider this to be a kind of filter that gets moved across the image. On every patch in the image the same filter is applied, and it extracts some feature information, which it passes into a new feature layer. Then we get something that's sometimes drawn a little bit like this, where, as mentioned, we could have multiple planes here, and this goes into other planes, whose number could be different; let me just draw three this time, or four, I'm making it up as I go along. This one patch here shows up as a single pixel in the new layer, and an overlapping patch, for instance here, shows up as the pixel just next to it, and so on. Then we can do the same thing again: there could be a new patch here that goes into a layer above. So you can apply this multiple times.

Similar to before, these are just linear layers, and we put non-linearities just before computing this pixel over here. So there is still a non-linearity on top, because otherwise it's a linear network and everything collapses to a linear function approximator, whereas non-linear function approximation can be a richer class. So instead of necessarily being linear in the observations, this can now be non-linear in the observations. You can apply this many, many times, and eventually you can flatten it into a vector and compute your value function on top of that. So that is all that the
convnet is. It's exactly the same, except that the linear layer has a certain structure. One way to think of it is that this matrix here at the beginning could be a conv layer, and that just means that the matrix has a very specific structure in which the same weights show up in multiple places (there's weight sharing), and in addition there is a specific spatial structure, where we're only considering these patches.

Okay, so that's just a brief primer. There's much more to be said about this, and this is kind of a poor explanation because we're only taking a little bit of time to talk about these aspects, so I just wanted to give you a flavor of it. If you don't fully understand it, there's loads of introductory material on deep learning out in the wild, there are courses on it, and there's lots of stuff on the internet, so if you want to know more I encourage you to dig into that. If you first want to just continue with the rest of this course without fully understanding all of it, that's also perfectly okay.

So, using these networks in reinforcement learning is called deep reinforcement learning. Fortunately, many of the ideas immediately transfer when using deep neural networks. We can do temporal difference learning, we can do Monte Carlo; we just replace the feature vector with the gradient in our updates. We can apply double learning if we need it, for instance double DQN or double Q-learning. We can use experience replay. All sorts of other intuitions and results do transfer.

Some things don't. For instance, UCB is hard to transfer. As a reminder, UCB is an exploration method that we discussed in the second lecture, and it requires counts: you need to know how often you've seen a certain state or state-action pair. It turns out it's very hard to count with deep neural networks, at least in the way we need for UCB, because of generalization. Deep neural networks tend to generalize really
well, and this makes them learn really quickly. People are still trying to understand the exact properties of these networks, because it also depends on which optimization mechanism you use: these days people typically don't use vanilla stochastic gradient descent, but apply gradient transformations on top of it. Whether you do that or not, it turns out it's just very hard to count states with these neural networks in reinforcement learning, and therefore it's hard to apply a bound like UCB does. People have considered extensions that do seem somewhat reasonable, but it's still active research.

Least-squares methods also don't really transfer that well, because they're made specifically for the linear case, where you can compute the solutions analytically. In the generic case, for non-linear functions, it's really hard to compute those solutions. Because there are non-linearities involved, we can't just write down what the optimal weight vector is for a specific data set. Instead, what people typically do is use gradient methods to find it incrementally. So instead of least squares, people use experience replay.

Let me now walk you through an example of an online neural Q-learning agent that combines these approaches. It might include a neural network that takes your observations and outputs, say, action values q. The idea is that this is just a non-linear function that looks at the pixels of the screen, let's say of Pac-Man, has maybe some conv layers to first do the spatial reasoning, then merges that into a vector, maybe has a couple of other layers, and at the end outputs the same number of elements as you have actions. We then index that with the action we're interested in if we want a specific action value. Note that I have ignored agent state here, so the agent state is basically
just your observation in this case. So this one wouldn't have memory, and maybe that's okay for some games like Pac-Man. Then we need an exploratory policy. For instance, we just do epsilon-greedy on top of the action values we have, so that the action we actually select is greedy with respect to these action values with probability one minus epsilon and fully random with probability epsilon.

And then we have a weight update for the parameters of this network, where in this case we use gradient Q-learning. You might recall the Q-learning update, but let me briefly mention it again: it's your immediate reward, plus the discounted value of the maximum-value action in the next state, minus our current estimate, times the gradient of that current estimate. That would be a gradient extension of Q-learning. I did mention a couple of downsides, like overestimation, and you could plug in solutions for those here as well; this is just to give you a simple example.

Then we could use an optimizer. Typically, people take this update according to SGD, this gradient (it's actually a semi-gradient, because the weights also show up in the bootstrap target, where they are ignored), and then transform it, for instance by applying momentum, so that you take a little bit of the previous update and merge that in, or by using fancier optimizers like RMSProp and Adam. Those are quite popular these days, and it turns out this often works better when using deep neural networks. The details are not too important for us right now.

Since, as I said, we often use these auto-differentiation packages these days, what you'll quite often see in code is that people implement this weight update via a loss. I'm mentioning this so that you're aware of it, because it makes it easier to understand other people's code. But in some sense it's a little bit of a strange thing, because there's this
really strange operator here, which is called a stop gradient in most software frameworks. Basically what people do then is say: let's just write it down as if it's a loss, but let's ignore the gradient flowing into this part. If you take the gradient of this thing but ignore the gradient going into that part, you get the update that we had above. This kind of makes it explicit that this is not a real gradient algorithm: we actually have to manually specify, take the gradient, but don't push the gradient into that part. If you do push the gradient into that part, you're doing one of these Bellman residual minimization algorithms instead, and as I mentioned, that tends to work slightly worse in practice. You could play with this, try it out, and see if the same holds for you. So people implement this as a loss, so that they can say: just minimize that, just apply stochastic gradient descent to that.

Alternatively, of course, you could just compute this temporal difference error, this Q-learning temporal difference error here, and then just compute the gradient of the action value. Then you don't need that stop-gradient operation, and you could also implement it like that. What we're basically doing there is akin to doing one step of the chain rule by hand, a very simple first step, to construct the update directly instead of first formulating it as some sort of loss.

It's not a real loss, because of the stop gradients. I'm hesitant to call it a real loss, although people do call these things losses even if there are stop gradients. I'm hesitant because we're not actually following the gradient with respect to the parameters if there are stop gradients involved. So this just happens to have the right gradient if you compute it with an auto-differentiation
package. That's just for awareness, essentially, if you want to implement this.

And now we can finally move to the DQN algorithm. I showed examples of this all the way at the beginning of the course, where we saw this agent playing Atari games, and now we have finally accumulated enough knowledge that we could actually replicate that. So I'm going to explain what is in that algorithm, and it's very similar to what we saw just now.

There's a neural network, exactly as I said just now. The observation here is a little bit special: the observation in DQN is not just the pixels on the screen, it's actually taking a couple of frames, a couple of steps into the past, and also doing some operations on that. The original DQN algorithm would downscale the screen slightly for computational reasons, and it would grayscale it instead of keeping the color information. This actually throws away a little bit of information, and in some recent approaches people don't do that anymore. In addition to that, it would stack a couple of frames, so it has some slight history.

There are two reasons for that. One is that, for instance, in a game like Pong, where you're hitting a ball from one paddle to the other, it was thought that it may be important to be able to capture the direction of the ball, because if you just know where the ball is, you don't necessarily know whether it's going left or right. In the game of Pong that might not actually be that consequential; it might not be necessary to know that for picking the right action. But in other games you could imagine that sometimes that information is quite useful. There's a different reason as well, which is that in these Atari games, in the simulator at least, and maybe in the original games as well, I don't actually know, the screen would sometimes flicker. Certain enemies might be there on one frame but not on the next. So it was important to consider a couple of
frames rather than just one frame, because the network otherwise doesn't have any memory. A different way to say that is that the inputs here are not the raw observations: there's a very simple handcrafted agent update function, which looks at the observations, which could be considered raw frames, and then stacks them and processes them appropriately to give you an agent state. But there's still no long-term memory.

Of course there's a need for an exploration policy, and in DQN this was literally taken to be epsilon-greedy with a small epsilon like 0.05 or 0.01.

And there was a replay buffer, where you could store and sample past transitions. This replay buffer didn't store all of the transitions; instead it stored something like the last 1 million transitions. So when you go beyond 1 million transitions, it would start throwing them away.

In addition, there's a subtle other thing, which was called a target network, and this was one of the innovations that made DQN work more effectively. We're going to define a new weight vector that contains all of the weights of a network of exactly the same shape and size as the neural network that defines our Q-values, and we're going to use that to bootstrap. So the weight update function is exactly the same as in the online Q-learning algorithm that we saw on the previous slide, except for one change: there's a w minus here. W minus is just a name; the minus doesn't mean it's negative or anything. It's just a name to say that this is a different weight vector, different from the one that we're updating, w.

So if we're not updating w minus, how does that one update? Well, it is just updated by occasionally copying in the online weight vector. What this means is that during learning you're basically regressing on a different value function for a while, keeping it fixed for, say, 10,000 steps, and then you copy in the online
weights, so that occasionally it does get better over time. This makes the learning target a little bit more stable, and it was helpful for learning in DQN. This might sound a little bit familiar from when we talked about double learning, for double Q-learning. This is not quite double Q-learning yet, because it is still using the same parameters to evaluate the action and to pick the action; it's just using this w minus. Maybe something for you to think about: how would you then implement double learning in this framework? What would that look like?

And then there was an optimizer applied to actually minimize the loss. In the original DQN this was RMSProp, so this update would be tossed into the RMSProp optimizer, which also stores some statistics of past gradients and then computes the actual update by taking all of that into account. These days people still typically use RMSProp or Adam, or variations of these.

The replay and the target networks are designed to make the reinforcement learning look a little bit more like supervised learning. The reasoning behind that was essentially that if you make it look more like supervised learning, it is more likely to work, because we already know that deep learning works quite well for regression and classification on supervised data sets. The replay kind of scrambles the data, so that the sampling looks more i.i.d. than if you would just give all of the samples sequentially, and the target network makes the regression target less non-stationary on every single time step, which might help the learning dynamics. Neither of those is strictly necessary for good learning, but in the DQN setup they did help performance, and they are still sometimes used in modern algorithms as well, sometimes to much effect.

Okay, so that brings us basically to the end of this lecture, where these last two things are kind of an interesting, intriguing insight into what
makes deep reinforcement learning an interesting field. Making use of the replay and the target networks to make the reinforcement learning look more like supervised learning was done, of course, because we're combining the deep learning part in. So in some sense you could say this is deep-learning-aware reinforcement learning: we're changing the reinforcement learning update a little bit with these target networks, and we're using experience replay deliberately, because this makes better use of the deep networks that we're using.

The converse also exists. There are also things that could be called reinforcement-learning-aware deep learning, where people sometimes change, for instance, the network architecture or the optimizer in certain ways to play better with the fact that we're doing reinforcement learning. The fact that there are important design considerations that could be called deep-learning-aware reinforcement learning or reinforcement-learning-aware deep learning is exactly what makes deep reinforcement learning a field of its own, in some sense a subfield of its own, with really interesting research questions that are not pure reinforcement learning questions and not pure deep learning questions, but are really at the intersection of these two really exciting fields.

Okay, that brings us to the end of this lecture. As always, please ask questions on Moodle if you have any, and in our live Q&A. Thanks for your attention.
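
As a rough companion to the lecture, the pieces described above (epsilon-greedy exploration, a bounded replay buffer, a separate target weight vector w minus that is only occasionally overwritten, and the Q-learning update built with one step of the chain rule by hand) could fit together roughly as in the sketch below. This is not the original DQN code. It assumes a linear Q-function on one-hot features as a stand-in for the conv net, plain SGD instead of RMSProp, and a made-up two-state toy problem; all function names and hyperparameter values are hypothetical.

```python
import random
from collections import deque

NUM_STATES = 2   # hypothetical toy problem, not from the lecture
NUM_ACTIONS = 2


def one_hot(state):
    """Agent state: a one-hot feature vector for the toy state index."""
    return tuple(1.0 if i == state else 0.0 for i in range(NUM_STATES))


def q_values(w, x):
    """Linear stand-in for the Q-network: one weight row per action."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def epsilon_greedy(w, x, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(NUM_ACTIONS)
    q = q_values(w, x)
    return max(range(NUM_ACTIONS), key=lambda a: q[a])


def dqn_update(w, w_minus, batch, gamma, step_size):
    """One semi-gradient Q-learning step, averaged over a replay minibatch."""
    grads = [[0.0] * len(row) for row in w]
    for x, a, r, x_next in batch:
        # Bootstrap from the target weights w_minus. The target is treated as
        # a constant, which is what a stop-gradient would enforce in a loss.
        target = r + gamma * max(q_values(w_minus, x_next))
        td_error = target - q_values(w, x)[a]
        # Chain rule by hand: for a linear q, d q(x, a) / d w[a] = x.
        for i, xi in enumerate(x):
            grads[a][i] += td_error * xi / len(batch)
    return [[wi + step_size * gi for wi, gi in zip(row, grow)]
            for row, grow in zip(w, grads)]


def toy_step(state, action):
    """Toy dynamics: the action picks the next state; action 1 pays reward 1."""
    return action, (1.0 if action == 1 else 0.0)


def train(num_steps=2000, seed=0, gamma=0.9, step_size=0.1, epsilon=0.05,
          buffer_size=500, batch_size=8, target_period=100):
    rng = random.Random(seed)
    w = [[0.0] * NUM_STATES for _ in range(NUM_ACTIONS)]
    w_minus = [row[:] for row in w]          # target weights, same shape as w
    replay = deque(maxlen=buffer_size)       # oldest transitions fall out
    state = 0
    for t in range(num_steps):
        x = one_hot(state)
        action = epsilon_greedy(w, x, epsilon, rng)
        next_state, reward = toy_step(state, action)
        replay.append((x, action, reward, one_hot(next_state)))
        batch = rng.sample(list(replay), min(batch_size, len(replay)))
        w = dqn_update(w, w_minus, batch, gamma, step_size)
        if (t + 1) % target_period == 0:
            w_minus = [row[:] for row in w]  # occasionally copy online weights
        state = next_state
    return w


if __name__ == "__main__":
    print(train())
```

Calling `train()` returns the learned weight rows, one per action. Writing the update as TD error times the gradient of the selected action value avoids the need for a stop-gradient operator; in an auto-differentiation framework the same step would typically be written as a squared-error loss with a stop gradient around the bootstrap target.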