The Immediate Update Algorithm: A Comprehensive Approach to Temporal Difference Learning
One of the key concepts in reinforcement learning is temporal difference (TD) learning, which updates a value function based on the difference between its current estimate and a bootstrapped target built from the observed reward and the estimated value of the next state. In the end-of-episode (forward-view) formulations of multi-step learning, and in Monte Carlo learning, the updates are only applied once the episode terminates, which forces the agent to store the weight updates computed at every step along the way. This is memory-intensive, can be computationally expensive, and may not be feasible for long episodes or continuing tasks.
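To make the basic update concrete, here is a minimal sketch of one-step TD learning (TD(0)) for state values; it assumes a tabular value function stored in a Python dictionary, and the function and parameter names are illustrative rather than taken from any particular library.

```python
def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """One-step TD update: move v[s] toward the bootstrapped target r + gamma * v[s_next].

    v         : dict mapping states to value estimates (tabular case)
    s, s_next : current and next state
    r         : reward observed on the transition
    terminal  : if True, the value of the next state is taken to be 0
    """
    target = r + (0.0 if terminal else gamma * v.get(s_next, 0.0))
    td_error = target - v.get(s, 0.0)           # the TD error, delta_t
    v[s] = v.get(s, 0.0) + alpha * td_error     # applied immediately, online
    return td_error
```

Because the target only needs the next reward and the current estimate of the next state's value, this update can be applied after every single transition rather than at the end of the episode.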
To address this issue, an alternative formulation enables online updating during the episode. The idea is to group together the states visited before the current time step and to apply the current temporal difference error to a decaying weighted sum of their feature vectors. The vector that holds this weighted sum is called an accumulating eligibility trace, and the resulting algorithm updates the weights incrementally at every step without growing memory requirements.
The accumulating trace rests on the observation that the same temporal difference error appears in the Monte Carlo errors of every state that occurred before the current time step: the Monte Carlo error for a state decomposes into a discounted sum of the one-step TD errors that follow it. By collecting the feature vectors of those earlier states into a single trace and applying each new TD error to that trace, we obtain an update that captures the same information while processing each step only once. This is more efficient than the end-of-episode approach, because nothing needs to be stored beyond the trace vector itself.
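The decomposition behind this claim follows directly from the definitions of the return and the TD error; the derivation below is a standard identity, stated under the simplifying assumption that the value estimates are held fixed during the episode.

```latex
% Monte Carlo error as a (discounted) sum of the TD errors that follow it
\begin{aligned}
G_t - v(S_t) &= R_{t+1} + \gamma G_{t+1} - v(S_t) \\
             &= \underbrace{R_{t+1} + \gamma v(S_{t+1}) - v(S_t)}_{\delta_t}
                + \gamma \bigl( G_{t+1} - v(S_{t+1}) \bigr) \\
             &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \dots
              = \sum_{k=t}^{T-1} \gamma^{\,k-t} \delta_k .
\end{aligned}
```

Each TD error δ_k therefore contributes to the Monte Carlo errors of all states visited at or before step k, which is exactly the regrouping that the eligibility trace exploits.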
To illustrate this, consider a simple episode of four time steps. At each step t we observe a reward and compute the temporal difference error δ_t as that reward plus the discounted value estimate of the next state, minus the value estimate of the current state. In the end-of-episode formulation, each state's update is its full multi-step error times its feature vector, so we accumulate a series of per-state updates that can only be applied once the episode has finished; a worked version of this regrouping is shown below.
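Here is that four-step episode written out, with v_t short for v(S_t) and x_t for the feature vector of S_t, and assuming γ = 1 and full (λ = 1) Monte Carlo errors purely to keep the algebra short. The first line is the end-of-episode update; the last line is the same total regrouped by TD error.

```latex
% Total end-of-episode update for a four-step episode (gamma = lambda = 1)
\begin{aligned}
\Delta w &= \alpha (G_0 - v_0)\,x_0 + \alpha (G_1 - v_1)\,x_1
            + \alpha (G_2 - v_2)\,x_2 + \alpha (G_3 - v_3)\,x_3 \\
         &= \alpha (\delta_0 + \delta_1 + \delta_2 + \delta_3)\,x_0
            + \alpha (\delta_1 + \delta_2 + \delta_3)\,x_1
            + \alpha (\delta_2 + \delta_3)\,x_2
            + \alpha\,\delta_3\,x_3 \\
         &= \alpha\,\delta_0\,x_0
            + \alpha\,\delta_1 (x_0 + x_1)
            + \alpha\,\delta_2 (x_0 + x_1 + x_2)
            + \alpha\,\delta_3 (x_0 + x_1 + x_2 + x_3) .
\end{aligned}
```

The partial sums x_0 + x_1 + ... + x_t are exactly the eligibility trace at step t, so the last line can be computed incrementally: each δ_t is applied once, to the current trace, as soon as it becomes available.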
However, instead of storing all these per-step updates in memory, we can merge the feature vectors of past states into a single vector, the eligibility trace, which is decayed and accumulated as the episode proceeds. Each new TD error is then applied to the trace immediately, so the estimates are updated incrementally during the episode with constant memory. The resulting method is TD(λ) with accumulating traces.
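A minimal sketch of this backward-view update with linear function approximation is shown below; it assumes NumPy, one fixed feature vector per state, and illustrative parameter names (alpha for the step size, lambd for the trace-decay parameter).

```python
import numpy as np

def td_lambda_episode(features, rewards, w, alpha=0.1, gamma=0.99, lambd=0.9):
    """Accumulating-trace TD(lambda) over one episode, with linear values v(s) = w . x(s).

    features : list of feature vectors x_0, ..., x_T (x_T belongs to the terminal state)
    rewards  : list of rewards r_1, ..., r_T received on the T transitions
    w        : weight vector (NumPy float array), updated in place and returned
    """
    e = np.zeros_like(w)                            # eligibility trace, reset at the start of the episode
    for t in range(len(rewards)):
        x = features[t]
        v = w @ x
        # the terminal state has value 0 by definition
        v_next = 0.0 if t == len(rewards) - 1 else w @ features[t + 1]
        delta = rewards[t] + gamma * v_next - v     # TD error
        e = gamma * lambd * e + x                   # decay the trace, then accumulate the current features
        w += alpha * delta * e                      # apply the error to all past states at once
    return w
```

With lambd = 0 the trace reduces to the current feature vector and the update is exactly one-step TD; with lambd = 1 and updates accumulated until the end of the episode it reproduces the Monte Carlo update.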
Closely related are mixed multi-step returns, which interpolate between temporal difference and Monte Carlo targets. By weighting n-step returns with a trace-decay parameter λ (distinct from the discount factor γ), we obtain the λ-return, a weighted sum that combines the low variance of one-step bootstrapping with the low bias of full Monte Carlo returns. This often yields more accurate learning than either extreme, because it balances information from immediate rewards and current estimates against information from delayed rewards.
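For reference, the λ-return can be written as a geometrically weighted mixture of n-step returns; this is the standard forward-view definition, and in episodic problems the tail of the sum collapses onto the full Monte Carlo return once the episode terminates.

```latex
% n-step return and lambda-return (forward view)
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} v(S_{t+n}),
\qquad
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)} .
```

Setting λ = 0 recovers the one-step TD target and λ → 1 recovers the Monte Carlo return; the accumulating trace described above is the mechanism that approximates this forward-view target online.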
One of the key benefits of accumulating-trace algorithms is that they update online during the episode, which can speed up learning, particularly when episodes are long or never terminate at all. By applying updates incrementally as experience arrives, we keep memory requirements constant, and the same machinery carries over to control, where the policy is improved through trial and error.
While the accumulating trace algorithm has several advantages over end-of-episode approaches, there are also limitations to consider. One key challenge is that it requires careful tuning of hyperparameters, such as the step size and the trace-decay parameter λ, which can significantly affect both the speed and the stability of learning.
In addition, these algorithms are not suitable for every problem: their benefit depends on the quality of the value estimates being bootstrapped on and on the features used to represent states. For many applications, however, accumulating traces offer a promising approach to temporal difference learning, enabling online updating during the episode and incremental learning without growing memory requirements.
Overall, the accumulating trace algorithm represents an important advance in reinforcement learning, enabling more efficient learning in complex environments. By incorporating these ideas into control and optimization problems, we can learn more accurate and robust policies that adapt to changing conditions and improve performance over time.
"WEBVTTKind: captionsLanguage: enhi and welcome to this fifth lecture on this course in on reinforcement learning my name is adavan hasselt and today i will be talking to you about model free prediction there's a lot of background material on this slide and this is a fairly long lecture you don't have to worry about reading all of this at once the most important chapters for this lecture are chapters five and six and we will cover some material from these other chapters as well but some of that will be shared with the subsequent lectures so this is actually background material for a couple of lectures in a row we will just not exactly go through these in exactly the same sequence as the book does this is why we list a fairly large chunk of background material here feel free to defer some of that reading until later in fact it might help the understanding to go through the material not all at once but to revisit it later also don't forget to pause during the lecture i mean sometimes i will ask you a question ask you to think about something and of course that's a good occasion to actually pause for a second and actually reflect maybe write some stuff down but also like i said this is a fairly long lecture so feel free to make use of the fact that this is a recording and therefore you can pause and maybe even take a break or maybe even consume the lecture over more than one day if that's uh works for you i do encourage you to not wait too long between looking at different parts of the lecture in order not to forget the beginning when you get to the end first i'm just going to recap where we are we're talking about reinforcement learning which we defined as being the science of learning how to make decisions in which there's an agent interacting with an environment and the agent takes actions and the environment will be observed by the agent and then internally in the agent we will have a policy a value function and or a model where in any case we should have a policy because the agent should pick its actions somehow and the general problem involves taking into account time and consequences because these actions can change not just the immediate reward but also the agent's state and also the environment state which means that subsequent rewards might be affected by actions that you've taken that you've taken earlier to recap a little bit where we are in the course right now in these last two lectures we've seen planning by dynamic programming diana told you a lot about this which is all about using computation to solve a known problem so we have a markov decision process or a model if you want to call it that and then dynamic programming is a mechanism to be able to infer accurate predictions or optimal policies for such a problem this and in the subsequent lectures we're basically going to relax this assumption that you have access to this true model and instead we're going to use sampling so we're going to use interaction with the world and we call that model-free and at first we'll talk about model-free prediction in this lecture which is the process to estimate values when you do not have the markov decision process you don't know what it is but you can interact with it this of course is the case when you're for instance in the real world you could imagine that the world maybe has some sort of a really large markov decision process underneath it but you don't have immediate access to that so instead all that you can do is interact for after model 3 prediction we can talk about model 
frequent control which is the process to optimize values rather than estimating them so please keep in mind that this lecture is about estimating so we're not going to talk about policies much and then we will also talk a little bit about function approximation and some deep reinforcement learning in this lecture and then a little bit more in subsequent lectures and especially deep reinforcement learning will be deferred quite a bit we'll briefly touch upon it finally we will also talk in these upcoming lectures on about off policy learning which is also a prediction task but this is this term refers to making predictions about a policy different from the one that you're following more on that to follow later also in later lectures we will talk about model-based learning and planning policy gradients and ectocritic systems and of course more deep reinforcement learning finally we will cover some advanced topics and current research but only much later okay so let's get started our first topic will be monte carlo algorithms and i'll explain in a moment what that means so the point here is to use sampling we're going to interact with the world and this will allow us to learn without a model if we're sampling complete episodes in reinforcement learning we call this monte carlo so that's a specific usage of the term monte carlo sampling is also used to refer to other things in machine learning in general in reinforcement learning when people say monte carlo they typically mean sample complete episodes an episode is a trajectory of experience which has some sort of a natural ending point beyond which you're not trying to predict any further we'll see examples of that in a moment this is a model free approach because you don't need any knowledge of the markov decision process you only need interaction or samples to make that concrete let's start with a simple example that we've actually seen before in lecture two the multi arm bandit so in the multi-arm bandit we have a bunch of actions and we're trying to in this case just estimate the action values in lecture two we talked about optimizing these action values by picking a smart exploration policy for now we're just talking about model-free predictions so for now we're only interested in estimating these action values so the true action value here is given on the right hand side which is the expected reward given an action and then the estimates at some time step t is written somewhat verbosely here but it's basically simply the average of the rewards given that you've taken that action on the subsequent time steps and we can also update this incrementally we also briefly discussed this in lecture two where you have some sort of a step size parameter alpha and you add to the action value estimate that you have right now so qt of 80 you add the step size parameter times an error term and this error term is simply the reward that you've just observed minus our current estimate all the other action values stay unchanged and then if you pick this step size parameters to be exactly one over the number of times you've selected that action then this is exactly equivalent to the flat average that is depicted above you may have noticed that there's a slight change in notation we've now moved to the notation that is more common when we talk about sequential decision processes so markov decision processes and such where we typically increment the time step immediately after taking the action in the banded literature it's more conventional to denote the 
reward as arriving at the same time step as taking the action but in reinforcement learning in a more general case we typically increment the time step immediately after the action which basically means we interpret the reward to arrive at the same time as your next observation that's just a small notational note to avoid confusion between this lecture and the earlier lecture on bandits now we're going to extend slightly to make this more general and we're going to consider bandits with states for now the episodes will still remain to be one step long as in the bandit case before and this means that actions do not affect the state transitions which means that if you take an action you will receive some reward and then you will see a new state but this this new state actually doesn't depend on your action so there are now potentially multiple different states so that's a difference from before but they don't depend on your actions and that means that there's no long-term consequences to take into account and then the goal is to estimate the expected reward condition not just on action but also on the state so this is in some sense a slight extension from the normal banded case that we discussed before and these are called contextual bandits in the literature where the state is then often called a context so state and context in that sense are interchangeable terms now we're going to do basically make an orthogonal step in this lecture and we're going to talk about function approximation a little bit and then we're going to return back to these bandits with states and talk about how these things are related before we then go to the sequential case so we're talking about value function approximation to be more precise so far we've mostly considered lookup tables where every state has its own entry or every state action pair has its own entry so think of this as a big table stored on your uh in the robot's brain or in the agent's brain and for every state and action you might see you just have a separate entry that you might update but of course this comes with some problems because there might be way too many states and their actions to actually store this effectively in memory but even if you could store this in memory that might not be the best idea because it might be way too slow to learn the value of each state if you're going to do that completely independently and individually in addition individual states are actually not often fully observable if you talk about these environment states at least so so far i've just set state i didn't actually say specifically what i meant with state i'll talk about that more in a moment but in the simplest case you could consider that the food the environment state is fully observable so that the observation maybe is the environment state and then you could also use that same state as your agent state in that case the aging state the observation and the environment say it could all be the same but even if that's the case then it could be very large but of course often it's also not the case now you could still have a finite agent state and maybe store these separately but then you still suffer from this other problem that might be very slow if you have many different agent states so our solution for those problems is to use function approximation so we write vw or qw where w is now some sort of a parameter vector and we're going to update these parameters instead of updating cells in this in a big table so the parameter vector w will be updated using 
monte carlo or temporary difference learning which are two algorithms that we're going to talk about in this lecture and the idea would be that we can hopefully then generalize to unseen states because this function should be defined everywhere and the idea is that if you pick a suitable function then if you update a certain state value in some sense you update your parameters that are associated to that state that then the values of similar states would also automatically be updated for instance you could have you would be in this state you could identify that states by looking at certain characteristics certain features and then you could update the values of those features and then maybe if you reach a different state which is different in some ways but it shares a couple of those features maybe we can make a meaningful generalization and learn something about its value before even visiting it now here's a note first on states what do i mean with states so we're not going to necessarily assume the environment state is fully observable so i'm going to just going to recall that there's an agent state update function which takes the previous agent state st minus 1 the action at minus 1 and your current observation ot where the reward could be considered part of that observation or t or we could also spell that out explicitly and we can write it as an input to this agent update function and then our subsequent agent states st is just a function of these previous inputs we talked about this in the first lecture you might recall so henceforth we will use stu whenever we write s rest to denote the agent state you can think of this as either some just a bunch of numbers inside your agent a vector or in the simplest case it could also simply be the observation and indeed it could be as i mentioned before that the environment states if the environment is fully observable is is observable in every step so that the aging state could be equal to the environment state but that's a special assumption that won't be the case all the time so we're just going to talk about agent states whenever we say state for now we're not going to talk about how to learn this agency the update function as you can imagine this can be quite important to have a good agency update function sometimes you can hand construct one sometimes maybe you're much better off if you can learn this we will cover that in later lectures but for now we're just going to set that aside and we're just going to assume we have some sort of an agent update function if you think that's easier to understand the algorithms feel free to consider state whenever you see this in one of the equations just to be the observation for simplicity okay now we're going to talk about linear function approximation there's a couple of reasons for this first this makes things a little bit more concrete if we can talk about a specific function class in addition there's things later that we can say for the linear case that we can't actually completely say for the non-linear case it's easier to analyze theoretically we won't do that yet for now but we will do that later and it's good to have this special case in mind so it's a useful useful special case in which we have a linear function and basically what we're going to assume is we're going to have some fixed feature vector so note we already assumed that we have some sort of a fixed agent state update function so we're going to set that aside where the states come from and in fact now we're even going to set 
aside states themselves and we're going to just say well in addition to that we have some sort of a feature mapping that turns the state into a bunch of numbers so it's just a vector with a bunch of elements and we're going to consider that for now to be fixed later we might consider learning it but for now it's just a fixed mapping we're also introducing a little bit of shorthand where we simply write x t whenever we mean the features of state at time step t please keep that in mind for instance features could include the distance of a robot from different landmarks or maybe some trends in the stock market or maybe the ps and pawn configurations in chess you can come up with these features by hand sometimes or later we will discuss ways to find them automatically then the linear function approximation approach takes these features and simply defines our value function to be the inner product or dot product between our parameter vector w and the features at the time step x of the state s that we see um probably unnecessary but um the slide is also reminding you what the inner product looks like it's just a sum over the components and it's multiplying each of the features with the associated weight now we can talk about how to update those weights and for that we have to pick some sort of an objective in this lecture we're talking about predictions our objective will be to minimize this loss which defines a squared distance between our current estimates in this case according to this linear function and the true value function v pi obviously we don't have v pi so we're going to replace this with things that we can use but for now keep this objective in mind if we could then compute stochastic gradients for this objective this would converge to a global optimum of this loss function because this loss function is as they say convex and there's only one optimal solution which will so this uniquely defines the optimal uh parameter vector w that does not mean that we can reduce this loss all the way to zero it could be that the features are not good enough to be able to accurately predict the value for every state if we do stochastic gradient descent the update rule is very simple we first note that the gradient of our value function with respect to our parameters w are is simply the features we see here on the left hand side so at time step t if we see state st the gradient of our value function on that state will simply be xt and then our stochastic gradient update if we would have the true value function v pi would simply be to add this term which is the step size parameter alpha times the error term the prediction error times the feature vector and we can use this to update this parameter vector w on every step but of course we have to replace v pi with something that we can have because we don't have e pi i'll get to that in a moment first i want to say that the table lookup case that we've considered for instance for the banded lecture earlier is a special case we can enumerate all of the states of course in order to store these in a in a big table you need a finite amount of state of of states otherwise you can't store them separately and then we could consider the feature vectors to simply be a one hot feature that has zeros on almost all of the components except on the component that corresponds exactly to the state that we see so that means we have exactly as many states as we have sorry exactly as many feature components as we have states and then we note that this means that the 
value function estimates under the linear function approximation would then simply pick out the weight that corresponds to that state so that means that the weight for that state will essentially be your value estimate for that state okay now we're going to go back to the reinforcement running case so that was kind of like more generic about function approximation and we're going to go back to these monte cardo algorithms basically continuing from before so note that we were dealing with bandits with states and now we're basically going to make q a parametric function for instance a neural network or a linear function and we can going to use this squared loss which we now also multiply by a half that's just for convenience and then we could consider the gradient of this so similar to before we're going to update our parameters w so our new parameters wt plus one will be our previous parameters wt minus a small step size times the gradient of this loss we can then write out the definition of the gradient of that loss which we do here on the second line and then we note that this expectation doesn't actually depend on our parameters so we can just push the gradient inside we get this update which is our step size times an error term reward minus our current estimate times the gradient of this action value and then we can sample this to get a stochastic gradient update as we saw before and the tabular case would just be a special case which is indexes into the corresponding cell for the state action pair so we basically just use the exact same things that we've seen before for the bandit with state setting and we can use a stochastic gradient updates to do the prediction in these cases this also works for very large state spaces this is just regression you could do linear regression you could do nonlinear regress regression which you've probably covered in previous courses so we won't go into a lot of detail on that but it's a valid update and it will converge to the right estimates where these estimates are i'm reminding you limited in the sense that you can't actually expect all of your values to be perfect because it might be that your function class is just not rich enough to actually get all of the values completely accurate in every state but this process will converge under suitable assumptions to the best parameter vector that you can find so again for linear functions we basically now are going to extend the action value approach also to linear functions where we're going to assume that we have features for each state action pair and then we can just multiply rate parameter w with these features which means that the gradient of our action value function will simply be the features for that saved action pair which means that this stochastic gradient descent update for our weights will then look like this where we simply replace the gradient with those features so over the linear update this update corresponds to a step size times your prediction error times a feature vector and for the non-linear update you have very similarly your step size times your prediction error times a gradient and a lot of our next algorithms would look exactly the same they will just change certain aspects of this for instance here we're still considering the bandit case we're just considering learning expected rewards so there's no sequences yet and that's where we're going to go next now we're going to consider sequential decision problems so we're still doing prediction our goal is just to predict the 
value of a policy in subsequent lectures where i will be talking about control optimizing the policy but for now we're sticking to prediction and now we're just going to sample trajectories under our policy and of course not shown here on the slide but of course the probabilities of these trajectories also depend on the underlying dynamics under the underlying markov decision process then so maybe somewhat obviously we can extend the banded approach to full returns by simply just sampling a full episode and then constructing a return i'm reminding you that a return is simply the accumulation of rewards so we have gt the return from time step t into the future as defined as the immediate reward rt plus one and then the discounted next reward rt plus two and so on and so on until in this case the end of the episode which arrives at sometimes the big t which is in the future a return will only stretch as far as the episode and then after the episode is done we imagine we will be reinitialized in the states and we can go through this whole process again and then the expected return which is our goal is defined as simply the expectation of these returns so similar to the bandit with state setting we can sample this and we can use this instead of the expected return as a target in our updates this algorithm that does that is called monte carlo policy evaluation and it's covered in chapter five from such an ambardo now i'm going to walk you through an example to give you a little bit more of intuition of how that algorithm works in practice this example is in the game of blackjack and blackjack is a card game in which the goal is to get more points than an opponent called the dealer we're going to go first and we're going to try to accumulate as many points as we can but not more than 21. 
so if you get more than 21 points you go bust as they say and so therefore basically your your goal is to get as close to 21 without going beyond it to do so you're going to draw cards and each of these cards is going to be worth a certain amount of points the number of cards are simply worth how however like large the number is so a card with a three or four is worth three or four points all of the picture cards the jack queen and king are worth 10 points and the ace is a special card it's worth either 11 points or you can pick it to be worth one point this is useful for when you draw a card and you go above 21 if you then had an ace you can say ah no my ace is now no longer worth 11 points now i'm going to make it worth one point instead and now you're below 21 again we're going to formalize this problem in in the following way where we're going to enumerate states or so we're going to go for a tabular approach and this state will consist of our current sum and this current sum is the sum of the cards you have so far and this will be between 12 and 21 for reasons that i'll explain in a moment i've already said if you go beyond 21 you have already gone bust so then the episode ends so that state is unimportant but we're going to start with any number between 12 and 21 and in addition to that we can also see the dealer's card so the dealers card we're only seeing one of them the dealer is going to play after you so they are going to draw more cards after you're done but you can already see one of these cards and this is informative to tell you whether you maybe should continue or not and then in addition to that we also have an additional state variable which tells us whether we have a usable ace which basically just means do we have an ace and can we make that a's worth 11 without going above 21. so say you have 16 points let's say you have an 8th and a 5 and let's say you then draw a 10. 
this would bring you to 26 points which as i explained to to make you go bust but then you can say i know my ace is now only worth one point and i'm back at 16 points and i can go again but the state will have changed because now you no longer have a user ways in terms of the action space there's two different actions you can do you can either stick at which point it's now the dealer's turn and they will resolve this will then terminate your episode or you can draw which means you just take another card if you draw you can draw again in the next step or you could stick in the next step when you stick the episode always terminates and you get a plus one if then the dealer doesn't get a higher sum than you or if the dealer goes bust which is also possible so if the dealer goes above 21 they lose if they don't go above 21 but they have fewer points than you they also lose and you get plus one if you happen to arrive at exactly the same number you get zero but if the dealer manages to get more points than you without going above 21 then you get minus one if instead you draw if you go above 21 and you didn't have a usable ace you cannot avoid this from happening then you get minus one and the episode terminates immediately the dealer has now won otherwise you get zero and then the game continues so you could draw again or you could stick as i mentioned you start with at least 12 points this is simply because if you have fewer than 12 points you should always draw more cards because you can never go bust and therefore if you say have 11 points there is no card that could bring you above 21 because even if you draw an ace you could always choose it to be worth one so you can always get more points so you can basically think of this as a process a stochastic process that brings you to the initial state of your episode note that the state description here is slightly partial observable because we're just giving you a number so you don't actually know what that consists of and even knowing whether you have a usable ace or not doesn't actually give you all the information that you could have because for instance you could have two aces and then that will be hidden from you so there's some slight partial observability here but that turns out not to be a big factor then what we do is we run monte carlo learning so we're going to generate a whole bunch of episodes and we're going to sample the returns for those episodes for some fixed policy and then we're going to generate these plots and i'm going to explain these in a second which show your value estimates for that policy and then of course in later lectures we can talk about oh how should we then maybe improve our policy to do better but this is a reasonable policy in which i believe you draw if you have fewer than 17 points and otherwise you stick or it's something similar um and what's shown here's four plots and i'll explain what these are first i want to draw your attention to the bottom right where we see what the axes are on this plot so one axis is which card the dealer is showing which is either an ace or a two or a three and one or a ten where we merge all of the picture cards just into a ten as well because they're all worth 10. and on the other axis we see the current sum that we have it's either 12 13 and 1 or 21. 
these z-axis the height is basically the estimated value we see it's always between -1 and 1 this is because the maximum reward you can get during an episode is plus one at the end or the lowest reward is minus one at the end all of the intermediate rewards if there are any are always zero because if you draw but you didn't go bust you just get zero so the total return for each episode is between minus one and one or it's either minus one or zero or one and now i want to draw your attention to these plots and i wanna go i'm going to ask you a couple of questions that you can think about so feel free to pause the video if you want to think about these things and in particular i want to draw your attention now to the top left plot the left column here corresponds to having seen ten thousand episodes the right column corresponds to having seen half a million episodes now the top row corresponds to having a usable ace and the bottom row corresponds to not having a usable a's interstate so the first question i want to ask you is uh why does this top left plot look so much more bumpy than the plot to its right and then the second question is why does it look more bumpy than the plot below it so feel free to think about that for a second and then i'm going to give you uh my explanation so maybe the first one was uh maybe somewhat obvious so after 10 000 episodes we don't have a lot of data for each of these states yet but after half a million episodes we have accumulated by now quite a bit of data for each of these states so our value estimates have improved so maybe the difference between the left and the right was somewhat obvious maybe the difference between the top and the bottom is a little bit less obvious but the reason for that is i'm going to argue that the states at the top in the top row are actually less common than the states in the bottom row in a normal deck of cards there's 52 cards out of which only four are aces so states in which you have an ace are actually comparatively rare so even on the left we've seen 10 000 episodes in total that doesn't mean that every state has seen the same amount of episodes and in fact the states in which you have an ace may have been visited much less often in some cases now finally i just want to draw your attention to the shape of the value function where we we see maybe somewhat expectedly that if your own sum is high then the value function becomes higher and in fact if your sum is 21 then it's quite close to plus one except if the dealer is showing an ace because if the dealer is showing an ace it's actually not that unlikely that the dealer will also get to 21 at which point your return will be zero rather than plus one okay so this is just an example of monte carlo learning and how you could use that to find the value of a policy and maybe somewhat obviously we can then later use this information to then improve our policy but we won't go into that yet so what we've seen here is that monte carlo algorithms can indeed be used to learn value predictions unfortunately when episodes are very long learning could be quite slow so the example for blackjack was an example in which episodes are actually very very short right they only take like maybe one or two or three actions but they won't take hundreds and hundreds of actions or maybe even more than that but if they do and you have to wait all the way until the end of an episode every time before you can start updating that might be tedious so we have to wait until an episode ends before we can 
learn why do we have to do that well because the return is not well defined before we do right so we're using the full return of an episode to update towards and that means we have to wait until the episode ends before we even have the thing that we want to use in our update in addition these returns can have very high variance in some cases especially if episodes are long so are there alternatives are there other things we could use other algorithms that maybe don't have these downsides and of course i wouldn't be asking this question if there wasn't an affirmative answer so this brings us to one of the most important concepts in reinforcement learning called temporal difference learning so i'm just going to start by reminding you of the bellman equation that we've talked about or diana actually talked about at length in the previous lectures the bellman equation relates the value of a state with the value of the next state or the expected value of the next state and this is actually a definition of the value of a policy so the value of a policy is defined as the expected return but turns out to be exactly equivalent to the expected immediate reward rt plus one plus the discounted true value of that policy in the next state st plus one we've seen that you can approximate these values by iterating basically turning the definition of the value into an update so the difference here is now that the v pi within the expectation has been replaced with our current estimates v k because we're doing this in iterations maybe across all states at the same time so we denote this with this iteration with some number k and then we update our value function by replacing maybe all of them at the same time you could do this for all states at the same time with a new estimate vk plus one which is defined as the immediate reward rt plus one plus the discounted current estimate of the next state value vk and we've seen that these algorithms actually do learn um and they they do find the true value of a policy now we can see there there's on the right hand side there's an expectation but we could sample that so maybe we can just plug that in and we can just say oh maybe we just see a sample rt plus one plus the discounted value of the next state st plus one and then use that well maybe you don't want to update all the way there so instead we're going to argue that's going to be too noisy so instead just take a small step so this is this now looks very similar to the monte carlo learning algorithm but instead of updating towards a return full return we're going to update towards this other target which is the reward plus the discounted estimate for the next state so the change here between monte carlo learning and this algorithm is that we've replaced the full return with something that uses our current value estimates instead note that i've written down the tabular update here but you can extend this in the similar way as we did for monte carlo learning to function approximation or actually we did sort of bandage with states but the bandits with states could be replaced with monte carlo learning by simply swapping out a reward for the return and then similar here we could swap out that return for this new target so just to recap we're in the prediction setting we're learning v pi online from experience under a policy pi and then the monte carlo update looks like this the tabular monte carlo update we have some state some value estimates maybe vn or vt you could also call this i'm calling it n here i could have 
also used k as before because these updates cannot actually happen at time step t because the return is not yet known completely at time step t right we only know the return at the end of the episode which might be some data time step so instead of saying which time step that actually is i'm just saying oh there's some iterate iterative procedure here and we're updating our value function by using the return temporal difference learning which is that new algorithm which we just talked about instead uses this this new target which just unrolls the experience one step and then uses our estimates to replace the rest of the return and this is called temporal difference learning because the error term here is called a temporal difference error which looks at one step into the future and looks at the difference in value from what we currently think and comparing that to one step into the future this temporal difference error which is simply defined as rt plus one plus the discounted value of this state st plus one minus our current value estimate of the value at st is called the temporal difference error and we typically denote this with a delta so keep that in mind delta t is the temporal difference error and it's defined as as this so now we can talk about these algorithms and maybe get a bit more intuition by thinking about them how they work with the problem at hand so dynamic programming works like i like this there's some three of possibilities here you're in some states and you consider all the possible possibilities that might happen next states here are denoted by these white nodes actions are black smaller nodes and then what we see here is effectively that sorry the dynamic programming looks at all possible actions in this case too and then also at all possible transitions for each action so in this case each action can then randomly end up in two different states so there's four states in total that you could end up in after one step and dynamic programming considers all of these which of course requires you to have a model that allows you to consider all of these possibilities conversely monte carlo learning takes a trajectory that samples an episode all the way until the end this terminal state denoted here with a green box with a t in there and then it uses that trajectory to construct the return and updates the value of that state at the beginning towards that return and of course you could also update all of the other states along the way towards the return from that state and then this new algorithm temporal difference learning instead only uses one sample so we see some commonalities here with with dynamic programming in the in the sense that we're only doing one step but it it does sample so it doesn't need a model so there's some commonality with dynamic programming in the sense we're only doing one step deep and there's some commonality with monte carlo learning in the sense that we're sampling so we call this usage of our estimates on the next time step bootstrapping this is different from the use of the term bootstrapping as the statistical bootstrap which refers to taking a data set and then resampling from the data set as if it's the underlying distribution it has nothing to do with that in reinforcement during bootstrapping typically refers to this process of using a value estimate to update your value estimate this is indicative of pulling yourself up on your own bootstraps essentially and it's good to to keep that in mind that that's just the the term for doing that 
and that means that under this terminology monte carlo learning does not bootstrap it does not use value estimates to bootstrap upon to construct its return its targets but dynamic programming does and temporal difference learning also does these both use our current value estimates as part of the target for their update and then in addition additionally we can think about sampling where we similarly see the monte carlo samples but now dynamic programming does not sample it instead uses the model and temporal difference learning does sample so we see we have these three algorithms with different properties and of course we can apply the same idea to action values as well where we have some action value function q and we simply do exactly the same thing we did before where we take one step and now we also take the subsequent action immediately a t plus one and we can use that this to then construct the temporal difference error exactly in the same way as before all that i did here is essentially replace every occurrence of a state with a state action pair index on the same time step this algorithm is called sarsa because it uses a state action reward state and action this name was coined by rich saturn now in terms of property templation learning is model free it doesn't require knowledge of the marketization process and it can therefore learn directly from experience and interestingly it can also learn from incomplete episodes by using this bootstrapping this means that if the episode is really long you don't have to wait until all of all the way in the end of the episode before you can start learning and this can be quite beneficial because then you can also learn during the episode now the extremist case of this that you could consider is maybe what if your lifetime is one big episode right what if there is no termination and some models are indeed effectively um formalized as such and then it becomes essential to be able to learn during the episode you can't just wait until the end of the episode because there's only one episode now to illustrate the differences between these algorithms monte carlo and temporal difference learning i'm going to step through an example which is called the driving home example and this one's also due to satsang ambarto so how does that look we're going to enumerate a couple of states small number of states and the idea is that we start at the office and we want to go home now at first we're going to talk about the columns here so the first column shows the state we're in the second column shows the elapsed minutes so far the difference in each step on these elapsed minutes you can consider your reward so between leaving the office and reaching the car we could say five minutes have passed and we could call that our reward we could basically say oh the reward on this transition was five and we're just predicting here so we don't actually care about the sign of the reward whether it's minus five if you would like to maximize the the speed you might want to minimize the minutes or something like that we don't have to worry about that because we're just doing predictions so we're just saying there's a reward of five along the way then the column after that the third column the predicted time to go is our current value estimate at the state that we're currently in so when we're in just leaving the office we predict it's 30 minutes in total to get home this is the accumulation of the rewards along the way then just as a helpful uh mental helper there's a final 
column here the predicted total time and this is simply a sum of the previous two columns this adds together how many minutes have already passed with the predicted time still to go because this will give us a feeling for how that total time is changing so when we're leaving the office as i mentioned so by definition zero minutes have passed from leaving the office and we're predicting it's still 30 minutes but then when we reach the car we notice it's raining and maybe that's uh bad because maybe that means that it tends to be busier on the highway and therefore even though five minutes have already passed maybe we're also a little bit so to get to the car we still predicted still 35 minutes so more than before which means our total predicted time has actually gone up to 40 minutes so the way to interpret this is that the reward along the way was five and then the new prediction is 35 then in our next stage we exit the highway the total amount of time elapsed so far is 20 minutes so you can think of this as the reward along the way was 15 minutes because it was five minutes when we reached the car now it's 20 minutes and from now we predict it's another 15 minutes to go which means that the predicted total time has actually gone down a little bit maybe it was less busy on the highway than we thought and things went a little bit more smoothly than we thought but then we exit the highway and we find ourselves behind a truck 10 minutes later so another 10 minutes have passed the reward could be considered to be 10 and from this point we we consider it another 10 minutes to go so the total predicted time has gone up again to 40. then at last we arrive at the home street 40 minutes have already passed so another 10 minutes have passed and we predict it's still three more minutes to go so our total predicate time has gone up again a little bit to 43 but our current prediction turns out to be completely accurate and we arrive home after 43 minutes so another three minutes now what do these different algorithms do well the monte carlo algorithm would basically just look at the total outcome and therefore it would then update all of these states that you've seen so far to take into account this new total outcome it basically just looks at the whole sample and it says well when you were leaving the office you thought it was 30 but it was actually 43. when you reached the car five minutes had passed you thought it was still 35 minutes more which means our total prediction was 40. this should have been 43 so that means that instead of predicting 35 you should have predicted 38 perhaps run when reaching the car and similarly when exiting the highway or when reaching the secondary road when we got stuck behind the truck we have to update these predictions upwards that's what monte carlo does if we then look at the right hand plot for temporal difference learning it looks a little bit differently when leaving the office we then reached the car and it was raining and we predicted it was still more time to go there five minutes already passed and we thought it was 35 more minutes so when we said oh from the office it's just 30 minutes now we're thinking when we've reached our car no actually it's more like 40 minutes in total so we should update that previous state value upwards to 40. 
and you can immediately execute that you could immediately update that state value update but then when we reached the car and it was raining we thought it was 35 minutes but then when we exited the highway 15 of those minutes had passed and then we thought oh actually from this point onwards it's it's not that long anymore i can go back a slide we're here now we were predicting from reaching the car that it was 35 more minutes but then when we exit the highway we notice it's actually only 15 minutes later than it was before and we think it's another 15 minutes to go that means that instead of 35 this should have maybe been 30 instead so it tells you to actually update that one down so the purpose of showing you this is not to say that one is right and the other one is wrong but it's to show you the differences between these different algorithms how do they operate we will see more examples later on as well so now we're going to compare these algorithms a little bit more and we're going to talk about the advantages and disadvantages of each so as i mentioned temporal difference learning can learn already before knowing the final outcome it can in fact learn online after every step that it has seen whereas monte carlo learning must wait until the end of the episode before the return is known and before it could actually execute its update in addition sample difference learning can learn without the final outcome this is useful framework for when you for instance only have incomplete sequences it could be the case that you have a database of experience that you want to learn from somehow but the database is corrupted or is missing some data and then thing for different temporal difference learning could still be used on the individual transitions that you do have monte carlo cannot do that and it really needs the full return in order to be able to do its update in addition the ability to be able to learn without knowing the final outcome means that temporal difference learning can also work in continuing environments in which there are no episode terminations whereas of course monte carlo needs full episodes in order to do its updates finally well not fine there's one more after this temporal difference learning is independent of the temporal span of the prediction so what do i mean with that that's a whole mouthful i mean with this that the computation of temporal difference learning is constant on every time step how many steps you want to do in an episode does not matter in terms of the computational complexity on each time step for temporal difference learning so why is that not true for monte carlo well monte carlo needs to store everything so to td can hear from single transitions but monte carlo must store all the predictions you've seen in an episode in order to be able to update them at the end of the episode so that means that the memory requirements for monster carlo actually grow when the episode becomes longer and longer this is a pure computational property it has nothing to do with the statistical properties of these algorithms but on the flip side temporal difference learning needs reasonable value estimates if these value estimates are very bad then obviously if you're going to construct targets using these then maybe your updates won't be very good and that means that actually there's a little bit of a bias variance trade-off going on here the monte carlo return is actually an unbiased estimate of the true value this is in fact how the true value is defined is the 
expectation of these returns but the temporal difference target is a biased estimate of course unless you already have accurate predictions but that's an edge case because we don't assume that we have them in general now the temporal difference target does have lower variance because the return might depend on many random actions transitions and rewards whereas the temporal difference target only depends on one random action transition and reward but in some cases temporal difference learning can have irreducible lies for instance the world might be partially observable and the states that we're plugging into these value estimates might not tell us everything that's already a problem for monte carlo learning because it means that the states that we're updating don't have all the information maybe to give you a complete accurate accurate description and therefore your value estimates will be a little bit off but you could imagine that this can get worse and indeed you can show this theoretically as well that this can get worse if you're additionally using these value estimates which are a little bit off because your state doesn't tell you enough in constructing the target for your update implicitly monte carlo learning a different way to think about this would account for all of the latent variables happening along the way so even though you can't observe exactly where you are the return itself would just take that into account because the return itself does depend on all of the environment variables that you can't maybe observe similarly but a little bit different the function used to approximate the values might fit poorly and this might also be true in the limits it might be that your function class let's say a linear function can't actually hold accurate predictions for all states if that is the case then temporal difference learning has irreducible bias in its target and therefore also in the values it eventually learns in the tabular case however both monte carlo and temporal difference earning will converge to the true value estimates we will talk more about these properties and especially about the function approximation part in later lectures so now to build even more intuition let's go into another example called a random walk so how does that look we have five states it's a small example it's meant to be an intuitive example and we have in each of these states two actions so we start in the middle state denoted c here and we either go left or right with equal probability the initial value estimates are a half for every state and above these transitions you see the reward depicted which is zero in almost all transitions except if you take the right action from state e then the episode terminates with a reward of one additionally if you take the left action from state a the episode also terminates with a reward of zero all of the other transitions are worth zero the true values happen to be 1 6 for state a 2 6 for state b and so on and so on until 5 6 for state e it might be an interesting exercise for you to go through this and to actually prove that this is the case using dynamic programming you could write down the probability of each transition and you could write down the reward function and you could do dynamic programming to find these value estimates so we put that market transition process here on the top and then we're going to talk about updating the values and first we're going to show you what td does so there's a couple of lines in this plot and these lines correspond 
So we put that Markov process at the top of the slide, and then we're going to talk about updating the values, and first we're going to show what TD does. There are a couple of lines in this plot, and these lines correspond to our value estimates for each of these states after a certain number of episodes. The line marked 0 is fully horizontal, because we've initialized all state values at one half. Then there's a line marked 1, and it's actually identical to the line marked 0 in most states, except for state A. The reason is that apparently this first episode terminated by stepping left from state A, with a reward of zero. On all the other transitions (this is an undiscounted problem, so the discount factor is one), for instance when stepping from state B to state C, we would have seen a reward of zero along the way, and our temporal difference target would be the reward plus the next state value. That next state value at C would be one half, because that's how the values were initialized, so our target would be one half; but our current estimate at state B would also be one half, so the total temporal difference error would be zero, and we wouldn't update the value of state B on this intermediate step. Eventually we reach state A and take the step into the terminal state. The terminal state by definition always has value zero, so the temporal difference error for this last transition was minus a half: a zero reward plus a zero next state value, minus our current estimate for state A, which was a half. And then we see that we've updated the value of state A, in a tabular way, slightly down, roughly 10% of the way down, or maybe even exactly 10%: it started at 0.5 and now seems to be around 0.45. From this we can infer that the step size parameter, the learning rate, was 0.1 for this TD algorithm. Then we can see that after, say, 10 episodes all of the values have been updated a little bit, and after 100 episodes we're very close to the true values, depicted here as a diagonal. So this is just stepping through the problem step by step, and it can be quite informative, when you implement these algorithms for the first time, to really take it easy and go step by step and look at every update, both to make sure there are no errors in the implementation and to better understand these algorithms.
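To make that step-by-step description concrete, here is a minimal sketch of tabular TD(0) on this random walk, with the 0.5 initialisation and the 0.1 step size inferred from the plot; the sample_episode helper is an illustrative stand-in for interacting with the environment, not from the lecture.

    import random

    def sample_episode():
        # Returns a list of (state, reward, next_state); next_state is None at termination.
        states = "ABCDE"
        i = 2                                   # start in the middle state C
        transitions = []
        while True:
            j = i + random.choice([-1, +1])     # left or right with equal probability
            if j < 0:
                transitions.append((states[i], 0.0, None)); return transitions
            if j > 4:
                transitions.append((states[i], 1.0, None)); return transitions
            transitions.append((states[i], 0.0, states[j]))
            i = j

    v = {s: 0.5 for s in "ABCDE"}
    alpha = 0.1
    for episode in range(100):
        for s, r, s_next in sample_episode():
            target = r + (0.0 if s_next is None else v[s_next])   # terminal value is 0
            v[s] += alpha * (target - v[s])                        # tabular TD(0) update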
Then we can run these different algorithms, Monte Carlo learning and temporal difference learning, on this problem and look at the root mean squared error. This is the prediction error, condensed over all states: the root mean squared error across all states, or the average error. On the x-axis we have the number of episodes seen so far, so learning proceeds as we go from left to right, and on the y-axis we see the error of the state value predictions. First I want to draw your attention to the darkest line in both plots, the black line. This corresponds to the lowest step size we tried, 0.01. Both for Monte Carlo learning and for temporal difference learning we see a smooth progression where the error goes down as the episode number goes up, but it's fairly slow: the error after 100 episodes is not that low yet, and it could clearly go down further if we left it running longer. But if we wanted to learn faster, if we only had 100 episodes and had to stop there, it maybe makes sense to pick a higher learning rate, a higher step size. We can see that the brown curve, which corresponds to a three times larger step size of 0.03, indeed has a lower error, not just at the end but at every episode along the way. However, then we start seeing a trade-off. I'm going to draw your attention again to the Monte Carlo plot, and let's now consider the brightest line, which corresponds to a fairly high step size of 0.3. Learning is indeed very fast at the beginning, but then almost immediately stabilizes; it doesn't go much below roughly 0.45, and the variance is also quite high. Why is this the case? Because the Monte Carlo update itself has high variance, and we cannot reduce that variance further because we don't have a decaying step size here, which means that the total error of this algorithm after a couple of episodes will just be the variance in the updates. Similarly, if we reduce the step size slightly to 0.1, learning is slightly slower at the beginning, but the error does go lower, and then also stabilizes, in this case slightly above 0.3. We see something similar happening for TD, where there's this trade-off: you can learn very quickly at the beginning, but maybe the error stabilizes at a higher point. If we compare temporal difference learning to Monte Carlo learning, however, we can see that temporal difference learning allows you to set the step size higher, has a different trade-off, and indeed many of these errors are smaller than for Monte Carlo. If we look, for instance, at the midway point of 50 episodes, temporal difference learning then prefers a step size of 0.1, if you had to pick one constant step size, whereas Monte Carlo learning prefers a lower step size of 0.03, because it has higher variance, and the error for Monte Carlo will be higher even if we tune the step size among these four options. Obviously you could also extend this and consider step size schedules that start high and go lower as you learn more; that's not the point here. I just want to show you these properties of these algorithms, where you can clearly see from the plots that Monte Carlo simply has higher variance than temporal difference learning, and in this case that leads to higher errors for any constant step size, essentially, if you tune over constant step sizes.
Okay, now we're going to look even more in depth at these properties by considering batch updating. We know that tabular Monte Carlo and temporal difference learning do converge to the true values: if we store all of the values separately in a table, we can update them independently to eventually become accurate, under the conditions that your experience goes to infinity and your step size decays slowly towards zero. But what about finite experience? In practice we won't have infinite experience; we're going to learn for some time, maybe a long time, but not infinitely long, and we won't necessarily decay our step size all the way to zero. So what we're going to do now is consider a fixed batch of experience, and specifically we're going to consider having collected K different episodes, each consisting of a number of steps; the number of steps per episode can differ.
Then we're going to repeatedly replay each sample from these episodes and apply either Monte Carlo learning or temporal difference learning. It says TD(0) here; the reason for calling this algorithm TD(0) will become clear later, but this is just the standard temporal difference learning algorithm we've discussed so far. You can view this as basically similar to sampling from an empirical model. In the dynamic programming lecture we talked about having a model and being able to use it to learn; in this case you could consider the data set as, in some sense, defining observed frequencies of transitions, which is similar to having a model, but it's an empirical model, and it won't exactly match the real underlying model, because you only have a finite amount of experience.
Now we're going to apply this idea of batch learning to a specific small example in which we only have two states, so that we can reason it through all the way. Let's just call the two states A and B. There's not going to be any discounting, and let's say we have eight episodes of experience, where each line below denotes exactly one episode. One of these episodes starts in state A, gets a reward of zero, proceeds to state B, gets another reward of zero, and then the episode terminates. The next episode, on the next line, started in state B instead of A, got a reward of one, and terminated. That happens more often: six out of these eight episodes are of that form, starting in B and terminating with a reward of one. And then we also have one episode that started in B as well, but terminated with a reward of zero. Now, maybe even without thinking about these algorithms at all, I want you to think about what the values of states A and B are. What do you think these values should be, if this is all the information you have, absolutely no information apart from this data set? I encourage you to pause the video and think about this for a little bit, and then I'm going to tell you what I think are plausible values. It could be that you came up with more than one answer, or that some of you came up with one answer and some with a different one, so let me motivate two different answers here. First, state B; maybe that one is somewhat obvious. We've seen eight episodes that were ever in state B, and six out of eight times we saw a plus one, and two out of eight times a zero, so maybe the appropriate value for state B is 0.75. We've also seen one episode that was in state A, and in that episode we got a total return of zero, so one could reasonably say the value of state A is zero, or at least that this is the best guess we can make. Now, when I say that, some of you might object and say: no, that's not the right way to think about this; the right way is to note that whenever you were in state A, sure, it only happened once, but whenever you were in state A you transitioned to state B with a reward of zero along the way, which implies that states A and B must have exactly the same value. That is also a reasonable argument, but it means the value of state A would be 0.75, which is quite different from zero. If you were using that second line of reasoning, what you're effectively arguing is that the underlying model looks like this:
A goes to B a hundred percent of the time, as far as we know, and from B we get plus one 75 percent of the time and zero 25 percent of the time. So what is the right answer here? That's actually a little bit unclear, and I'm going to explain why both answers are in some way reasonable. Monte Carlo learning converges to the best mean squared fit for the observed returns. You can write this as follows: we sum over the episodes k from 1 to K, and within each episode we look at all the time steps from 1 to T_k for that episode, and then we compare all of the returns we've seen to our current value estimates, and we minimize the squared error between these observed returns and the value estimates. That indeed sounds like a very reasonable approach: we're just minimizing the difference between the returns we have seen and the value estimates that we have. In the example we've just seen, this implies that the value of state A is zero, because the only return we've ever seen from state A was zero. Instead, temporal difference learning converges to the solution of the maximum likelihood Markov model given the data. That's what we saw on the previous slide: this is the most likely model given the data we've seen so far, and temporal difference learning turns out to find the solution that corresponds to this model. So if you agree with that approach, if you say that's what you should be estimating and then solving, that's what temporal difference learning does. This would be the solution of the empirical Markov decision process, assuming the empirical data you've seen is the true data, and in the example this gives the same estimate for both states A and B. Now, you might find one better than the other, but why would you take one or the other? This is a subtle argument, and it turns out you can think about it as follows. Temporal difference learning exploits the Markov property, and this can help learning in fully observable environments. What I mean is that the assumption, when we built the empirical model in the previous example, is that if you're in state B then it doesn't matter that you were in state A before; we can estimate the value of state B separately, and this tells us everything that will happen from state B onwards. That is basically assuming that state B is Markovian. Instead, and this is what Monte Carlo exploits, you could say: what if we don't assume that? Then whenever we were in state A, it turns out our second reward was zero, and these could be related; it could be that whenever we are in state A, we already know that all of the future rewards are going to be zero, and if we then reach state B, that might still be true but we just can't see it. It's a latent variable, a hidden variable; the world is not fully observable. If that were true for the problem we were in, the Monte Carlo estimate would perhaps be the better one, and indeed learning with Monte Carlo can help in partially observable environments, because it makes less of an assumption that the states are sufficient to construct your value estimates. So, as mentioned before, in some sense you can view this two-state batch learning example as an illustration of the difference in how these methods deal with fully observable versus partially observable environments.
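Here is a minimal sketch (not from the lecture) that replays this fixed batch of eight episodes with both methods, writing each episode as (state, reward, next_state) transitions with no discounting; the small constant step size and sweep count are illustrative choices. Batch Monte Carlo ends up near V(A) = 0, while batch TD(0) ends up near V(A) = V(B) = 0.75.

    episodes = [[("A", 0.0, "B"), ("B", 0.0, None)]]   # A, 0, B, 0, terminate
    episodes += [[("B", 1.0, None)]] * 6               # six episodes: B, 1, terminate
    episodes += [[("B", 0.0, None)]]                   # one episode: B, 0, terminate

    v_mc = {"A": 0.0, "B": 0.0}
    v_td = {"A": 0.0, "B": 0.0}
    alpha = 0.01

    for sweep in range(2000):                          # replay the fixed batch many times
        for episode in episodes:
            # Batch Monte Carlo: regress each visited state towards its observed return.
            ret = sum(r for _, r, _ in episode)        # undiscounted return from the first state
            for s, r, _ in episode:
                v_mc[s] += alpha * (ret - v_mc[s])
                ret -= r                               # return from the next visited state
            # Batch TD(0): bootstrap on the next state's current estimate.
            for s, r, s_next in episode:
                target = r + (0.0 if s_next is None else v_td[s_next])
                v_td[s] += alpha * (target - v_td[s])

    print(v_mc)   # approximately {'A': 0.0,  'B': 0.75}
    print(v_td)   # approximately {'A': 0.75, 'B': 0.75}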
Important to note is that with finite data, and also with function approximation, the solutions even in the limit might differ between temporal difference learning and Monte Carlo. These are two different statements: we've just seen it for finite data, and it's also true for function approximation. I haven't actually shown you that, but we'll get back to it in later lectures.
Now a natural question is: can we maybe get the best of both? First I'm just going to show you a unified view, where we put dynamic programming in the top left. The reason it's there is that the left corresponds to shallow updates, where we just look one step into the future, while the top versus the bottom distinguishes mechanisms that consider the full breadth; of course, in order to do that you need access to the model. That means that if we look at both full breadth and full depth, we get exhaustive search. I'm just showing that to complement the other approaches, not because it's an algorithm you should be using; it's computationally quite expensive and of course you also need a model, but it clearly fits in the figure in terms of comparing breadth and depth. If we go down, we get the algorithms that only look at one trajectory, which can be used when we can only sample, when we only deal with interaction. We see temporal difference learning in the bottom left and Monte Carlo in the bottom right, where we can think of temporal difference learning as having a breadth of one but also a depth of only one: we take one step in the world and use that to update a value estimate, whereas Monte Carlo makes a very deep update, rolling forward all the way until the end of the episode and using that full trajectory to update its value estimates. Now, as discussed, temporal difference learning uses value estimates which might be inaccurate, and in addition, which we haven't talked about so much yet, the information can propagate quite slowly; I'll show you an example of this in a moment. It means that if we see a reward that is quite useful, a surprising reward, temporal difference learning will by its nature only update the state value immediately in front of it. If in that episode you never reach that state again, all of the other state values don't learn about this reward, whereas Monte Carlo would update all of the previous states visited in that episode, so they eventually learn about this new reward. So temporal difference learning has a problem in the sense that information can propagate backwards quite slowly, and therefore credit assignment can be quite slow. Monte Carlo learning does propagate information faster: if you see a surprising reward, Monte Carlo learning will, at the end of the episode, tell all the previous states about it, but of course the updates are noisier, and it has all the other properties we talked about before. Now, we can actually go in between these, and one way to do that is as follows: instead of looking exactly one step ahead, as temporal difference learning does, we could consider looking two steps ahead, or three steps, or generically n steps ahead. Then we could consider Monte Carlo learning to be at the other extreme, essentially, where it looks infinitely far into the future, up until the end of the episode.
Written in equations, you could write that as follows, where we introduce new notation with a bracketed superscript. G with a superscript (1) is a one-step return, which takes exactly one step in the world, R_{t+1}, and then bootstraps on our current value estimate of S_{t+1}; so a one-step return corresponds exactly to temporal difference learning as we defined it before. An infinite-step return, shown at the bottom here, would correspond to Monte Carlo learning, because if you take infinitely many steps you always reach the end of the episode before you choose to bootstrap. In between, we could consider for instance a two-step approach, which takes not just one but two rewards into account, and then bootstraps on the value of the state at time step t+2. In general, the n-step return (I often say multi-step returns in the title of the slide, but colloquially people refer to these mechanisms as using n-step returns) is defined by simply taking n rewards into account, appropriately discounted. Please note that the last reward, at time t+n, is discounted only n-1 times, which is consistent with the one-step approach, but the value estimate is then discounted n times. Then we can just use this in our updates. This does mean that we have to wait a little before we can actually execute the update, and we do have to store some estimates or states along the way, but only as many as we want: if we have a 10-step return, we have to wait 10 steps before we can update a state value, and we have to store the ten states along the way in something like a buffer. So it has intermediate properties, both statistically and computationally, between temporal difference learning at one extreme and Monte Carlo learning at the other.
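Reconstructed from the verbal description (using the lecture's bracketed-superscript notation, with v our current value estimate), the n-step returns are

G_t^{(1)} = R_{t+1} + \gamma\, v(S_{t+1})  (one-step return: TD)
G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 v(S_{t+2})
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n v(S_{t+n})
G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_{T}  (full return: Monte Carlo)

and the corresponding tabular update would be v(S_t) \leftarrow v(S_t) + \alpha \big(G_t^{(n)} - v(S_t)\big).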
Now I'm going to show you some examples to make that more concrete. First, an example to illustrate this property that TD doesn't propagate information very far backwards, and for that we're going to use Sarsa; I remind you that Sarsa is simply temporal difference learning for state-action values. On the left we see a specific path that was taken: we started over here, then went right, right, up, right, up, and so on, and in the end we reached the goal. If we then do one-step TD, in this case one-step Sarsa, we would only update the state-action value for the action that led immediately to the goal. This is assuming that all of the other intermediate steps carry no information: maybe the rewards are zero, maybe your current value estimates are also zero, and you only get a plus one when you reach the goal. If we instead consider a 10-step update, it would update all of these state-action values, appropriately discounted, along the way, so it would propagate the information much further back. And considering the next episode, this could be beneficial: you might start here again, and if you only did one-step TD, Sarsa in this case, you might just meander around without learning anything new for a long time, whereas if you had done a 10-step return, you're more likely to quickly bump into one of these updated state values, and then information can start propagating backwards to the beginning where you start.
We can apply this to a random walk, just to get a bit more intuition. We see the same random walk we talked about before at the top, but now with 19 states rather than five; otherwise it's the same: there's a starting state in the middle, a plus one reward on one end, and a zero on the other end. Then we can apply these n-step algorithms and see how they fare. What you see on the slide is something called a parameter plot, because on the x-axis we have a parameter, in this case the step size, and on the y-axis we see the root mean squared error over the first 10 episodes. So we're not looking at infinite experience here, just a finite amount, and we see how these algorithms fare across all of these different step sizes. For instance, let's look at n = 1, which I remind you corresponds exactly to the normal temporal difference learning algorithm we discussed before. We see that the best performance, the lowest error, if we only care about the first 10 episodes and have to pick a constant step size, is for a step size around maybe 0.8; if you set it higher or lower, the error is a little worse. This has been averaged over multiple runs, which is why these curves look so smooth; it's a fairly small problem, so you can run it quite often to get very precise insights into what's happening. What we notice is that for n = 2, the two-step approach, a really high step size of one actually does a little worse, because the variance is in some sense higher, but you can tune your step size, in this case to around 0.6, to get lower errors than were possible with the one-step approach over the first 10 episodes. Taking n = 4 is maybe even a little better, but notice again that the preferred step size is again a little lower. This is because as we increase n, the variance goes up more and more, and that implies we need a lower and lower step size to make sure our updates don't have too high variance, which would make the error higher. Then, all the way up here, we see a line marked 512, which is a 512-step temporal difference learning algorithm. In this case that is essentially Monte Carlo, because the probability of an episode longer than 512 steps is very small, and we can also see that because the 256-step temporal difference learning algorithm is quite similar; both are already quite similar to the full Monte Carlo algorithm. For these algorithms we see two things: first, they prefer very small step sizes, much smaller than one-step temporal difference learning, and in addition, even if you tune the constant step size very well, the error will still be quite high, and if you set the step size too high, the error will be much higher still; those curves go off the top of the plot. So we clearly see a bias-variance trade-off here, essentially, where an intermediate value of n helps us learn faster and get a lower error over the first 10 episodes for a well-tuned step size, and the best values are not the extremes: it's not n = 1 and it's not n = infinity, so it's not TD and it's not Monte Carlo, but some intermediate n-step TD algorithm.
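A minimal sketch of the kind of update used in this experiment, assuming a tabular value dictionary v and a list of (state, reward, next_state) transitions per episode (next_state None at termination); for simplicity this sketch applies the updates after the episode ends, whereas the online algorithm applies each update as soon as its n-th step is available.

    def n_step_td_update(v, transitions, n, alpha, gamma=1.0):
        # Tabular n-step TD: update every visited state towards its n-step return.
        T = len(transitions)
        for t in range(T):
            g, discount = 0.0, 1.0
            k = t
            while k < T and k < t + n:            # accumulate up to n rewards
                _, r_k, _ = transitions[k]
                g += discount * r_k
                discount *= gamma
                k += 1
            if k < T:                              # bootstrap only if we did not terminate
                g += discount * v[transitions[k][0]]
            s_t = transitions[t][0]
            v[s_t] += alpha * (g - v[s_t])         # n-step TD update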
Okay, make sure you understood that last bit, because now we're going to go on to a closely related topic: mixed multi-step returns. We've just talked about n-step returns, so, as I said, make sure you understood that part before continuing. These n-step returns bootstrap after n steps on a state value, the value of S_{t+n}. One way to write these returns is almost recursively; they don't quite recurse, but you can view them as doing the following: if you have an n-step return, you take one step, R_{t+1}, and then you add an (n-1)-step return. So every step you lose one step, until you only have a one-step return left, and the one-step return then bootstraps on your state value. Why am I writing it like this? Because you can then look at these different cases and say: on some of these steps we fully continue with the random sample, with the trajectory, whereas on other steps we fully stop; we say, that's enough, bootstrap here. And now I'm going to argue that you could do that, but you don't have to. There's something different you could do, which is to bootstrap a little bit. For instance, you could have a parameter which we're going to call lambda. We take one step, as always, and apply the discount factor, but then, instead of either continuing or bootstrapping fully, we bootstrap a little bit: a linear interpolation between our estimated state value and the rest of that lambda return, as we call it. This is defined recursively, because the lambda return at the next time step, with the same lambda, will also take one step and then again bootstrap a little bit before continuing, and then again take one step and bootstrap a little bit, and so on. It turns out, if you do the math, that this is exactly equivalent to a weighted sum of n-step returns; in fact, of all the n-step returns from one to infinity, and the weights sum to one, so it's a proper weighted average. We can see this by stepping through a few examples: for n = 1 the lambda term goes away, so the one-step return is weighted with exactly 1 - lambda; the two-step return is then weighted by lambda times (1 - lambda), so slightly less. Typically lambda is a number between zero and one, often closer to one than to zero, but if we set lambda to, say, one half for simplicity, the lambda return would take a one-step return with weight one half, a two-step return with weight one quarter, a three-step return with weight one eighth, and so on. Then we can consider the special cases. If lambda is zero, the continuation term disappears completely and the bootstrap term gets weight one, which means that lambda = 0 corresponds exactly to the standard TD algorithm. If lambda is one, the recursion is full: we just have one reward plus the discounted next lambda return, and that next lambda return, with lambda still one, again takes the full next reward, and so on, so that would be exactly the same as Monte Carlo. So we have the same extremes as before with the n-step returns: lambda = 0 corresponds to one-step temporal difference learning, and lambda = 1 corresponds to full Monte Carlo.
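Written out (reconstructed from the verbal description, with v our current value estimates), the lambda return is

G_t^{\lambda} = R_{t+1} + \gamma \big[(1-\lambda)\, v(S_{t+1}) + \lambda\, G_{t+1}^{\lambda}\big],

which unrolls into the weighted average of n-step returns

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)},

so for \lambda = 1/2 the one-, two- and three-step returns get weights 1/2, 1/4 and 1/8, as in the example above.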
This is why on a previous slide you saw TD(0) pop up: that was referring exactly to this algorithm, where there is a more generic algorithm called TD(lambda), and if you set lambda to zero you get your one-step TD algorithm back. We can compare these to the n-step approaches, and here we plot them side by side for that same multi-step random walk, and we see some commonalities. First, let me draw your attention to lambda = 0, which is exactly the same curve as n = 1; this is true by definition, because both are the one-step TD algorithm. Similarly, for lambda = 1, I promised you that was exactly Monte Carlo, and indeed it's very similar to the curves for the 512-step and 256-step TD algorithms. In between, the curves look slightly different; you can see this curve, for instance, bends slightly differently from that one, and they don't correspond exactly to each other, because the n-step approaches always choose exactly one time step at which to bootstrap, whereas the lambda approaches bootstrap a little bit at multiple time steps. But the curves do look somewhat similar, and there's a rule of thumb you can use to think about how they relate. Maybe you find the n-step approaches a little more intuitive, because you can reason about two steps but find it harder to reason about this lambda. One way to think about it is that there's a rough correspondence, where 1 divided by (1 - lambda) is roughly the horizon of the lambda return. For instance, if we set lambda to 0.9, then 1 minus lambda is 0.1, and 1 divided by 0.1 is simply 10.
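One way to make that rule of thumb precise (a supplementary note, not from the lecture) is to observe that the weights (1-\lambda)\lambda^{\,n-1} form a geometric distribution over n whose mean is

\sum_{n=1}^{\infty} n\,(1-\lambda)\,\lambda^{\,n-1} = \frac{1}{1-\lambda},

so the lambda return bootstraps, on average, about 1/(1-\lambda) steps into the future.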
So we could say that lambda = 0.9 roughly corresponds to an n of 10, and we can see that correspondence in the plot: lambda = 0.8, which would correspond to a 5-step method, is indeed quite similar to this 4-step method over here, and lambda = 0.9, which corresponds to roughly 10 steps, is quite similar to this 8-step temporal difference learning method here; they're even colored the same way, which maybe makes the correspondence a little clearer. They are slightly different; you could look, for instance, at lambda = 0.4, which is presumably quite similar to lambda = 0.5 and would correspond to the two-step approach, and especially for higher learning rates they do look a little different, but there is a rough correspondence, and this is one way to think about these algorithms.
Now we're going to talk about some benefits; we've already alluded to this. Multi-step returns have benefits from both temporal difference learning and Monte Carlo, and the reason to consider them is that bootstrapping can have issues with bias, so one-step TD is not always great, and Monte Carlo can have issues with variance. Typically we can think of intermediate values as being somewhat good, because they trade off bias and variance in an appropriate way. In addition, there is the information propagation argument, which has a similar flavor. By intermediate values I mean that typically we see things like n = 5 or n = 10, or lambda = 0.9, as somehow almost magically good values; these intermediate values tend to work quite well in practice.
Okay, make sure you understood everything I was talking about before continuing to the next section. I also want to draw your attention to the fact that the next part is more advanced, and you don't need it to continue with the rest of the lectures, so you could pause here and return to it later if you wish. Or you could just continue now, of course, because it's quite related to what we were just discussing, but it's going to be, in some sense, a little bit orthogonal to some of the things we'll discuss later.
Having said that, let's continue. We talked a little before about the dependence on the temporal span of the predictions, and maybe you've already realized this: these multi-step approaches, and especially the mixed multi-step approaches we just talked about, are actually not independent of span, which means that, similar to Monte Carlo, you have to wait all the way until the end of the episode before you know your lambda return. The lambda return trades off statistical properties of the updates, but it still has the same computational issue: you can only construct the full lambda return once you've reached the end of the episode. That doesn't seem very desirable, and indeed you might also sometimes want to do Monte Carlo learning, where it's not desirable either. Conversely, temporal difference learning can update immediately and is independent of the span of the predictions. Before, maybe you took this as an argument that we should be using temporal difference learning, but here I'm going to make a different argument and ask: can we get the best of both worlds? Can we use these mixed multi-step returns, or other flavors of that, maybe even including Monte Carlo, but with an algorithm whose computational requirements do not grow indefinitely during the episode?
It turns out, and of course I wouldn't be asking this question if it weren't the case, that we can, and I'm going to explain how that works. For concreteness, let's recall linear function approximation, where the value function is a linear function: an inner product of a weight vector w with some features x. For Monte Carlo and temporal difference learning the update to the weights can then be written as follows: for Monte Carlo learning it's a step size times your return minus your current estimate, times the features, and for temporal difference learning it's your step size times your temporal difference error, times the features. We're going to talk about Monte Carlo first for a bit. For Monte Carlo learning we can update all states in an episode at once; this is typical, because you have to wait for the end of the episode anyway to know the returns for all states, so it's quite common to just update all of them in one big batch. Here we use t from 0 to T-1 to enumerate all of the time steps in this specific episode, so we're essentially restarting the time count from zero in every episode. And first I just want to note the tabular case as a special case, where the x vector is simply a one-hot vector.
Now I'm going to argue that we can look at these individual updates, where we have an update on every time step and we just sum over time steps, and define an update as follows. I'll prove this in a moment, but first I'm just going to give you the update to build some intuition. It is a very simple update that takes a step size alpha times your one-step TD error (not your Monte Carlo return; the one-step TD error), times some vector e. This e is called an eligibility trace, and it's defined recursively by taking the previous trace, decaying it according to gamma and lambda, and then adding the current feature vector or, more generally, your gradient x_t. Note the special case: if lambda is zero, we do indeed get one-step TD back. There's a choice here whether you want to do online TD, which means you update during the episode, or, for conceptual simplicity, offline TD, where you still accumulate these weight updates until the end of the episode; but for lambda = 0 this algorithm would just be TD. The intuition is that we're storing a vector that holds the eligibilities of all past states for the current temporal difference error, and then we add that to our update to the weights. When lambda is not zero, we have this trace that reaches into the past; it's similar to momentum, but it serves a slightly different purpose here. It holds, recursively, e_{t-1}, which therefore holds x_{t-1}, the features from the previous state, appropriately decayed, and also x_{t-2}, and so on, each decayed even more. This is kind of magical; I haven't yet shown you why or how it works, but it means that, if it works, we can update all of the past states to account for the new temporal difference error with a single update. We don't need to recompute their values, and we don't need to store them, so this would be an algorithm that is independent of the temporal span of the predictions: we don't need to store all of the individual feature vectors from past time steps.
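A minimal sketch of this algorithm with linear function approximation, assuming a feature function x(s) that returns a numpy array, transitions given as (state, reward, next_state) with next_state None at termination, and user-chosen alpha, gamma and lam; this is the online, accumulating-trace variant described here.

    import numpy as np

    def td_lambda_episode(w, transitions, x, alpha, gamma, lam):
        e = np.zeros_like(w)                      # eligibility trace, reset at the episode start
        for s, r, s_next in transitions:
            v_s = np.dot(w, x(s))
            v_next = 0.0 if s_next is None else np.dot(w, x(s_next))   # terminal value is 0
            delta = r + gamma * v_next - v_s      # one-step TD error
            e = gamma * lam * e + x(s)            # decay the trace, add the current features
            w = w + alpha * delta * e             # one update touches all 'eligible' past states
        return w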
Instead, we just merge them together by adding them into this one vector, and we only have to store that vector. This idea extends to function approximation; this is already the linear function approximation case, but in general x_t does not have to be one-hot: it can also be the gradient of your value function in the nonlinear case. How does that look? Here we're going to show it somewhat intuitively again. There's a Markov decision process here in which we only ever step down or to the right; in the end we reach some goal state from which we get a reward of plus one, and then we transition all the way back to the top. The trajectory will be somewhat random, and there's discounting, so the true values are larger the closer we are to the goal. If we imagine doing only one episode, TD(0) would only update the last state, the state in which we saw that one reward, whereas TD(lambda) would update all of the state values along the way, less and less so with lambda. This is true if you use the version we had before, with the lambda return, but it turns out the exact same thing is also true if you use these eligibility traces. The way to think about that is that at some point we see this temporal difference error, where we finally see this reward of plus one that we've never seen before, and then we multiply it with all of the previous features we've seen; in this case it's tabular, so all of the features are one-hot, and basically we're storing all of the visited states in this vector and updating them in one go.
Now I'm going to show you how and why that works. We're going to consider Monte Carlo first, and later I'll put the lambda parameter back in. We take the Monte Carlo error, and we're first going to rewrite it as a sum of temporal difference errors. To start, let's write it out one step: we get our first reward, R_{t+1}, plus the discounted next-step return, and we carry along the minus v part from the left-hand side. This already looks similar to a temporal difference error, but it's not, because for a temporal difference error we should bootstrap. So what we can do is just add that in: we add gamma times the value of the next state, and then we also subtract it, so we're effectively adding zero, but this allows us to write the whole thing as a temporal difference error plus another term, and notice that this term is exactly the same as what we started with, but at the next time step: there's a t+1 rather than a t. So we can write the Monte Carlo error as an immediate temporal difference error plus a discounted Monte Carlo error at the next time step. Of course we can repeat this: the Monte Carlo error at the next time step is just the TD error at the next time step plus the discounted Monte Carlo error at the time step after that, and so on, and we see that we can write the total Monte Carlo error as a sum of one-step temporal difference errors. I'm going to put this aside for now and use it on the next slide. Now we go back to the total update: at the end of the episode we update all of our weights according to all of these Monte Carlo returns, which we were only able to construct at the end of the episode.
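Written out (a reconstruction of the derivation just described, with \delta_k = R_{k+1} + \gamma v(S_{k+1}) - v(S_k), a terminal value v(S_T) = 0, and the value estimates held fixed during the episode):

G_t - v(S_t) = \delta_t + \gamma \big(G_{t+1} - v(S_{t+1})\big) = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, \delta_k.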
Now I'm going to plug in the thing we derived on the previous slide, replacing this Monte Carlo error with the summation over time of temporal difference errors. Note that the summation only starts at time step t, which is the time step at which the Monte Carlo return also starts. And now we're going to use a generic property of double sums. We have an outer sum over t from 0 to T-1, and an inner sum that starts at this variable t and continues to the same endpoint. It turns out that if you have a double summation like this, where the starting point of the inner index depends on the outer index, you can turn them around: you first sum over what used to be the inner index, k, and then sum over what used to be the outer index, t, but only up to k. So instead of starting k at t and going up, we take all of the k, but then only take t up to k. This is exactly equivalent; it's a generic property of these double sums, and I'll show it intuitively in a moment as well. Then we notice that the temporal difference error delta_k does not depend on the variable of the inner sum, it doesn't depend on t, so we can pull it out, and the remaining part, this summation from t = 0 to k, is a discounted sum of feature vectors. We give it a name, we just call it e, and we notice that e only depends on time step k and before. This means we can write it down as follows, where we now swap k for t, because we only have one variable left to sum over, so we can just rename k to t. That means the original summation can be rewritten in this form: originally, for every state we had an update with a return looking into the future, which is why we had to wait until the end of the episode, because we had to wait for G_t to be completely defined, and we've swapped it around to a summation with the same step size, a temporal difference error delta_t, and then this trace vector which looks into the past; it stores all of the feature vectors we've seen in previous states. We went through this mechanically; we haven't taken any magical steps, and you can follow it step by step and see that it's true (coming up with the derivation is maybe a different matter). So our total update can be written in this form, where the eligibility trace vector was defined as a summation up to time step t of discounted feature vectors. We can inspect this a little more and, for instance, pull off the last time step, which is just x_t; note that when j equals t, the discounting goes away, because it's just the discount to the power 0, which is 1.
So we pull off that term, and then the summation only goes up to t-1 rather than t. We can also pull out exactly one discount factor, so the exponent loses a one, and then we notice that e_t was defined as a summation of something that goes up to t, with t inside, while we now have something that only sums up to t-1, with t-1 inside, but is otherwise exactly the same. So this thing must, by definition, be e_{t-1}, which means we can write the summation recursively: a discounted previous trace e_{t-1} plus the current x_t. And e_t we're going to call an eligibility trace. The intuition is that on every step it decays according to the discount, and then the current feature vector is added. Because we're doing full Monte Carlo, the discount is the appropriate decay for propagating information backwards: you shouldn't propagate information more than it is used in the Monte Carlo returns of earlier states, and that should take the discounting into account. Summarizing, we have this immediate update for every time step, which is completely well defined at time step t; all the quantities are available to us at time step t. The Monte Carlo update would then sum all of these deltas over all of the time steps in the episode and then apply them; this is the offline algorithm. So even though we can compute these things along the way, and we can compute this sum incrementally as we go, so that we don't have growing memory requirements, in this specific algorithm we're still applying the updates at the end. Of course, you might already be thinking: that feels unnecessary, can't we just apply them during the episode? And yes, indeed you can. This is an interesting extension: you can now start with something that would be equivalent to Monte Carlo if you waited all the way until the end of the episode and only then applied these differences, but you can already start learning during the episode. So we haven't just reached an algorithm with the property of being independent of span, which is a computational property; we've also arrived at an algorithm that is able to update its predictions during long episodes, even though we started from the Monte Carlo algorithm. The intuition behind this update is that the same temporal difference error shows up in multiple Monte Carlo errors, in fact in all of the Monte Carlo errors for states that happened before this time step, and what we do is basically group all of these states together and apply this in one update. You can't always do that, but you can do it here. Now I'm going to look at that double summation a little more and show you why it works, and for that we're going to concretely consider an episode with four steps. It says delta v here; it should have said delta w. We've noted before that delta w for an episode is just the summation of all the Monte Carlo errors multiplied with their feature vectors, and we've also noted that we can write each of these as a summation of appropriately discounted temporal difference errors multiplied with that same feature vector, because we're just pulling apart the Monte Carlo error. Then essentially all that we're doing is, instead of summing across the rows, summing across the columns, and we notice that across the columns the temporal difference error is always the same.
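In symbols (a reconstruction of the argument above, for the linear case with features x_t and fixed value estimates):

\Delta w = \alpha \sum_{t=0}^{T-1} \big(G_t - v(S_t)\big)\, x_t
         = \alpha \sum_{t=0}^{T-1} \sum_{k=t}^{T-1} \gamma^{\,k-t}\, \delta_k\, x_t
         = \alpha \sum_{k=0}^{T-1} \delta_k \sum_{t=0}^{k} \gamma^{\,k-t}\, x_t
         = \alpha \sum_{k=0}^{T-1} \delta_k\, e_k,

with e_k = \sum_{t=0}^{k} \gamma^{\,k-t} x_t = \gamma\, e_{k-1} + x_k and e_{-1} = 0.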
So instead, we're going to merge all of these appropriately discounted state feature vectors into a vector called the eligibility trace. Feel free to step through that yourself, much more slowly than I'm doing right now. Then we can handle mixed multi-step returns as well, by putting the lambda parameter back in. We had our mixed multi-step return, the lambda return, and it turns out that if you write it as a sum of temporal difference errors (you can go through that yourself, in a similar way to what we did for the Monte Carlo returns), you get a summation which no longer just has the discount factor, but the quantity lambda times gamma; otherwise it's exactly the same thing. So we can write these lambda returns as a summation of one-step temporal difference errors, but with gamma times lambda where we had gamma before, and that means that if we go through all the steps we did for the Monte Carlo case again, we get exactly the same algorithm, except that the eligibility trace now has a trace decay of gamma times lambda rather than just gamma. Feel free to go through all of that yourself to convince yourself that this is actually the case. That means we can implement this algorithm and maybe also apply these updates online as we go. This is called an accumulating trace, because every time we add the features x to it. It turns out there are a couple of other traces in the literature as well, which I won't cover now, but they are mentioned, for instance, in the Sutton and Barto book. The equivalence here, between the Monte Carlo updates and doing this with traces, is only completely exact if we update offline; that means we store all of the weight updates (we can already add them together), but we don't actually change our weights yet. There are also trace algorithms that work, and are exactly equivalent in some sense, for online updating; those traces look slightly different, and as I mentioned I won't go into that, but I just wanted to make you aware that they exist. Okay, that brings us to the end of this lecture. In the next lecture we will be talking about applying these ideas to control, to optimize our policy. Thank you for paying attention.
Hi, and welcome to this fifth lecture in this course on reinforcement learning. My name is Hado van Hasselt, and today I will be talking to you about model-free prediction. There's a lot of background material on this slide, and this is a fairly long lecture; you don't have to worry about reading all of it at once. The most important chapters for this lecture are chapters five and six, and we will cover some material from the other chapters as well, but some of that will be shared with subsequent lectures, so this is really background material for a couple of lectures in a row. We just won't go through it in exactly the same sequence as the book does, which is why we list a fairly large chunk of background material here. Feel free to defer some of that reading until later; in fact, it might help your understanding to go through the material not all at once but to revisit it later. Also, don't forget to pause during the lecture: sometimes I will ask you a question, ask you to think about something, and of course that's a good occasion to pause for a second and actually reflect, maybe write some things down. And, as I said, this is a fairly long lecture, so feel free to make use of the fact that this is a recording.
You can pause, maybe take a break, or even consume the lecture over more than one day, if that works for you. I do encourage you not to wait too long between looking at different parts of the lecture, so you don't forget the beginning by the time you get to the end.
First, I'm just going to recap where we are. We're talking about reinforcement learning, which we defined as the science of learning how to make decisions, in which there's an agent interacting with an environment: the agent takes actions, the environment is observed by the agent, and internally the agent has a policy, a value function, and/or a model; in any case it should have a policy, because the agent should pick its actions somehow. The general problem involves taking into account time and consequences, because actions can change not just the immediate reward but also the agent's state, and also the environment state, which means that subsequent rewards might be affected by actions you've taken earlier. To recap where we are in the course: in the last two lectures we've seen planning by dynamic programming; Diana told you a lot about this. That is all about using computation to solve a known problem: we have a Markov decision process, a model if you want to call it that, and dynamic programming is a mechanism to infer accurate predictions or optimal policies for such a problem. In this and the subsequent lectures we're basically going to relax the assumption that you have access to this true model, and instead we're going to use sampling, interaction with the world, and we call that model-free. First we'll talk about model-free prediction in this lecture, which is the process of estimating values when you do not have the Markov decision process: you don't know what it is, but you can interact with it. This, of course, is the case when you're, for instance, in the real world: you could imagine that the world has some sort of really large Markov decision process underneath it, but you don't have immediate access to it, so all you can do is interact. After model-free prediction we can talk about model-free control, which is the process of optimizing values rather than just estimating them. Please keep in mind that this lecture is about estimation, so we're not going to talk much about policies. We will also talk a little bit about function approximation and some deep reinforcement learning in this lecture, and a little more in subsequent lectures; deep reinforcement learning especially will be deferred quite a bit, we'll only briefly touch upon it. Finally, in these upcoming lectures we will also talk about off-policy learning, which is also a prediction task; this term refers to making predictions about a policy different from the one you're following, and more on that will follow later. Also in later lectures we will talk about model-based learning and planning, policy gradients and actor-critic systems, and of course more deep reinforcement learning, and we will cover some advanced topics and current research, but only much later.
Okay, so let's get started. Our first topic will be Monte Carlo algorithms, and I'll explain in a moment what that means. The point here is to use sampling: we're going to interact with the world, and this allows us to learn without a model. If we're sampling complete episodes, in reinforcement learning we call this Monte Carlo.
That's a specific usage of the term: Monte Carlo sampling is also used to refer to other things in machine learning in general, but in reinforcement learning, when people say Monte Carlo, they typically mean sampling complete episodes. An episode is a trajectory of experience which has some sort of natural ending point, beyond which you're not trying to predict any further; we'll see examples in a moment. This is a model-free approach, because you don't need any knowledge of the Markov decision process; you only need interaction, or samples. To make that concrete, let's start with a simple example we've actually seen before, in lecture two: the multi-armed bandit. In the multi-armed bandit we have a bunch of actions, and in this case we're just trying to estimate the action values. In lecture two we talked about optimizing these action values by picking a smart exploration policy; for now we're only talking about model-free prediction, so we're only interested in estimating these action values. The true action value, given on the right-hand side, is the expected reward given an action, and the estimate at some time step t, written somewhat verbosely here, is simply the average of the rewards observed on the time steps at which that action was taken. We can also update this incrementally, which we also briefly discussed in lecture two: you have some step size parameter alpha, and you add to the action value estimate that you have right now, Q_t(A_t), the step size parameter times an error term, and this error term is simply the reward you've just observed minus your current estimate; all the other action values stay unchanged. If you pick this step size parameter to be exactly one over the number of times you've selected that action, then this is exactly equivalent to the flat average depicted above. You may have noticed that there's a slight change in notation: we've now moved to the notation that is more common when we talk about sequential decision processes, such as Markov decision processes, where we typically increment the time step immediately after taking the action. In the bandit literature it's more conventional to denote the reward as arriving at the same time step as taking the action, but in reinforcement learning, in the more general case, we typically increment the time step immediately after the action, which basically means we interpret the reward as arriving at the same time as your next observation. That's just a small notational note, to avoid confusion between this lecture and the earlier lecture on bandits. Now we're going to extend this slightly to make it more general, and consider bandits with states. For now the episodes remain one step long, as in the bandit case before, and this means that actions do not affect the state transitions: if you take an action, you will receive some reward and then you will see a new state, but this new state doesn't depend on your action. So there are now potentially multiple different states, which is a difference from before, but they don't depend on your actions, and that means there are no long-term consequences to take into account. The goal is then to estimate the expected reward conditioned not just on the action but also on the state. This is a slight extension of the normal bandit case we discussed before, and these are called contextual bandits in the literature.
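A minimal sketch of that incremental update, extended to the contextual case by simply indexing the estimates by (state, action); the dictionary-based bookkeeping and helper name are illustrative choices, not from the lecture. For the plain bandit without states, the same code applies with a single dummy state.

    from collections import defaultdict

    q = defaultdict(float)      # action value estimates, indexed by (state, action)
    n = defaultdict(int)        # visit counts per (state, action)

    def update(state, action, reward):
        # Incremental Monte Carlo estimate of the expected reward for (state, action).
        key = (state, action)
        n[key] += 1
        alpha = 1.0 / n[key]                  # step size 1/N gives the flat sample average
        q[key] += alpha * (reward - q[key])   # move the estimate towards the observed reward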
State and context are interchangeable terms in that sense. Now we take what is essentially an orthogonal step in this lecture: we will talk a little about function approximation, then return to these bandits with states and how the two relate, before moving on to the sequential case. To be more precise, we are talking about value function approximation. So far we have mostly considered lookup tables, where every state, or every state-action pair, has its own entry; think of a big table stored in the agent's memory with a separate entry for every state and action it might encounter, each of which you might update. This comes with some problems. There may be far too many states and actions to store such a table effectively in memory, and even if you could store it, that might not be the best idea, because it could be far too slow to learn the value of each state completely independently and individually. In addition, individual states, at least environment states, are often not fully observable. So far I have just said 'state' without saying exactly what I mean; I will say more about that in a moment. In the simplest case the environment state is fully observable, so the observation is the environment state, and you could use that same state as your agent state; in that case the agent state, the observation, and the environment state could all be the same. But even then it could be very large, and often it is not the case at all. You could still maintain a finite agent state and store values for each of them separately, but then you still suffer from the other problem: learning could be very slow if there are many different agent states. Our solution to these problems is function approximation: we write v_w or q_w, where w is some parameter vector, and we update these parameters instead of updating cells in a big table. The parameter vector w will be updated using Monte Carlo or temporal difference learning, the two algorithms covered in this lecture. The hope is that we can then generalise to unseen states, because the function is defined everywhere: if you pick a suitable function, then when you update the parameters associated with a certain state, the values of similar states are automatically updated as well. For instance, you could characterise a state by certain features, update the values of those features, and then a different state that shares some of those features gets a meaningful value estimate before you have even visited it. First, a note on states. We are not going to assume the environment state is fully observable. Recall that there is an agent state update function, which takes the previous agent state S_{t-1}, the previous action A_{t-1}, and the current observation O_t.
The reward can be considered part of that observation, or we can spell it out explicitly and write it as a separate input to this agent state update function; the subsequent agent state S_t is then just a function of these inputs, as we discussed in the first lecture. Henceforth, whenever we write S we mean the agent state. You can think of it as just a vector of numbers inside your agent, or, in the simplest case, it could simply be the observation. Indeed, as mentioned before, if the environment is fully observable, the agent state could equal the environment state, but that is a special assumption that will not always hold, so when we say state we mean agent state. For now we will not talk about how to learn this agent state update function. As you can imagine, having a good one can be quite important; sometimes you can hand-construct one, and sometimes you are much better off learning it. We will cover that in later lectures, but for now we assume some agent state update function is given, and, if that makes the algorithms easier to understand, feel free to read 'state' in the equations simply as the observation. Now let's talk about linear function approximation. There are a couple of reasons to do so: it makes things more concrete to work with a specific function class; there are things we can say for the linear case that we cannot completely say for the non-linear case; and it is easier to analyse theoretically, which we will not do yet but will do later. So it is a useful special case to keep in mind. We assume a fixed feature vector. Note that we already assumed a fixed agent state update function, so we have set aside where the states come from; now we additionally set aside the states themselves and assume a feature mapping that turns the state into a vector of numbers. We consider that mapping fixed for now; later we might consider learning it. We also introduce a little shorthand: we simply write x_t whenever we mean the features of the state at time step t; please keep that in mind. For instance, features could include the distance of a robot from different landmarks, trends in the stock market, or piece and pawn configurations in chess. Sometimes you can come up with such features by hand; later we will discuss ways to find them automatically. The linear function approximation approach then takes these features and defines our value function to be the inner product, or dot product, between the parameter vector w and the features x(s) of the state s that we see. Probably unnecessarily, the slide also reminds you what the inner product looks like: it is just a sum over the components, multiplying each feature by its associated weight.
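As a minimal illustration of this setup, here is a sketch in Python; the specific feature mapping (distances to two landmarks plus a bias) is a made-up example of the kind of hand-crafted features mentioned above, not one from the lecture:

```python
import numpy as np

def features(state):
    """Hypothetical hand-crafted feature mapping x(s).

    `state` is assumed to be a 2D position; the features are the distances
    to two fixed landmarks plus a constant bias feature.
    """
    landmarks = np.array([[0.0, 0.0], [5.0, 5.0]])
    dists = np.linalg.norm(landmarks - np.asarray(state), axis=1)
    return np.concatenate([dists, [1.0]])

def linear_value(w, state):
    """v_w(s) = w^T x(s): the inner product of weights and features."""
    return np.dot(w, features(state))

w = np.zeros(3)                       # one weight per feature component
print(linear_value(w, (1.0, 2.0)))    # 0.0 before any learning
```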
Now we can talk about how to update those weights, and for that we have to pick an objective. In this lecture we are doing prediction, so our objective is to minimise a loss defined as the squared distance between our current estimate, in this case the linear function, and the true value function v_pi. Obviously we do not have v_pi, so we will replace it with quantities we can actually use, but keep this objective in mind for now. If we could compute stochastic gradients of this objective, stochastic gradient descent would converge to a global optimum of this loss, because for a linear function the loss is, as they say, convex: there is only one optimal solution, so it uniquely defines the optimal parameter vector w. That does not mean we can reduce the loss all the way to zero: it could be that the features are not good enough to predict the value of every state accurately. The stochastic gradient descent update rule is very simple. First note that the gradient of the linear value function with respect to the parameters w is simply the feature vector, so at time step t, if we see state S_t, the gradient of the value of that state is just x_t. The stochastic gradient update, if we had the true value function v_pi, would then be to add the step size parameter alpha times the prediction error times the feature vector, and we could use this to update the parameter vector w on every step. Of course we have to replace v_pi with something we do have, and I will get to that in a moment. First, note that the table lookup case we considered, for instance, in the bandit lecture is a special case. To store values in a big table you need a finite number of states, which you can enumerate, and then you can take the feature vector to be a one-hot vector: zeros in all components except the one corresponding exactly to the state you see, so there are exactly as many feature components as there are states. The linear value estimate then simply picks out the weight corresponding to that state, which means the weight for a state is essentially your value estimate for that state.
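As a sketch of this update and of the one-hot special case (the function names are mine, and the `target` argument stands in for whatever we later substitute for v_pi, such as a return or a TD target):

```python
import numpy as np

def sgd_value_update(w, x, target, alpha):
    """One stochastic gradient step on the squared value error.

    w      : parameter vector
    x      : feature vector x(s) for the current state
    target : stand-in for v_pi(s)
    alpha  : step size
    """
    prediction = np.dot(w, x)
    return w + alpha * (target - prediction) * x   # dw = alpha * error * gradient

def one_hot(state_index, num_states):
    """Table lookup as a special case: one-hot features index into w."""
    x = np.zeros(num_states)
    x[state_index] = 1.0
    return x

# With one-hot features the update only changes w[state_index],
# so w[s] behaves exactly like a table entry V(s).
w = np.zeros(5)
w = sgd_value_update(w, one_hot(2, 5), target=1.0, alpha=0.1)
print(w)   # [0.  0.  0.1 0.  0. ]
```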
Now let's return to the reinforcement learning setting; that was a more generic detour into function approximation, and we continue with the Monte Carlo ideas from before. We were dealing with bandits with states, and now we make q a parametric function, for instance a neural network or a linear function, and we use the same squared loss, now also multiplied by one half for convenience. Consider the gradient of this loss: similar to before, the new parameters w_{t+1} are the previous parameters w_t minus a small step size times the gradient of the loss. Writing out that gradient, as on the second line of the slide, and noting that the expectation does not depend on our parameters so we can push the gradient inside, we get an update equal to the step size times an error term, the reward minus our current estimate, times the gradient of the action value. We can sample this to get a stochastic gradient update, as we saw before, and the tabular case is again the special case where the gradient just indexes into the cell for that state-action pair. So we use exactly the same machinery we saw for the bandit-with-states setting, and stochastic gradient updates do the prediction; this also works for very large state spaces. This is just regression, linear or non-linear, which you have probably covered in previous courses, so we will not go into much detail, but it is a valid update and it converges, under suitable assumptions, to the best parameter vector you can find. These estimates are, I remind you, limited in the sense that you cannot expect all values to be perfect: your function class may simply not be rich enough to represent every value accurately in every state, but the process converges to the best parameters within the class. For linear functions we extend the action-value approach in the same way: we assume features for each state-action pair and multiply the parameter vector w with those features, which means the gradient of the action value is simply the features of that state-action pair, and the stochastic gradient descent update for the weights replaces the gradient with those features. So for the linear update, this is a step size times a prediction error times a feature vector; for the non-linear update it is, very similarly, a step size times a prediction error times a gradient. Many of our next algorithms will look exactly the same, changing only certain aspects of this update. Here we are still in the bandit case, learning expected rewards with no sequences yet; that is where we go next. We now consider sequential decision problems. We are still doing prediction: our goal is to predict the value of a policy; optimizing the policy, which is control, comes in subsequent lectures. We sample trajectories under our policy, and, although it is not shown on the slide, the probabilities of those trajectories of course also depend on the dynamics of the underlying Markov decision process. Then, perhaps somewhat obviously, we can extend the bandit approach to full returns by simply sampling a full episode and constructing a return. I remind you that the return is the accumulation of rewards: the return G_t from time step t into the future is defined as the immediate reward R_{t+1}, plus the discounted next reward R_{t+2}, and so on, until the end of the episode, which arrives at some time T in the future. A return only stretches as far as the episode; after the episode is done we imagine being reinitialised in some state and going through the whole process again. The expected return, which is our goal, is defined simply as the expectation of these returns, and, just as in the bandit-with-states setting, we can sample it and use the sampled return instead of the expected return as the target in our updates.
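To make the return concrete, here is a small sketch; the episode format is my own assumption, a list holding the rewards observed after time step t:

```python
def discounted_return(rewards, gamma=1.0):
    """G_t = R_{t+1} + gamma R_{t+2} + ... + gamma^{T-t-1} R_T.

    `rewards` is assumed to hold the rewards observed from time t+1
    up to the end of the episode. Iterating backwards keeps it O(T).
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0, 0, 1], gamma=0.9))   # 0 + 0.9*0 + 0.81*1 = 0.81
```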
The algorithm that does this is called Monte Carlo policy evaluation, and it is covered in chapter 5 of Sutton and Barto. Now I will walk you through an example to give you a bit more intuition for how the algorithm works in practice. The example is the card game blackjack, in which the goal is to get more points than an opponent called the dealer. You go first and try to accumulate as many points as you can, but not more than 21: if you get more than 21 points you go bust, as they say, so essentially your goal is to get as close to 21 as possible without going beyond it. To do so you draw cards, and each card is worth a certain number of points: the number cards are simply worth their face value, so a three or a four is worth three or four points; all the picture cards, the jack, queen, and king, are worth 10 points; and the ace is special: it is worth either 11 points or, if you choose, 1 point. This is useful when you draw a card and go above 21: if you have an ace you can say, ah no, my ace is no longer worth 11 points, I will make it worth 1 point instead, and now you are below 21 again. We formalise this problem as follows, using a tabular approach with enumerated states. The state consists of, first, your current sum, the sum of the cards you have so far, which will be between 12 and 21 for reasons I will explain in a moment; I already said that if you go beyond 21 you have gone bust and the episode ends, so that state is unimportant. Second, the dealer's showing card: the dealer plays after you and will draw more cards once you are done, but you can already see one of their cards, and this is informative for deciding whether to continue. Third, an additional state variable telling you whether you have a usable ace, which simply means: do you have an ace, and can you count it as 11 without going above 21? So, say you have 16 points, from an ace and a five, and you then draw a ten.
That would bring you to 26 points, which, as I explained, would make you go bust, but you can now say: my ace is only worth one point, I am back at 16 points, and I can go again. The state has changed, though, because you no longer have a usable ace. In terms of the action space there are two actions. You can stick, at which point it is the dealer's turn: they resolve their hand and your episode terminates. Or you can draw, which means you take another card; after drawing you can draw again on the next step, or stick. When you stick, the episode always terminates, and you get +1 if the dealer does not end with a higher sum than you, or if the dealer goes bust, which is also possible: if the dealer goes above 21 they lose, and if they stay at or below 21 but have fewer points than you, they also lose and you get +1. If you end on exactly the same number you get 0, and if the dealer gets more points than you without going above 21, you get -1. If instead you draw and go above 21 without a usable ace to prevent it, you get -1 and the episode terminates immediately: the dealer has won. Otherwise you get 0 and the game continues, so you can draw again or stick. As mentioned, you start with at least 12 points. This is simply because with fewer than 12 points you should always draw more cards: you can never go bust, since even if you draw an ace you can choose it to be worth one, so you can only gain points. You can think of reaching 12 or more as a stochastic process that brings you to the initial state of your episode. Note that this state description is slightly partially observable: your sum is just a number, so you do not know exactly which cards it consists of, and even knowing whether you have a usable ace does not give you all the information you could have; for instance you could hold two aces, and that is hidden from you. So there is some slight partial observability here, but it turns out not to be a big factor. Then we run Monte Carlo learning: we generate a whole bunch of episodes under some fixed policy, sample the returns for those episodes, and generate the plots I am about to explain, which show the value estimates for that policy. In later lectures we can talk about how we might then improve the policy to do better, but this is a reasonable policy, in which, I believe, you draw if you have fewer than 17 points and otherwise stick, or something similar. Four plots are shown, and I will explain what they are. First I want to draw your attention to the bottom right, where you can see the axes of these plots: one axis is the card the dealer is showing, which is an ace, a two, a three, and so on up to a ten, where all the picture cards are merged into the ten because they are all worth 10; the other axis is your current sum, which is 12, 13, and so on up to 21.
The z-axis, the height, is the estimated value, which is always between -1 and +1. That is because the maximum reward you can get in an episode is +1 at the end, the lowest is -1 at the end, and all the intermediate rewards, if there are any, are zero, since drawing without going bust gives 0; so the total return of each episode is -1, 0, or +1. Now I want to ask you a couple of questions about these plots; feel free to pause the video if you want to think about them. Look in particular at the top-left plot. The left column corresponds to having seen ten thousand episodes, the right column to having seen half a million episodes; the top row corresponds to having a usable ace, the bottom row to not having a usable ace in the state. The first question: why does the top-left plot look so much bumpier than the plot to its right? The second question: why does it look bumpier than the plot below it? Think about that for a second, and then I will give you my explanation. The first one is perhaps somewhat obvious: after 10,000 episodes we do not yet have much data for each of these states, whereas after half a million episodes we have by now accumulated quite a bit of data for each state, so our value estimates have improved. The difference between the top and the bottom is a little less obvious. The reason, I would argue, is that the states in the top row are less common than the states in the bottom row: in a normal deck of 52 cards only four are aces, so states in which you hold an ace are comparatively rare. Even though the left column has seen 10,000 episodes in total, that does not mean every state has been visited equally often, and the states with an ace may have been visited much less often. Finally, look at the shape of the value function: perhaps as expected, the value is higher when your own sum is high, and if your sum is 21 it is quite close to +1, except when the dealer is showing an ace, because then it is not that unlikely that the dealer also reaches 21, in which case your return is 0 rather than +1. So this is an example of Monte Carlo learning and how you can use it to find the value of a policy; somewhat obviously, we could later use this information to improve the policy, but we will not go into that yet. What we have seen is that Monte Carlo algorithms can indeed be used to learn value predictions. Unfortunately, when episodes are very long, learning can be quite slow. Blackjack episodes are very short, maybe one, two, or three actions, but if episodes take hundreds of actions or more, and you have to wait all the way until the end of an episode every time before you can start updating, that can be tedious. So with Monte Carlo we have to wait until an episode ends before we can learn.
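As a sketch of tabular Monte Carlo policy evaluation of the kind used here, assuming a hypothetical `run_episode` function that samples one episode under the fixed policy and returns it as a list of (state, reward) pairs; this version uses every-visit averaging:

```python
from collections import defaultdict

def mc_policy_evaluation(run_episode, num_episodes, gamma=1.0):
    """Tabular every-visit Monte Carlo prediction.

    `run_episode()` is assumed to return a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    V(s) is the running average of the returns observed from s.
    """
    values = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(num_episodes):
        episode = run_episode()
        g = 0.0
        # Walk backwards so the return from each visited state is built incrementally.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            counts[state] += 1
            values[state] += (g - values[state]) / counts[state]
    return values
```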
Why do we have to wait? Because the return is not well defined before the episode ends: we are updating towards the full return of an episode, so we must wait until the episode terminates before we even have the quantity we want to use in our update. In addition, these returns can have very high variance in some cases, especially when episodes are long. So are there alternatives, other algorithms that do not have these downsides? Of course I would not be asking if the answer were no, and this brings us to one of the most important concepts in reinforcement learning: temporal difference learning. Let me start by reminding you of the Bellman equation, which Diana discussed at length in the previous lectures. The Bellman equation relates the value of a state to the expected value of the next state, and it is in fact a characterisation of the value of a policy: the value of a policy, defined as the expected return, turns out to be exactly equal to the expected immediate reward R_{t+1} plus the discounted true value of that policy at the next state S_{t+1}. We have seen that you can approximate these values by iterating, essentially turning the definition of the value into an update: the v_pi inside the expectation is replaced by the current estimate v_k, because we proceed in iterations, perhaps across all states at the same time, which we denote with an iteration number k, and then we update the value function, maybe for all states at once, with a new estimate v_{k+1} defined as the expected immediate reward R_{t+1} plus the discounted current estimate v_k of the next state's value. We have seen that these algorithms do learn, and do find the true value of a policy. On the right-hand side there is an expectation, but we could sample it: we could just observe a sample R_{t+1} plus the discounted value estimate of the next state S_{t+1} and use that. Updating all the way to that sampled target would be too noisy, so instead we take a small step towards it. This now looks very similar to the Monte Carlo learning algorithm, except that instead of updating towards the full return, we update towards this other target: the reward plus the discounted estimate of the next state's value. So the change from Monte Carlo learning to this algorithm is that the full return has been replaced with something that uses our current value estimates. I have written the tabular update here, but you can extend it to function approximation in the same way as we did for Monte Carlo learning; or rather, what we actually did was bandits with states, which become Monte Carlo learning when you swap the reward for the return, and here we similarly swap that return for this new target. To recap: we are in the prediction setting, learning v_pi online from experience generated under a policy pi. The tabular Monte Carlo update looks like this: we take some state and a value estimate, which I am indexing with n here rather than t.
I could equally have used k, as before, because these updates cannot actually happen at time step t: the return is only known at the end of the episode, at some later time step, so rather than saying exactly which time step the update happens at, I just say there is some iterative procedure in which we update the value function using the return. Temporal difference learning, the new algorithm we just talked about, instead uses the new target, which unrolls the experience by one step and then uses our current estimates to replace the rest of the return. It is called temporal difference learning because the error term is called a temporal difference error: it looks one step into the future and compares the value we currently think a state has with what we think after taking that one step. This temporal difference error, defined as R_{t+1} plus the discounted value estimate at the next state S_{t+1} minus our current value estimate at S_t, is typically denoted with a delta; keep in mind that delta_t is the temporal difference error, defined exactly like this. Now let's build a bit more intuition by thinking about how these algorithms engage with the problem at hand. Dynamic programming works like this: there is a tree of possibilities, you are in some state, and you consider everything that might happen next. States are denoted here by the white nodes, actions by the smaller black nodes. Dynamic programming looks at all possible actions, in this case two, and then also at all possible transitions for each action; in this case each action can randomly end up in two different states, so there are four states in total you could end up in after one step, and dynamic programming considers all of them, which of course requires a model that lets you enumerate these possibilities. Monte Carlo learning, conversely, takes a trajectory: it samples an episode all the way to the end, the terminal state denoted here by a green box with a T, uses that trajectory to construct the return, and updates the value of the state at the beginning towards that return; of course you could also update all the other states along the way towards the return from each of those states. This new algorithm, temporal difference learning, instead uses only one sample and only one step. So it has something in common with dynamic programming, in the sense that we only look one step deep, and something in common with Monte Carlo learning, in the sense that we sample, so it does not need a model. We call this use of our own estimates at the next time step bootstrapping. This is different from the statistical bootstrap, which refers to taking a data set and resampling from it as if it were the underlying distribution; it has nothing to do with that. In reinforcement learning, bootstrapping typically refers to using a value estimate to update a value estimate, evocative of pulling yourself up by your own bootstraps, and it is good to keep in mind that this is the term for doing that.
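Here is a minimal tabular TD(0) sketch; the transition format is my own assumption, with `V` a dict of state values:

```python
def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """One temporal difference update: V(s) <- V(s) + alpha * delta.

    delta = R + gamma * V(s') - V(s) is the TD error. Terminal states are
    defined to have value zero, hence the `done` flag.
    """
    next_value = 0.0 if done else V.get(next_state, 0.0)
    delta = reward + gamma * next_value - V.get(state, 0.0)   # TD error delta_t
    V[state] = V.get(state, 0.0) + alpha * delta
    return delta
```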
Under this terminology, Monte Carlo learning does not bootstrap: it does not use value estimates to construct its targets. Dynamic programming does bootstrap, and temporal difference learning does too; both use the current value estimates as part of the target for their update. Additionally, we can think about sampling: Monte Carlo samples, dynamic programming does not sample but instead uses the model, and temporal difference learning does sample. So we have three algorithms with different properties. Of course we can apply the same idea to action values as well: with an action value function q we do exactly the same thing as before, except that we take one step and also immediately take the subsequent action A_{t+1}, and we use that to construct the temporal difference error in exactly the same way; essentially every occurrence of a state is replaced by a state-action pair at the same time step. This algorithm is called SARSA, because it uses a state, action, reward, state, and action; the name was coined by Rich Sutton. In terms of properties, temporal difference learning is model-free: it does not require knowledge of the Markov decision process and can therefore learn directly from experience. Interestingly, thanks to bootstrapping it can also learn from incomplete episodes: if an episode is very long, you do not have to wait all the way until its end before you can start learning, which can be quite beneficial because you can learn during the episode. The most extreme case you could consider is a lifetime that is one big episode, with no termination at all; some problems are indeed effectively formalised that way, and then it becomes essential to be able to learn during the episode, because there is only one.
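A sketch of the SARSA-style temporal difference update for action values, again tabular, with `Q` a dict keyed by (state, action) pairs; the argument layout is my own:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=1.0):
    """SARSA prediction update: Q(s,a) <- Q(s,a) + alpha * delta,
    with delta = R + gamma * Q(s', a') - Q(s, a).
    """
    next_q = 0.0 if done else Q.get((s_next, a_next), 0.0)
    delta = r + gamma * next_q - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta
    return delta
```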
To illustrate the differences between these algorithms, Monte Carlo and temporal difference learning, I will step through an example called the driving home example, which is also due to Sutton and Barto. We enumerate a small number of states, and the idea is that we start at the office and want to get home. First, the columns. The first column shows the state we are in. The second column shows the elapsed minutes so far; the per-step difference in elapsed minutes can be considered the reward, so if five minutes pass between leaving the office and reaching the car, we can say the reward on that transition was five. We are only doing prediction here, so we do not have to worry about the sign of the reward: if you wanted to maximise speed you might minimise the minutes, or use negative rewards, but for now we just say there was a reward of five along the way. The third column, the predicted time to go, is the current value estimate at the state we are currently in: when we are just leaving the office, we predict it is 30 minutes in total to get home, which is the accumulation of the rewards along the way. Then, as a helpful mental aid, there is a final column, the predicted total time, which is simply the sum of the previous two columns: the minutes that have already passed plus the predicted time still to go. This gives a feel for how the total prediction changes. When leaving the office, by definition zero minutes have passed, and we predict 30 minutes to go. When we reach the car we notice it is raining, which is bad, because it tends to be busier on the highway when it rains: five minutes have passed, and yet we now predict 35 more minutes, more than before, so the total predicted time has gone up to 40 minutes. The way to interpret this is that the reward along the way was five and the new prediction is 35. At the next state we exit the highway; the total elapsed time is now 20 minutes, so you can think of the reward along the way as 15, since it was five minutes when we reached the car and is 20 now, and from here we predict another 15 minutes to go, which means the predicted total time has actually gone down a little, to 35: maybe the highway was less busy than we feared and things went more smoothly than expected. But after exiting the highway we find ourselves behind a truck 10 minutes later: another 10 minutes have passed, so the reward can be considered 10, and from this point we predict another 10 minutes to go, so the total predicted time has gone up again, to 40. At last we reach the home street: 40 minutes have now passed, so another 10 minutes went by, and we predict three more minutes to go, so the total predicted time has crept up a little again, to 43. That prediction turns out to be completely accurate, and we arrive home after 43 minutes, another three minutes later. Now, what do the different algorithms do? The Monte Carlo algorithm looks at the total outcome and updates all the states seen along the way towards it. It basically looks at the whole sample and says: when you were leaving the office you thought it was 30, but it was actually 43; when you reached the car, five minutes had passed and you thought there were 35 more, so your total prediction was 40, but it should have been 43, which means that instead of 35 you should perhaps have predicted 38 when reaching the car; and similarly, when exiting the highway, and when you got stuck behind the truck on the secondary road, those predictions have to be updated upwards. That is what Monte Carlo does. If we look at the right-hand plot, temporal difference learning, it looks a little different. Leaving the office, we then reached the car and it was raining, and we predicted there was still more time to go: five minutes had already passed and we thought it was 35 more minutes. So whereas from the office we had said it was just 30 minutes, now that we have reached the car we think it is more like 40 minutes in total, and we should update that previous state's value upwards, towards 40.
And you can execute that immediately: you can update that state value straight away. Then, when we had reached the car in the rain, we thought it was 35 more minutes, but when we exited the highway only 15 of those minutes had passed, and from that point we thought it was not that long anymore, another 15 minutes to go. That means that instead of 35, the prediction at the car should perhaps have been 30, so temporal difference learning tells you to update that one downwards. The purpose of showing you this is not to say that one is right and the other wrong, but to show the differences between these algorithms and how they operate; we will see more examples later on as well. Now let's compare these algorithms a little more and talk about the advantages and disadvantages of each. As I mentioned, temporal difference learning can learn before knowing the final outcome; in fact it can learn online, after every step it has seen, whereas Monte Carlo learning must wait until the end of the episode, when the return is known, before it can execute its update. In addition, temporal difference learning can learn without the final outcome ever arriving. This is useful, for instance, when you only have incomplete sequences: you might have a database of experience you want to learn from, but it is corrupted or missing some data, and temporal difference learning can still be applied to the individual transitions that you do have, whereas Monte Carlo cannot do that, because it really needs the full return in order to do its update. The ability to learn without knowing the final outcome also means that temporal difference learning works in continuing environments in which there are no episode terminations, whereas Monte Carlo of course needs full episodes in order to do its updates. There is one more advantage after this: temporal difference learning is independent of the temporal span of the prediction. That is a mouthful; what I mean is that the computation of temporal difference learning is constant on every time step: how many steps an episode takes does not matter for the per-step computational complexity of temporal difference learning. Why is that not true for Monte Carlo? Because Monte Carlo needs to store everything: TD can learn from single transitions, but Monte Carlo must store all the predictions made during an episode in order to update them at the end of the episode, so its memory requirements grow as episodes become longer and longer. This is a purely computational property; it has nothing to do with the statistical properties of these algorithms. On the flip side, temporal difference learning needs reasonable value estimates: if the value estimates used to construct the targets are very bad, then the updates will not be very good either, which means there is a bit of a bias-variance trade-off going on here. The Monte Carlo return is an unbiased estimate of the true value; that is, in fact, how the true value is defined, as the expectation of these returns.
The temporal difference target, by contrast, is a biased estimate, unless of course you already have accurate predictions, but that is an edge case, because we do not assume we have them in general. The temporal difference target does, however, have lower variance: the return may depend on many random actions, transitions, and rewards, whereas the temporal difference target depends on only one random action, transition, and reward. But in some cases temporal difference learning can have irreducible bias. For instance, the world might be partially observable, so the states we plug into these value estimates might not tell us everything. That is already a problem for Monte Carlo learning, because the states we are updating lack the information needed for a completely accurate description, so the value estimates will be a little off; but you can imagine, and you can show this theoretically as well, that it gets worse when you additionally use those slightly-off value estimates to construct the target for your update. Monte Carlo learning, a different way to think about this, implicitly accounts for all the latent variables along the way: even though you cannot observe exactly where you are, the return itself takes that into account, because the return depends on all the environment variables you cannot observe. Similarly, but a little differently, the function used to approximate the values might fit poorly, and this might also be true in the limit: it might be that your function class, say a linear function, cannot actually represent accurate predictions for all states. If that is the case, temporal difference learning has irreducible bias in its target, and therefore also in the values it eventually learns. In the tabular case, however, both Monte Carlo and temporal difference learning converge to the true values. We will talk more about these properties, and especially about the function approximation part, in later lectures. Now, to build even more intuition, let's look at another example, a random walk. It is a small example, meant to be intuitive: we have five states, and in each of them two actions. We start in the middle state, denoted C, and we go either left or right with equal probability. The initial value estimate is one half for every state. Above the transitions you see the rewards, which are zero on almost all transitions, except that taking the right action from state E terminates the episode with a reward of one, and taking the left action from state A also terminates the episode, with a reward of zero; all other transitions are worth zero. The true values happen to be 1/6 for state A, 2/6 for state B, and so on, up to 5/6 for state E. It might be an interesting exercise to actually prove this: you could write down the probability of each transition and the reward function and use dynamic programming to find these values. The Markov decision process is shown at the top of the slide, and below it we will look at how the values get updated, starting with what TD does.
There are several lines in this plot, corresponding to our value estimates for each state after a certain number of episodes. The line marked 0 is completely horizontal, because we initialised all state values at one half. The line marked 1 is identical to it in every state except state A; the reason is that this first episode apparently terminated by stepping left from state A, with a reward of zero. This is an undiscounted problem, so on all the other transitions, say stepping from state B to state C, we would have seen a reward of zero along the way, and the temporal difference target is the reward plus the next state's value (undiscounted, so there is no discount factor, or equivalently the discount factor is one); that next value at C is one half, because that is how the values were initialised, so the target is one half, but our current estimate at state B is also one half, so the temporal difference error is zero and we do not update the value of state B on that intermediate step. Eventually, though, we reach state A and take the step into the terminal state. The terminal state by definition has value zero, so the temporal difference error on that last transition is minus one half: a zero reward, plus a zero next-state value, minus our current estimate of one half at state A. We then see that the value of state A has been updated, in a tabular way, slightly downwards, roughly, perhaps even exactly, ten percent of the way down: it started at 0.5 and now seems to be around 0.45, so we can infer that the step size parameter, the learning rate, was 0.1 for this TD run. We can then see that after ten episodes all of the values have by now been updated a little, and after a hundred episodes we are very close to the true values, shown here as a diagonal. Stepping through a problem like this can be quite informative when you implement these algorithms for the first time: take it easy, go step by step, and look at every update, both to make sure there are no errors in the implementation and to better understand the algorithms. Then we can run the different algorithms, Monte Carlo learning and temporal difference learning, on this problem and look at the root mean squared error of the state value predictions, condensed over all states, so the total or average error across all states. On the x-axis we have the number of episodes seen so far, so learning proceeds as we go from left to right; on the y-axis we see the error of the state value predictions. First I want to draw your attention to the darkest line in both plots, the black line, which corresponds to the lowest step size we tried, 0.01. Both for Monte Carlo learning and for temporal difference learning we see a smooth progression where the error goes down as the number of episodes goes up, but it is fairly slow: the error after 100 episodes is not that low yet, and it clearly could go further down if we left it running longer. But if we wanted to learn faster, say we only had 100 episodes and had to stop here, it might make sense to pick a higher learning rate, a higher step size.
Indeed the brown curve, which corresponds to a three times larger step size of 0.03, has a lower error, not just at the end but at every episode along the way. However, a trade-off then starts to appear. Look again at the Monte Carlo plot and consider the brightest line, corresponding to a fairly high step size of 0.3: learning is indeed very fast at the beginning, but it almost immediately stabilises, not going much below roughly 0.45, and the variance is clearly quite high. Why is this the case? Because the Monte Carlo update itself has high variance, and we cannot reduce that variance further since we do not have a decaying step size here, which means the total error of this algorithm after a couple of episodes is just the variance in its updates. Similarly, if we reduce the step size slightly to 0.1, learning is slightly slower at the beginning, but the error does go lower before also stabilising, in this case slightly above 0.3. We can see something similar happening for TD, where there is this trade-off: you can learn very quickly at the beginning, but the error may stabilise at a higher point. If we compare temporal difference learning to Monte Carlo learning, however, we see that temporal difference learning allows you to set the step size higher, has a different trade-off, and indeed many of these errors are smaller than for Monte Carlo. For instance, at the midway point of 50 episodes, temporal difference learning prefers a step size of 0.1 if you had to pick one constant step size, whereas Monte Carlo learning prefers the lower step size of 0.03 because it has higher variance, and the error for Monte Carlo is higher even if we tune the step size over these four options. Obviously you could also consider step size schedules that start high and decay as you learn more; that is not the point here. I just want to show you these properties of the algorithms, where you can clearly see from the plots that Monte Carlo simply has higher variance than temporal difference learning, and in this case that leads to higher errors for any constant step size, essentially, if you tune over constant step sizes.
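A sketch of this kind of experiment on the random walk; the state encoding, the true values used for the error, and the every-visit constant-step-size Monte Carlo variant are my assumptions about a reasonable reproduction, not the exact setup behind the plots:

```python
import numpy as np

TRUE_V = np.array([1, 2, 3, 4, 5]) / 6.0   # true values of states A..E

def random_walk_episode(rng):
    """Sample one episode; returns [(state, reward), ...] with states 0..4."""
    s, steps = 2, []                        # start in the middle state C
    while True:
        step = rng.choice([-1, 1])
        r = 1.0 if (s == 4 and step == 1) else 0.0
        steps.append((s, r))
        s += step
        if s < 0 or s > 4:
            return steps

def rms_error(v):
    return np.sqrt(np.mean((v - TRUE_V) ** 2))

def run_td(num_episodes=100, alpha=0.1, seed=0):
    rng, v, errors = np.random.default_rng(seed), np.full(5, 0.5), []
    for _ in range(num_episodes):
        episode = random_walk_episode(rng)
        for i, (s, r) in enumerate(episode):
            next_v = v[episode[i + 1][0]] if i + 1 < len(episode) else 0.0
            v[s] += alpha * (r + next_v - v[s])          # TD(0), undiscounted
        errors.append(rms_error(v))
    return errors

def run_mc(num_episodes=100, alpha=0.1, seed=0):
    rng, v, errors = np.random.default_rng(seed), np.full(5, 0.5), []
    for _ in range(num_episodes):
        episode = random_walk_episode(rng)
        g = 0.0
        for s, r in reversed(episode):                   # build returns backwards
            g = r + g
            v[s] += alpha * (g - v[s])                   # constant-alpha Monte Carlo
        errors.append(rms_error(v))
    return errors

print(run_td()[-1], run_mc()[-1])   # final RMS errors after 100 episodes
```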
Now let's look even more closely at these properties by considering batch updating. We know that tabular Monte Carlo and temporal difference learning do converge to the true values: if we store all the values separately in a table, we can update each of them independently and they eventually become accurate, under the conditions that the amount of experience goes to infinity and the step size decays slowly towards zero. But what about finite experience? In practice we will not have infinite experience: we learn for some time, maybe a long time, but not infinitely long, and perhaps we never decay the step size all the way to zero. So now we consider a fixed batch of experience: specifically, we have collected K episodes, each consisting of some number of steps, which may differ per episode and may be large or small, and we repeatedly replay the samples from these episodes, applying either Monte Carlo learning or temporal difference learning. The slide says TD(0); the reason for calling the algorithm TD(0) will become clear later, but it is just the standard temporal difference learning algorithm we have discussed so far. You can view this as similar to sampling from an empirical model: in the dynamic programming lecture we talked about having a model and using it to learn, and here the data set in some sense defines observed frequencies of transitions, which is similar to having a model, except that it is an empirical model and will not exactly match the real underlying model, because you only have a finite amount of experience. Now let's apply this idea of batch learning to a specific small example with only two states, small enough that we can reason it through all the way. Call the two states A and B; there is no discounting, and say we have eight episodes of experience, each line below denoting exactly one episode. One of these episodes starts in state A, gets a reward of zero, proceeds to state B, gets another reward of zero, and then terminates. The next episode, on the next line, started in state B instead of A, got a reward of one, and terminated; that happens more often, with six out of the eight episodes of that form, starting in B and terminating with a reward of one. And there is one more episode that also started in B but terminated with a reward of zero. Now, even without thinking about these algorithms at all, I want you to think about what the values of states A and B should be, if this is all the information you have, absolutely nothing apart from this data set. I encourage you to pause the video and think about this for a little while, and then I will give you what I think are plausible answers. You may have come up with more than one answer, or some of you may have come up with one answer and some with a different one; let me motivate two different answers here. First, state B, which is perhaps somewhat obvious: eight episodes were ever in state B, and six out of eight times we saw a +1 while two out of eight times we saw a 0, so maybe the appropriate value for state B is 0.75. We have also seen one episode that visited state A, and in that episode the total return was zero, so one could reasonably say the value of state A is zero, or at least that this is the best guess we can make. Some of you might object: no, the right way to think about this is that whenever we were in state A (sure, it only happened once), we transitioned to state B with a reward of zero along the way, which implies that A and B must have exactly the same value. That is also a reasonable argument, but it means the value of state A would be 0.75, which is quite different from zero. If you use that second line of reasoning, you are effectively arguing that the underlying model looks like this.
State A goes to state B a hundred percent of the time, as far as we know, and B yields +1 seventy-five percent of the time and 0 twenty-five percent of the time. So what is the right answer here? That is actually a little unclear, and I am going to explain why both answers are in some sense reasonable. Monte Carlo learning converges to the best mean-squared fit to the observed returns: you can write this as a sum over the episodes k from 1 to K, and within each episode over the time steps from 1 to T_k, of the squared error between the returns observed from each visited state and our current value estimates, and Monte Carlo minimises that. That indeed sounds like a very reasonable approach: we are minimising the difference between the returns we have seen and the value estimates we have. In the example we just saw, it implies that the value of state A is zero, because the only return we ever observed from state A was zero. Temporal difference learning, instead, converges to the solution of the maximum-likelihood Markov model given the data, which is what we saw on the previous slide: the most likely model given the data observed so far, and temporal difference learning turns out to find the values that correspond to that model. So if you agree that this is what you should be estimating, and then solving, that is what temporal difference learning does: it gives the solution of the empirical Markov decision process, treating the empirical data you have seen as if it were the true dynamics, which in the example gives the same estimate for both states A and B. Now you might find one better than the other, but why would you take one or the other? This is a somewhat subtle argument, and you can think about it as follows. Temporal difference learning exploits the Markov property, which can help learning in fully observable environments. What I mean is that the assumption made when we built that empirical model was that if you are in state B, it does not matter that you were in state A before: we can estimate the value of state B separately, and it tells us everything that will happen from state B onwards. That is essentially assuming that state B is Markovian. Monte Carlo, by contrast, does not make that assumption: it says that whenever we were in state A, it turned out that our second reward was zero, and these facts could be related. It could be that whenever we are in state A, we already know that all future rewards are going to be zero, and that this remains true when we reach state B even though we cannot see it: a latent, hidden variable, which would mean the world is not fully observable. If that were true of the problem we are in, the Monte Carlo estimate would perhaps be the better one, and indeed learning with Monte Carlo can help in partially observable environments, because it relies less on the assumption that the states are sufficient for constructing your value estimates. So, as mentioned before, you can in some sense view this two-state batch-learning example as illustrating the difference in how these algorithms deal with fully observed versus partially observed environments.
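To check the two answers numerically, here is a small sketch for this A/B batch; the episode encoding is mine. Averaging the observed returns gives the Monte Carlo answer, and solving the empirical model gives the TD answer:

```python
import numpy as np

# Eight undiscounted episodes as lists of (state, reward) steps.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Monte Carlo answer: average observed return from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    g = 0.0
    for s, r in reversed(ep):
        g += r
        returns[s].append(g)
mc = {s: np.mean(v) for s, v in returns.items()}

# TD / empirical-model answer: in the data, A -> B with reward 0 every time,
# and B terminates with expected reward 6/8, so V(B) = 0.75 and V(A) = 0 + V(B).
v_b = sum(r for ep in episodes for s, r in ep if s == "B") / sum(
    1 for ep in episodes for s, _ in ep if s == "B")
td = {"B": v_b, "A": 0.0 + v_b}

print(mc)   # {'A': 0.0, 'B': 0.75}
print(td)   # {'A': 0.75, 'B': 0.75}
```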
Important to note is that with finite data, and also with function approximation, the solutions might differ between temporal difference learning and Monte Carlo even in the limit. These are two different statements: we've just seen it for finite data, and it's also true for function approximation. I haven't actually shown you that, but we'll get back to it in later lectures.

Now of course a natural question is: can we get the best of both? First I'm going to show you a unified view, where we put dynamic programming in the top left. The reason it's there is that the left corresponds to shallow updates, where we just look one step into the future, while top versus bottom distinguishes breadth: in the top we look at mechanisms that consider the full breadth of possible next states, and of course in order to do that you need access to the model. That means, for instance, that if we look both full breadth and full depth, this gives us exhaustive search. I'm just showing that to complement the other approaches; I'm not saying it's an algorithm you should be using, since it's computationally quite expensive and you also need a model, but it clearly fits in the figure in terms of comparing the breadth and depth of these algorithms. If we go down, we get to the algorithms that only look at one trajectory, which can be used when we can only sample, when we only deal with interaction. We see temporal difference learning in the bottom left and Monte Carlo in the bottom right, where we can think of temporal difference learning as having a breadth of one but also a depth of only one: we take just one step in the world and use it to update a value estimate, whereas Monte Carlo makes a very deep update, rolling forward all the way to the end of the episode and using that full trajectory to update its value estimates.

Now, as discussed, temporal difference learning uses value estimates which might be inaccurate, and in addition, which we haven't talked about so much yet, the information can propagate quite slowly. I'll show you an example of this in a moment. This means that if we see a reward that is quite useful, a surprising reward, temporal difference learning will by its nature only update the value of the state immediately in front of it. If in that episode you never reach that state again, all of the other state values don't learn about this reward, whereas Monte Carlo would eventually update all of the previous states visited in that episode to learn about this new reward. So temporal difference learning has a problem in the sense that information can propagate backwards quite slowly, and therefore credit assignment can be quite slow. Monte Carlo learning does propagate information faster, as I just said: if you do see a surprising reward, Monte Carlo learning will, at the end of the episode, tell all the previous states about it, but of course the updates are noisier, and it has all the other properties we talked about before.

Now we can actually go in between these, and one way to do that is as follows: instead of looking exactly one step ahead, as temporal difference learning does, we could consider looking two steps ahead, or three steps, or generically n steps ahead. Then we could consider Monte Carlo learning to be at the other extreme, where it looks infinitely far into the future, up until the end of the episode.
Written in equations, you could express this as follows, where we introduce new notation: a superscript between brackets, so G with (1) as a superscript is now a one-step return, which takes exactly one step in the world, R_{t+1}, and then bootstraps on our current value estimate of the state at time t+1. A one-step return corresponds exactly to temporal difference learning as we defined it before. An infinite-step return, shown at the bottom, would correspond to Monte Carlo learning, because if you take infinitely many steps you will always reach the end of the episode before you choose to bootstrap. In between, we could consider for instance a two-step approach, which takes not just one reward but two rewards into account and then bootstraps on the value of the state at time step t+2. In general we can define the n-step return; I often write multi-step returns in the title of the slide, but colloquially people refer to these as n-step returns. It is defined by simply taking n rewards into account, appropriately discounted. Please note that the last reward, the reward at t+n, is discounted only n-1 times, which is consistent with the one-step approach, but the value estimate is then discounted n times. We can then just use this return in our updates.

This does mean we have to wait a little before we can actually execute the update, and we do have to store some states along the way, but only as many as we want: if we use a 10-step return, we have to wait 10 steps before we can update a state value, and we have to store the ten states along the way in something of a buffer. So it has intermediate properties that are, both statistically and computationally, somewhere between temporal difference learning on one extreme and Monte Carlo learning on the other.

Now I'm going to show you some examples to make that a bit more concrete. First we'll look at an example that illustrates the property that TD doesn't propagate information far backwards, and for that we'll use Sarsa. I remind you that Sarsa is simply temporal difference learning for state-action values. If we look on the left we see a specific path that was taken: we started apparently over here, then we went right, right, up, right, up, and so on, and in the end we reached the goal. If we then do one-step TD, in this case one-step Sarsa, we would only update the state-action value for the action that led immediately to the goal. This is of course assuming that all of the other intermediate steps carry no information: maybe the rewards are zero, maybe your current value estimates are also zero, and you only get a plus one when you reach the goal. If we instead consider a 10-step update, it would update all of these state-action values along the way, appropriately discounted, so it would propagate the information much further back. In the next episode this could be beneficial, because it could be that you start here again, and if you only did one-step TD, or Sarsa in this case, you might just meander around without learning anything new for a long time, whereas if you had done a 10-step return, you are more likely to quickly bump into one of these updated values, and then information can start propagating backwards to the beginning, where you start.
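As a concrete reference, here is a small helper (my own code; the function and argument names are not from the lecture) that computes the n-step return from a stored trajectory, truncating at the end of the episode, in which case it simply becomes the Monte Carlo return.

```python
# n-step return:
#   G_t^(n) = R_{t+1} + gamma R_{t+2} + ... + gamma^(n-1) R_{t+n} + gamma^n v(S_{t+n}),
# with no bootstrap term if the episode ends within n steps.

def n_step_return(rewards, values, t, n, gamma):
    """rewards[k] is R_{k+1}; values[k] is the current estimate v(S_k)."""
    T = len(rewards)                  # episode length
    g, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        g += discount * rewards[k]    # the last reward is discounted n-1 times
        discount *= gamma
    if t + n < T:                     # bootstrap only if we did not reach the end
        g += discount * values[t + n]
    return g
```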
We can apply this to a random walk to get a bit more intuition. We see the same random walk that we talked about before at the top, but now let's give it 19 states rather than five; otherwise it's the same: there's a starting state in the middle, a plus one reward on one end, and a zero on the other end. We can then apply these n-step algorithms and see how they fare. What you see on the slide is something called a parameter plot, because on the x-axis we have a parameter, in this case the step size, and on the y-axis we see the root mean squared error over the first 10 episodes. So we're not looking at infinite experience here; we're looking at a finite amount of experience, and then we see how these algorithms fare across all of these different step sizes.

For instance, let's look at n = 1, which I remind you corresponds exactly to the normal temporal difference learning algorithm we discussed before. We see that the best performance, the lowest error, if we only care about the first 10 episodes and have to pick a constant step size, is for a step size of maybe around 0.8; if you set it higher or lower, the error is a little worse. This has been averaged over many repetitions, which is why the curves look so smooth; it's a fairly small problem, so you can run it often enough to get very precise insights into what's happening. What we notice is that for n = 2, a two-step approach, a really high step size of one actually does a little worse, because the variance is in some sense higher, but you can tune the step size, in this case to around 0.6, to get lower errors than are possible with the one-step approach over the first 10 episodes. Taking n = 4 is maybe even a little better, but notice again that the preferred step size is again a little lower. This is because as we increase n, the variance goes up more and more, which implies we need a lower and lower step size to make sure the updates don't have too high a variance, which would make our error higher.

If we go all the way up, we see a line marked with the number 512: that is a 512-step temporal difference learning algorithm, which in this case is essentially Monte Carlo, because the probability of an episode being more than 512 steps long is very small. We can also see that from the fact that the 256-step temporal difference learning algorithm is quite similar: both are already very close to the full Monte Carlo algorithm. For these algorithms we see two things: first, they prefer much smaller step sizes than one-step temporal difference learning, and in addition, even if you tune the constant step size very well, the error will still be quite high, and if you set the step size too high, the error will be much, much higher; those curves go off the top of the plot. So we clearly see a trade-off here, essentially a bias-variance trade-off, where an intermediate value of n helps us learn faster and get lower error over the first 10 episodes for a well-tuned step size. The best values are not the extremes: it's not n = 1 and it's not n = infinity, so it's neither one-step TD nor Monte Carlo, but some intermediate n-step TD algorithm.
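For intuition, here is a rough reconstruction (my own code; the number of runs, the value initialisation, and other details are assumptions, so the numbers will not exactly match the plot) of one point of such a parameter study: tabular n-step TD prediction on a 19-state random walk with a +1 reward on the right terminal and 0 on the left, measuring the root mean squared error over the first 10 episodes.

```python
import random

N, GAMMA = 19, 1.0
TRUE_V = [(i + 1) / (N + 1) for i in range(N)]   # chance of terminating on the right

def episode():
    """One random walk from the centre; returns visited states and rewards."""
    s = N // 2
    states, rewards = [s], []
    while True:
        s += random.choice((-1, 1))
        if s < 0:                      # fell off the left end
            rewards.append(0.0); return states, rewards
        if s >= N:                     # fell off the right end
            rewards.append(1.0); return states, rewards
        rewards.append(0.0); states.append(s)

def rms_error(n, alpha, episodes=10):
    v = [0.5] * N                      # arbitrary initial values (an assumption)
    err = 0.0
    for _ in range(episodes):
        states, rewards = episode()
        T = len(rewards)
        for t in range(T):             # n-step update for every visited state
            g, disc = 0.0, 1.0
            for k in range(t, min(t + n, T)):
                g += disc * rewards[k]
                disc *= GAMMA
            if t + n < T:              # bootstrap unless the episode has ended
                g += disc * v[states[t + n]]
            v[states[t]] += alpha * (g - v[states[t]])
        err += (sum((v[i] - TRUE_V[i]) ** 2 for i in range(N)) / N) ** 0.5
    return err / episodes

# One (n, step size) point of the parameter plot, averaged over independent runs:
runs = 200
print(sum(rms_error(n=4, alpha=0.4) for _ in range(runs)) / runs)
```

Sweeping n and alpha over a grid and plotting the averaged error against alpha, one curve per n, reproduces the shape of the kind of parameter plot described here.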
Okay, make sure you understood that last bit, because now we're going to move on to a closely related topic: mixed multi-step returns. We've just talked about n-step returns, so as I said, make sure you understood that part before continuing. These n-step returns bootstrap after n steps on a state value, the value of the state at time t+n. One way to write these returns down is almost recursively. They don't quite recurse, but you can view them as doing the following: if you have an n-step return, you take one step, R_{t+1}, and then you add an (n-1)-step return. So every step you lose one step, until you only have one step left, and the one-step return then bootstraps on your state value. Why am I writing it like this? Because you can then look at these different cases and basically say: on some steps we fully continue with the sampled trajectory, whereas on other steps we fully stop; we say, okay, that's enough, bootstrap here.

Now I'm going to argue that you could do that, but you don't have to. Instead, there's something different you could do, which is to bootstrap a little bit. For instance, you could have a parameter which we're going to call lambda. We take one step as always, and then we take the discount factor, but instead of either continuing or bootstrapping fully, we bootstrap a little bit: a linear interpolation between our estimated state value and the rest of what we call the lambda return. This is defined recursively: the lambda return at the next time step, for the same lambda, will also take one step and then again bootstrap a little bit before continuing, and then again take one step and bootstrap a little bit before continuing even further. It turns out that if you do the math, this is exactly equivalent to a weighted sum of n-step returns, in fact of all of the n-step returns from one to infinity, and the weights sum to one, so it's a proper weighted average.

We can see this by stepping through a few examples. If n = 1, the lambda factor in the weight is just one, so the one-step return is weighted exactly by 1 - lambda. The two-step return is then weighted by lambda times (1 - lambda). Typically lambda is a number between zero and one, and typically it's closer to one than to zero, but if we set lambda to, say, a half for simplicity, that would mean our lambda return takes the one-step return with weight one half, the two-step return with weight one quarter, the three-step return with weight one eighth, and so on.

Then we can consider the special cases. If lambda is zero, the recursive term completely disappears and the bootstrap term gets weight one, which means lambda = 0 corresponds exactly to the standard TD algorithm. If lambda is one, the recursion is full: we just have one reward plus the discounted next lambda return, which in turn takes the full next reward, and so on; that is exactly the same as Monte Carlo. So we have the same extremes as before with the n-step returns: lambda = 0 corresponds to one-step temporal difference learning, and lambda = 1 corresponds to full Monte Carlo.
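Here is a small sketch (my own code) of the recursive definition just described: on every step we take one reward and then interpolate, with weight lambda, between bootstrapping on the next value and continuing with the next lambda return. Setting lambda to 0 or 1 recovers the one-step TD target and the Monte Carlo return respectively.

```python
# Recursive lambda-return, computed backwards from the end of the episode:
#   G_t^lambda = R_{t+1} + gamma * ((1 - lam) * v(S_{t+1}) + lam * G_{t+1}^lambda)

def lambda_returns(rewards, values, lam, gamma):
    """rewards[t] is R_{t+1}; values[t] is v(S_t); returns G_t^lambda for every t."""
    T = len(rewards)
    g = [0.0] * T
    g[T - 1] = rewards[T - 1]          # last step: nothing left to bootstrap on
    for t in range(T - 2, -1, -1):
        g[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g[t + 1])
    return g

# lam = 0 reproduces the one-step TD targets, lam = 1 the Monte Carlo returns:
rewards, values = [0.0, 0.0, 1.0], [0.1, 0.2, 0.4]
print(lambda_returns(rewards, values, lam=0.0, gamma=1.0))  # [0.2, 0.4, 1.0]
print(lambda_returns(rewards, values, lam=1.0, gamma=1.0))  # [1.0, 1.0, 1.0]
```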
This is why on a previous slide you saw TD(0) pop up: that was referring exactly to this algorithm, where there is a more general algorithm called TD(lambda), and if you set lambda to zero you get your one-step TD algorithm back. We can compare these to the n-step approaches, and here we plot them side by side for that same multi-step random walk, and we see some commonalities. First let me draw your attention to lambda = 0, which is exactly the same curve as for n = 1; this is true by definition, because both are the one-step TD algorithm. Similarly, for lambda = 1, I promised you that this was exactly Monte Carlo, and indeed it is very similar to the curves for the 512-step and 256-step TD algorithms. In between, the curves look slightly different: you can see that this curve, for instance, bends slightly differently than that one, and they also don't correspond exactly to each other, because the n-step approaches always choose exactly one time step at which to bootstrap, whereas the lambda approaches bootstrap a little bit on multiple time steps. But you can see that the curves do look somewhat similar, and there's a rule of thumb you can use to think about how they relate. Maybe you find the n-step approaches a little more intuitive, because you can reason about two steps but find it harder to reason about a value of lambda. One way to think about it is that there is a rough correspondence where 1 / (1 - lambda) is roughly the horizon of the lambda return. For instance, if we set lambda to 0.9, then 1 - lambda is 0.1, and 1 divided by 0.1 is simply 10.
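One way to make that rule of thumb a little more precise (my own framing, not from the slides) is to treat the weight on the n-step return, (1 - lambda) lambda^(n-1), as a probability and compute the average n:

$$\sum_{n=1}^{\infty} n\,(1-\lambda)\,\lambda^{\,n-1} \;=\; \frac{1-\lambda}{(1-\lambda)^{2}} \;=\; \frac{1}{1-\lambda},$$

so lambda = 0.9 bootstraps on average about 10 steps into the future, lambda = 0.8 about 5 steps, and lambda = 0 exactly 1 step.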
So we could say that lambda = 0.9 roughly corresponds to an n-step method with n = 10, and we can see that correspondence in the plot: lambda = 0.8, which would correspond to a 5-step method, is indeed quite similar to the 4-step method over here, and lambda = 0.9, which corresponds to roughly 10 steps, is indeed quite similar to the 8-step temporal difference learning method here. They're even colored the same way, which maybe makes the correspondence a little clearer. They are slightly different: you could look, for instance, at lambda = 0.4, which is presumably quite similar to lambda = 0.5 and would correspond to the two-step approach, and especially for higher learning rates they do look a little different. But there is a rough correspondence, and this is one way to relate these algorithms to each other.

Now we're going to talk about some benefits. We've already alluded to this: multi-step returns combine benefits from both temporal difference learning and Monte Carlo. The reason to consider them is that bootstrapping can have issues with bias, so one-step TD is not always great, and Monte Carlo can have issues with variance. Typically, intermediate values are somewhat good because they trade off bias and variance in an appropriate way. In addition, there is the information propagation argument I mentioned, which has a similar flavor. By intermediate values I mean that something like n = 5 or n = 10, or a lambda of 0.9, somehow always seems to be a good value; these intermediate values tend to work quite well in practice.

Okay, now make sure that you understood everything I was talking about before continuing to the next section. I also want to draw your attention to the fact that the next part is more advanced, and you don't need it to continue with the rest of the lectures, so you could pause here and return to it later if you wish. Or you could just continue now, of course, because it's quite related to what we were discussing, but I wanted to point out that it is, in some sense, a little orthogonal to some of the things we'll be discussing later.

Okay, having said that, let's continue. We talked a little before about the dependence on the temporal span of the predictions, and maybe you've already realized this: these multi-step approaches, and especially the mixed multi-step approaches we just talked about, are actually not independent of span. This means that, similar to Monte Carlo, you have to wait all the way until the end of the episode before you know your lambda return. The lambda return trades off statistical properties of the updates, but it still has the computational property that you can only construct the full lambda return when you've reached the end of the episode. That doesn't seem very desirable, and indeed you might also sometimes want to do Monte Carlo learning, where it is also undesirable. Conversely, temporal difference learning can update immediately and is independent of the span of the predictions. Before, maybe you took this to be an argument that we should be using temporal difference learning, but here I'm going to make a different argument and ask: can we get the best of both worlds? Can we use these mixed multi-step returns, or other flavors of them, maybe even including Monte Carlo, but with an algorithm that doesn't have computational
requirements that grow indefinitely during the episode? It turns out, of course, that I wouldn't be asking this question if it weren't the case, so I'm going to explain how that works.

For concreteness, let's recall linear function approximation, where the value function is a linear function: the value is defined as an inner product of a weight vector w with some features x. For Monte Carlo and temporal difference learning, the update to the weights can then be written as follows: for Monte Carlo learning it's a step size times (your return minus your current estimate) times the features, and for temporal difference learning it's your step size times your temporal difference error times the features. Let's talk about Monte Carlo first. For Monte Carlo we can update all states in an episode at once; this is typical, because you have to wait for the end of the episode anyway to know the return for every state, so it's quite common to then update all of them in one big batch. Here we use t from 0 to T - 1 to enumerate all the time steps in this specific episode, so we restart the time count from zero in every episode. Also note the tabular case as a special case, where the x vector is just a one-hot vector.

Now I'm going to argue that we can look at these individual updates, where we have an update on every time step and we just sum over time steps, and that you can instead define an update as follows. I'm going to prove this in a moment, but first I'll just give you the update to build some intuition. It is a very simple update that takes a step size alpha times your one-step TD error (not your Monte Carlo return, the one-step TD error) times some vector e. This e is called an eligibility trace, and it is defined recursively by taking the previous trace, decaying it according to gamma and lambda, and then adding the current feature vector, or more generally your gradient, x_t. Note the special case: if lambda is zero, we do indeed get one-step TD back. There is a choice here whether you want to do online TD, which means you update during the episode, or, for conceptual simplicity, offline TD, where you still accumulate the weight updates until the end of the episode; either way, this algorithm is just TD(0) when lambda is zero.

The intuition is that we're storing a vector that holds the eligibilities of all past states for the current temporal difference error, and we then add that to our update to the weights. When lambda is not zero, we have a trace that reaches into the past, similar to momentum, although it serves a slightly different purpose here. Because the trace is defined recursively, e_{t-1} holds x_{t-1}, the features from the previous state, properly decayed, and also x_{t-2}, and so on, each decayed even more. This is kind of magical; I haven't yet shown you why or how it works, but it means that, if it works, we can update all of the past states to account for the new temporal difference error with a single update. We don't need to recompute their values, and we don't need to store them, so this would be an algorithm that is independent of the temporal span of the predictions: we don't need to store all of the individual features x_t, x_{t-1}, x_{t-2}, and so on; instead we merge them together by adding them into this one vector, and we only have to store that vector.
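Here is a minimal sketch (my own code, assuming numpy arrays for the features) of this update for linear values v(s) = w . x(s): every step we compute the one-step TD error, decay the trace by gamma times lambda, add the current feature vector, and update the weights once. With one-hot features this is the tabular case, and lambda = 0 reduces it to one-step TD.

```python
import numpy as np

def td_lambda_episode(features, rewards, w, alpha, gamma, lam):
    """Online TD(lambda) with an accumulating trace.
    features[t] is x(S_t) as a numpy array; rewards[t] is R_{t+1}."""
    e = np.zeros_like(w)                                   # eligibility trace
    for t in range(len(rewards)):
        v_t  = w @ features[t]
        v_t1 = w @ features[t + 1] if t + 1 < len(features) else 0.0  # 0 at termination
        delta = rewards[t] + gamma * v_t1 - v_t            # one-step TD error
        e = gamma * lam * e + features[t]                  # decay, then add current features
        w = w + alpha * delta * e                          # update all past states at once
    return w

# Tabular special case: one-hot features for three states, reward only at the end.
X = [np.eye(3)[i] for i in (0, 1, 2)]
w = td_lambda_episode(X, [0.0, 0.0, 1.0], np.zeros(3), alpha=0.5, gamma=1.0, lam=0.9)
print(w)   # all three visited states are updated, decayed with distance from the reward
```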
This idea does extend to function approximation; this is already the linear function approximation case, but in general x_t does not have to be one-hot: it can also be the gradient of your value function in the nonlinear case. How does that look? Here we're going to show it somewhat intuitively again. There is a Markov decision process in which we only ever step down or to the right; in the end we reach some goal state from which we get a reward of plus one, and then we transition all the way back to the top. The trajectory here will be somewhat random. There is discounting, so the true values are larger the closer we are to that goal. If we then imagine doing only one episode, TD(0) would only update the last state, the state in which we saw that plus-one reward. TD(lambda) would instead update all of the state values along the way, less and less so depending on lambda. This is true if you use the version we had before, with the lambda return, but it turns out the exact same thing is also true if you use these eligibility traces. The way to think about that is that at some point we see this temporal difference error, where we finally see this reward of plus one that we've never seen before, and then we multiply that error into all of the previous features that we've seen. In this case it's tabular, so all of the features are one-hot; we're basically storing all of the visited states in this vector and updating them in one go.

Now I'm going to show you how and why that works. We're going to consider Monte Carlo first, and later I'll put the lambda parameter back in. We take the Monte Carlo error and we're first going to rewrite it as a sum of temporal difference errors. To start, let's write it out one step: we get our first reward, R_{t+1}, plus our discounted return from the next step, and we carry along the minus v(S_t) part from the left-hand side. This already looks similar to a temporal difference error, but it isn't one yet: for a temporal difference error we should bootstrap. So what we can do is add that in: we add gamma times the value of the next state and we also subtract it, so we're effectively adding zero, but this allows us to write the whole thing as a temporal difference error plus a different term, and notice that this term is exactly the same kind of term as before, just at the next time step, with t+1 rather than t. So we can write the Monte Carlo error as an immediate temporal difference error plus a discounted Monte Carlo error at the next time step. Of course we can repeat this: the Monte Carlo error at the next time step is just the TD error at the next time step plus the discounted Monte Carlo error at the time step after that. Repeating this over and over, we find that we can write the total Monte Carlo error as a discounted sum of one-step temporal difference errors. I'm going to put this aside for now and use it on the next slide.

Now we go back to the total update: at the end of the episode, we update all of our weights according to all of these Monte Carlo returns, which we were only able to construct at the end of the episode.
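Written out in symbols (my own transcription of the argument, with delta_t denoting the one-step TD error), the telescoping looks like this, assuming the value estimates are held fixed during the episode and the episode ends at time T:

$$
\begin{aligned}
G_t - v(S_t) &= R_{t+1} + \gamma G_{t+1} - v(S_t)\\
&= \underbrace{R_{t+1} + \gamma v(S_{t+1}) - v(S_t)}_{\delta_t} \;+\; \gamma\bigl(G_{t+1} - v(S_{t+1})\bigr)\\
&= \delta_t + \gamma\,\delta_{t+1} + \gamma^{2}\bigl(G_{t+2} - v(S_{t+2})\bigr)
\;=\; \dots \;=\; \sum_{k=t}^{T-1}\gamma^{\,k-t}\,\delta_k .
\end{aligned}
$$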
Now I'm going to plug in the identity we derived on the previous slide, replacing the Monte Carlo error with the summation over time of temporal difference errors. Note that this summation only starts at time step t, which is the time step at which the Monte Carlo return also starts. Next we use a generic property of double sums. We have an outer sum that runs from t = 0 to T - 1, and an inner sum that starts at this variable t and continues to the same endpoint. It turns out that if you have a double summation like this, where the starting point of the inner index depends on the outer index, you can turn them around: you first sum over all of what used to be the inner index k, and then sum over what used to be the outer index t, but only up to k. So instead of starting k at t and going up, we take all of the k, but then only take t up to k. This is exactly equivalent; it's a generic property of such double sums, and I'll show it to you intuitively in a moment as well. Then we notice that the temporal difference error delta_k does not depend on the variable of the inner sum, it doesn't depend on t, so we can pull it out, and the remaining term, the summation from t = 0 to k, is a discounted sum of feature vectors. We give that a name: we just call it e, and we note that e only depends on time step k and before. This means we can write the whole thing down as follows, where we now swap k for t, because we only have one variable left to sum over, so we can simply rename k to t.

That means the original summation can be rewritten in this form. Originally, for every state we had an update using a return that looks into the future; this is why we had to wait all the way until the end of the episode, because we have to wait for G_t to be completely defined. We have swapped this around into a summation where we have the same step size, the temporal difference error delta_t, and then this trace vector, which instead looks into the past: it stores all of the feature vectors that we've seen in previous states. And we just went through this mechanically; we haven't taken any magical steps. You can follow it step by step and see that it is indeed true; coming up with the derivation may be a different matter, but you can verify it step by step. So our total update can be written in this form, where the eligibility trace vector is defined as the summation up to time step t of our discounted feature vectors. We can inspect this a little further and, for instance, pull off the last term of the sum, which is just x_t; note that when j is equal to t, the discounting goes away, because it's just the discount to the power 0, which is 1.
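Putting the pieces together (again my own transcription of the slide's argument), the whole-episode Monte Carlo update becomes:

$$
\Delta \mathbf{w}
= \alpha \sum_{t=0}^{T-1}\bigl(G_t - v(S_t)\bigr)\,\mathbf{x}_t
= \alpha \sum_{t=0}^{T-1}\sum_{k=t}^{T-1}\gamma^{\,k-t}\,\delta_k\,\mathbf{x}_t
= \alpha \sum_{k=0}^{T-1}\delta_k \underbrace{\sum_{j=0}^{k}\gamma^{\,k-j}\,\mathbf{x}_j}_{=\,\mathbf{e}_k}
= \alpha \sum_{t=0}^{T-1}\delta_t\,\mathbf{e}_t ,
$$

where swapping the two sums is allowed because both orderings simply enumerate every pair (t, k) with t ≤ k, and the trace e_k only depends on features observed up to time k.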
So we get this term that we pull off, and then the summation only runs to t - 1 rather than to t. We can also pull out exactly one discount factor, so the exponent now has a minus one in it, and then we notice that e_t was defined as a summation of terms going up to t, with t inside, and now we have something that only sums up to t - 1, with t - 1 inside, but otherwise it's exactly the same. So this thing must, by definition, be e_{t-1}, which means we can write the summation recursively: a discounted previous trace e_{t-1} plus the current features x_t. And e_t is what we call an eligibility trace. The intuition is that on every step it decays according to the discount, and then the current feature vector is added. Because we're doing full Monte Carlo here, the discount is the appropriate decay for propagating information backwards: you shouldn't propagate the information more than it is used in the Monte Carlo returns of earlier states, and that has to take the discounting into account.

Summarising: we have this immediate update for every time step, which is completely well defined at time step t, because all the required quantities are available at time step t. The Monte Carlo update would then sum all of these deltas over all of the time steps in the episode and apply them; this is the offline algorithm. So in this case, even though we can compute these things along the way, we can now compute the sum incrementally as we go, so we don't have growing memory requirements, but in this specific algorithm we are still applying the updates at the end. Of course, you might already be thinking: that feels unnecessary, can't we just apply them during the episode? And yes, indeed you can. This is an interesting extension: you can do something that would be equivalent to Monte Carlo if you waited all the way until the end of the episode before applying these differences, but you can already start learning during the episode. So we haven't just reached an algorithm that is independent of span, which is a computational property; we've actually also arrived at an algorithm that is able to update its predictions during long episodes, even though we started with the Monte Carlo algorithm. The intuition of this update is that the same temporal difference error shows up in multiple Monte Carlo errors, in fact in all of the Monte Carlo errors for states that happened before this time step, and what we do is basically group all of these states together and apply the error in one update. You can't always do that, but you can do it here.

Now I'm going to look at that double summation a little more and show you why it works. For that, we're going to concretely consider an episode with four steps. It says delta v on the slide; it should have said delta w. We've noted before that delta w for an episode is just the summation of all the Monte Carlo errors multiplied with their feature vectors, and we've also noted that we can write each of these as a summation of appropriately discounted temporal difference errors multiplied with that same feature vector, because we are just expanding the Monte Carlo error. Then, essentially, all that we're doing is, instead of summing across the rows, summing across the columns, which looks like this, where we notice that within each column the temporal difference error is always the same.
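For the four-step case, the rows below are the four Monte Carlo errors expanded into TD errors, and the column grouping is what the eligibility trace stores (my own rendering of the table the lecture describes):

$$
\begin{aligned}
\Delta \mathbf{w}/\alpha
&= (\delta_0 + \gamma\delta_1 + \gamma^{2}\delta_2 + \gamma^{3}\delta_3)\,\mathbf{x}_0
 + (\delta_1 + \gamma\delta_2 + \gamma^{2}\delta_3)\,\mathbf{x}_1
 + (\delta_2 + \gamma\delta_3)\,\mathbf{x}_2
 + \delta_3\,\mathbf{x}_3 \\
&= \delta_0\,\mathbf{x}_0
 + \delta_1\,(\gamma\mathbf{x}_0 + \mathbf{x}_1)
 + \delta_2\,(\gamma^{2}\mathbf{x}_0 + \gamma\mathbf{x}_1 + \mathbf{x}_2)
 + \delta_3\,(\gamma^{3}\mathbf{x}_0 + \gamma^{2}\mathbf{x}_1 + \gamma\mathbf{x}_2 + \mathbf{x}_3) \\
&= \delta_0\,\mathbf{e}_0 + \delta_1\,\mathbf{e}_1 + \delta_2\,\mathbf{e}_2 + \delta_3\,\mathbf{e}_3 .
\end{aligned}
$$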
So instead we merge all of these appropriately discounted feature vectors into a single vector, called the eligibility trace. Feel free to step through that yourself, of course much more slowly than I'm doing right now.

We can then handle mixed multi-step returns as well, by putting the lambda parameter back in. How that works is: we had our mixed multi-step return, the lambda return G^lambda, and it turns out that if you write it as a sum of temporal difference errors, which you can go through yourself in a similar way to the Monte Carlo returns, you get a summation which no longer just contains the discount factor but instead the quantity lambda times gamma; otherwise it's exactly the same. So we can write these lambda returns as a summation of one-step temporal difference errors as well, but with gamma times lambda where we had gamma before. That means that if we go through all of the steps we did for the Monte Carlo case again, we get exactly the same algorithm, except that the eligibility trace now has a trace decay of gamma times lambda rather than just gamma. Feel free to go through all of that yourself to convince yourself that this is actually the case. That means we can implement this algorithm, and maybe also apply the updates online as we go. This is called an accumulating trace, because every time step we add the features x to it. It turns out there are a couple of other traces in the literature as well, which I won't cover now, but they are discussed in the Sutton and Barto book. The equivalence here between the Monte Carlo updates and doing this with traces is only exact if we update offline, meaning we wait: we can already add the weight updates together as we go, but we don't actually change the weights until the end. There are, however, also trace algorithms that are exactly equivalent, in some sense, for online updating; those traces look slightly different, and as I mentioned I won't go into them, but I wanted to make you aware that they exist.

Okay, that brings us to the end of this lecture. In the next lecture we will talk about applying these ideas to control, to optimize our policy. Thank you for paying attention.