Reinforcement Learning 4 - Model-Free Prediction and Control

**Efficient Algorithms in Reinforcement Learning**

One of the key challenges in reinforcement learning is the high variance of the sampled returns used as update targets for value estimates, which can make it slow and difficult to learn an optimal policy. To address this issue, researchers have developed algorithms that reduce the variance of these update targets.

One such algorithm is Expected Sarsa, a variant of the popular Sarsa algorithm. In traditional Sarsa, the action value function Q(s, a) is updated using the following rule:

Q(s, a) ← Q(s, a) + α(r + γQ(s', a') - Q(s, a))

where r is the reward received after taking action a in state s, γ is the discount factor, s' is the next state, and a' is the action actually selected in s' by the policy being followed.
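
To make the update concrete, here is a minimal tabular sketch in Python. The table sizes, step size, discount factor, and the epsilon-greedy helper are illustrative assumptions for the example, not details taken from the text.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular Sarsa update: bootstrap on the action actually taken in s'."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

# Tiny usage example with made-up numbers (5 states, 2 actions).
rng = np.random.default_rng(0)
Q = np.zeros((5, 2))
s, a, r, s_next = 0, 1, 1.0, 2            # one observed transition
a_next = epsilon_greedy(Q, s_next, 0.1, rng)
Q = sarsa_update(Q, s, a, r, s_next, a_next)
```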

The Sarsa target is noisy, however, because it depends on the single action sampled in the next state. The problem becomes much worse in the off-policy, multi-step setting, where the sampled return has to be corrected with a product of importance-sampling ratios, one per step. This product can take very large values even when the target policy is similar to the behavior policy, resulting in large update steps that may lead to overshooting and instability.

To address this issue, Expected Sarsa takes a different approach. Instead of bootstrapping on the single action sampled in the next state, it bootstraps on the expected action value of the next state under the target policy:

Q(s, a) ← Q(s, a) + α(r + γΣ_{a'} π(a'|s')Q(s', a') - Q(s, a))

where Σ_{a'} π(a'|s')Q(s', a') is the expected action value in the next state s' under the target policy π. This reduces the variance of the update because the expectation over next actions is computed exactly rather than sampled, and because we only look one step ahead and then bootstrap instead of relying on a full multi-step sampled return.

This approach also allows for more efficient learning because it does not require the next action to be sampled from the target policy. The behavior policy only has to generate the transition; the expectation in the target is then taken under the target policy directly. This means that we can learn from every single transition, rather than needing trajectories in which the behavior policy just happens to pick the same action as the target policy on each and every step.
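
A corresponding sketch of the Expected Sarsa update, assuming the target policy is supplied as a vector of action probabilities for the next state; the epsilon-greedy target used here is just an illustrative choice.

```python
import numpy as np

def epsilon_greedy_probs(q_next, epsilon=0.1):
    """Action probabilities of an epsilon-greedy target policy (illustrative choice)."""
    n = len(q_next)
    probs = np.full(n, epsilon / n)
    probs[int(np.argmax(q_next))] += 1.0 - epsilon
    return probs

def expected_sarsa_update(Q, s, a, r, s_next, target_probs, alpha=0.1, gamma=0.99):
    """Bootstrap on the expected next-state action value under the target policy,
    sum_a' pi(a'|s') Q(s', a'), instead of on a single sampled next action."""
    expected_q = float(np.dot(target_probs, Q[s_next]))
    td_error = r + gamma * expected_q - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

# A single transition generated by any behavior policy is enough to update off-policy.
Q = np.zeros((5, 2))                       # illustrative table: 5 states, 2 actions
s, a, r, s_next = 0, 1, 1.0, 2
Q = expected_sarsa_update(Q, s, a, r, s_next, epsilon_greedy_probs(Q[s_next]))
```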

**Action Values and Importance Sampling**

Another important tool in off-policy reinforcement learning is importance sampling, which re-weights samples to correct for the mismatch between the behavior policy and the target policy, at the cost of additional variance. When it comes to action values, however, the situation is simpler: we do not need an importance-sampling correction for the next action, because we can simply re-weight the action values in the next state using the target policy's probabilities.

This works because the action value Q(s, a) already conditions on the action that was taken, so no correction is needed for the current step; weighting the next-state action values by the target policy's probabilities then gives the expected action value under the target policy. This is exactly what Expected Sarsa does, and it has been shown to be effective in reducing variance and improving convergence.

**Expected Sarsa and Q-Learning**

In fact, Expected Sarsa is closely related to Q-learning, a popular algorithm for learning optimal action values. In Q-learning, the agent updates the action value function Q(s, a) using the following rule:

Q(s, a) ← Q(s, a) + α(r + γmax_{a'} Q(s', a') - Q(s, a))

where r is the reward received after taking action a in state s, and max_{a'} Q(s', a') is the largest action value in the next state s'.

Expected Sarsa replaces this maximum with an expectation under an explicit target policy. Like Q-learning, it does not need to sample actions from the target policy, and it avoids importance-sampling corrections altogether, which is useful when the behavior policy is different from the target policy.
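
For comparison, here is a minimal sketch of the tabular Q-learning update under the same illustrative assumptions as before; the only difference from Expected Sarsa is that the bootstrap uses the maximum, i.e. the expectation under a greedy target policy.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Bootstrap on the maximum action value in the next state,
    which equals the expected value under a greedy target policy."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

Q = np.zeros((5, 2))                       # illustrative table: 5 states, 2 actions
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```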

**Generalization and Special Cases**

Expected Sarsa generalizes both Sarsa and Q-learning because the target policy in its update can be chosen freely. In particular, if the target policy is the same as the behavior policy, then Expected Sarsa evaluates that policy on-policy, just like traditional Sarsa, but with the sampled next action replaced by its expectation.

When the behavior policy is different from the target policy, Expected Sarsa can be viewed as a form of generalized Q-learning: we can use an arbitrary behavior policy to learn about the target policy and still obtain a convergent algorithm.

A particularly important special case arises with a deterministic, greedy target policy: if the target policy is greedy with respect to the current action values, the expectation collapses to the maximum and Expected Sarsa reduces exactly to Q-learning.
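
To see the special cases numerically, here is a small sketch (all numbers made up) comparing the bootstrap targets for one next state: a greedy target policy reproduces the Q-learning target, while a target equal to an epsilon-greedy behavior policy gives the on-policy Expected Sarsa target.

```python
import numpy as np

q_next = np.array([0.2, 0.5, 0.1])         # made-up action values for the next state s'

# Greedy target policy: the expectation collapses to the maximum (Q-learning target).
greedy = np.zeros_like(q_next)
greedy[int(np.argmax(q_next))] = 1.0
print(np.dot(greedy, q_next), np.max(q_next))      # both 0.5

# Target equal to an epsilon-greedy behavior policy: on-policy Expected Sarsa target.
epsilon = 0.1
behavior = np.full(3, epsilon / 3)
behavior[int(np.argmax(q_next))] += 1.0 - epsilon
print(np.dot(behavior, q_next))                    # approximately 0.477
```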

**Comparison to Other Algorithms**

Finally, it is worth noting that Expected Sarsa shares some goals with other approaches, such as Monte Carlo tree search and deep reinforcement learning, although these use different mechanisms to reduce variance and improve learning.

For example, Monte Carlo tree search also relies on sampled returns, but it reduces their variance by averaging many simulated rollouts organized in a search tree over states and actions. Deep reinforcement learning, on the other hand, uses neural networks to approximate the value function or the policy.

**Conclusion**

In conclusion, Expected Sarsa is an efficient reinforcement learning algorithm that reduces variance by bootstrapping on the expected action value under the target policy rather than on a single sampled next action. Because the target policy can be chosen freely, it generalizes both Sarsa (target policy equal to the behavior policy) and Q-learning (greedy target policy). This makes it a useful algorithm for a wide range of problems in reinforcement learning.

**References**

* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

* van Seijen, H., van Hasselt, H., Whiteson, S., & Wiering, M. (2009). A theoretical and empirical analysis of Expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL).

**Note**: This text is a simplified summary of concepts discussed in the lecture (see the transcript below); the lecture itself and Sutton & Barto (chapters 5 and 6) provide more detailed derivations and examples.

"WEBVTTKind: captionsLanguage: entoday we will be talking about mobile free prediction and control and I'll be covering quite a lot of material and I will also get back to some of this in later lectures especially when we were considering function approximation and specifically of course we'll talk about deep neural networks at some points but not yet during this lecture sorry there we go the main background material is Sutton Umberto chapters five and six and just to recap again this is just a setting this is the same slide you've seen a couple of times where we're interested in the signs of making decisions and especially in a sequential setting where we can interact with the world repeatedly and to do that we could either learn a policy of value function or a remodel and in this lecture will mostly be talking about value functions similar to the previous lecture in a later lecture we'll talk more about for instance how to learn policies directly so the last lecture was about planning with dynamic programming in this we assumed we can observe and interact with the Markov decision process but we don't really need the sample because we have access to the full model so we can just plan I say just plan of course there's ways to do that that are more or less efficient and there's lots a wide literature on this topic but it is somewhat of a limitation to have this assumption that you have access to the true model so in this lecture we'll be talking about methods that can learn without needing access to the true model and therefore can learn directly from interaction with the environment in particular we'll be covering two distinct but related cases one is model free prediction where the goal is to estimate values this is called prediction because these values as you will remember are expected returns and returns are something about the future and therefore a value has the semantics of being about the future and this is why we call this prediction will also talk about model free control but we'll limit ourselves to a value based methods for this lecture in later lectures we'll talk about methods that don't necessarily store values explicitly and in molar free control the goal is to optimize the value function and the policy again model free so in an unknown Markov decision process what will not yet cover but will come in later lectures is learning policy directly as I said also how to do with continuous Markov decision processes and deep reinforced learning these last two are related for this lecture mostly will consider the tabular case so we can see that consider basically any state value or any state action value to be stored somewhere in this area so we can just update it directly that's not a strong limitation of the methods but it's just for clarity that we'll consider that case extensively first and this is also what is done in the session umberto book but don't worry we will not lean on this too heavily and will later extend all of this to the case of arbitrary function approximation including deep neural networks ok so the idea is to move towards sampling because we want to get rid of the assumption that we have a true model and therefore we want to use experience samples to learn we call the direct sampling of complete episode trajectories Monte Carlo and Monte Carlo is a model free method we just sample we need no knowledge of use of the Markov decision process and we already saw a simple case of this when we can discuss bandits in the second lecture because there was no Markov 
decision process there or if there was it was a very simple one with only one state but we were already doing something that is very close to the other methods we'll discuss here we were already sampling we weren't assuming there was a model given we were just going to sample these rewards and we would average them and this will give us good estimates of the action values for each of the actions now there's two ways you could write it down the top one there basically says we define an action value QT which is an estimate of each action as the sample average of the rewards the notation is a bit verbose there but the indicator fun just means we sum over all time steps but we're only actually looking at the time steps on which you took specifically this action so a simpler way to think about that is just look at all the rewards you've seen for this action and you just average those and then the assumption that this is this is a good thing to do is based on the fact that this sample average will just due to the central limit theorem approach the actual average reward which is the thing that we're interested in now equivalently we can write this down as on the bottom where we basically turned it into a step size algorithm where we have a new reward and we just update our estimate a little bit towards that reward if you pick your step size specifically to be 1 over the number of times you've selected this specific action then these 2 things are exactly equivalent but this already points to maybe a more general setting where we could also think about having say a constant step size to deal with non-stationarity or maybe even towards the function approximation cases we'll see later in later lectures especially now one thing to note is that I've changed notation slightly from the bandit lecture this is just to prevent confusion that I mentioned this because conventionally in the mark of decision process literature we denote the reward coming after the action at the next time step because we consider the reward and the next state to arrive at the same time after you've taken the action so we normally say you're in a state s T you take an action a T and then you observe a reward R T plus 1 and the state s T plus 1 when we were first discussing bandits we had an action at time T and a reward at the same time T which is also the convention in the literature just so this is just to know there's like a slight offset there and both unfortunately both notations are common in the literature so it's good just to be aware of that in this case we went to the MDP formulation because we'll be talking about full sequential case and this is just a special case where there's only one state now you can already generalize this slightly just as a somewhat of a teaser where we will be going and also to show that this is as far more general methods than it might seem at first we can consider bandit problems where there's a certain context so what we're essentially doing real yet rolling in the sequential T but we are rolling in in the sense that there's a certain context in which you take your action now we will still consider at first here episode set and after exactly one step so you're in a state you take an action you get a reward but your action doesn't impact the next state in fact the next state maybe just get samples from some independent distribution and there again your goal is to pick an action and so on and we still want to estimate you expect its reward but this is now conditioned on the state and the 
action now the reason to put this here is to also show that this this is still in some sense a fairly simple algorithm in the more general case even if your state space is really big for instance it's a continuous space maybe it's I don't know pixels or some continuous sensory or motor stream that you can can observe then you could still think about doing something like this but you would maybe more naturally then minimize a loss where you basically want this function to be the same as the rewards and you just have this instantaneous loss down there and then you can consider the total loss Princes to be summed over all samples that you've ever seen and then you just want to minimize that and this is just regression essentially so one way to estimate without using a model is your sample rewards and then estimate by regression the sample average is just the simplest case of regression where we assume we can store these things exactly in a table but the idea is more general now this is like a small step into the function approximation land I will now mostly go back to the tabular case for the rest of the lecture but we'll get back to the function approximation setting in much more depth in later lectures but instead of going there instead we're going to roll back in the sequence ality that we discussed in the previous lecture when we were discussing dynamic programming so we could think of an easy algorithm in some sense to do policy evaluation so the goal here just to be clear is to do policy evaluations who are interested in estimating the value of a certain policy later on I'll talk about how to optimize but for now we're just doing the policy evaluation and one way to do that is to sample a whole trajectory with that policy and that'll get a return no know that normally I have this return trailing off into the future without committing to a certain end points but in order to get the full return I need this determinate at some point so I've basically denoted here there's a time step capital T which is in the future so it's somewhere beyond the small T and this is the time step at which this episode ends and this might be a random thing this is why it's a capital in some cases might be deterministic in the bandit case this is always exactly at the next step for instance but we need to fool returns so we're going to just sample a whole trajectory your whole episode by executing that policy and then we can do the same thing we did for the bad news we can just average and if we have a discrete MVP where there's a limited number of states that you can just observe you could again just store this in a table you could just average per state and potentially per action if you want action values rather than state values in a more general case you could get into regression this is called a Monte Carlo policy evaluation because we use Monte Carlo samples of the full trajectory of the of the return sorry for most cars amsoft return using the fool trajectories and then we use that this in this case a new policy evaluation is it clear so far again always stop me whenever now - just to give you an example that well it's more or less an example that this works in some sense we can consider a case a simple case of a blackjack setting where there's 200 states and these 200 states we're just going to represent separately in a big table what's not that big and the state is composed of how many points you have right now if you have fewer than 12 points you'll just keep on getting cards until you get 12 so 
you'll always be in a situation where you have 12 to 21 points and the goal here is to have more points than the dealer the way this works with these playing cards is that all the if you don't know blackjack all the face cards are worth 10 the ace is worth either 1 or e less and you can pick whichever is most useful to you and the number cards are all just worth their number and your get deal to these deltas cards from a deck so you can't see what's coming up and the dealer is only showing one of its cards so the dealer shows you a little bit of what it what what what it already has but not exactly so that's the other part of your state what the dealer shows and this can either be an ace or can be anything from two to ten so that gives you a hundred saves and then we separately encode basically the the information of whether you have a useable ace because that's not captured in that number from twelve to twenty one so we need one additional bit of information to tell us whether there's an ace as one of these cards any useable ace means that you could still you're using it now to represent the eleven but you could still pick it to represent one later on if you would so desire which can be useful now this is just the game which is by the way not that important if you missed any of the details or you're confused about the games not that important we're just going to use it as an example and there's just two actions you can stick which means you keep your current score and then it's the dealer's turn and then the dealer will basically deal itself cards according to some policy and then at the end you'll see whether you've won or not alternatively you could draw which means you can't get another card but the risk is you go beyond 21 at which point you lose that's denoted here at the end in the reward for draw is minus one if your sum of cars goes beyond 21 and then the episode ends and otherwise the action is considered to be worth zero and this is some excuse me and a reward here is zero because we're essentially aiming for the win we're encoding each step during the episode as being worth zero because eventually you're going to have to stick if you haven't terminated yet because you've went beyond 21 and when you stick then the dealer will do its thing and if you won so if your sum of cards is higher than the sum of the dealer card you get +1 the same applies if the dealer would go beyond 21 in that case you've also been considered to have won if you have the same number of cards in this case we'll just define it to be zero you could also define that some cases the the dealer wins the tiebreak in that case you would have a minus one there but in this case we should find it to be zero and you get a minus one if the dealer has basically more points that you do now to me it's not immediately obvious what what the values are of a certain policy but you could define a policy and then you could just check by running Monte Carlo and that's what we did or essentially this is this is from the books it's not what I did but what this shows is the whole stage space so the two flat axes are essentially whether how many points you have how many points the dealer has and then it split out into two planes depending on whether you have usable ace or not so there isn't total 200 states there in basically the column on the on the left and there's again two on the states here in the column on the right the difference is that we ran more learning episodes essentially to here on the right and you see it at at some 
point it becomes very stable because we're just averaging so by the law of large numbers the variance of the estimates goes down and at some point you get a fairly reliable estimate of the of the quality of your policy now this example is in the book so you can go through it again at leisure if you want but the pointer is just to show that you can find these solutions so that it works and that these solutions might also look there's a little bit of interesting structure here that you might not be might not immediately be obvious just by looking at a problem so that was just an example of using Monte Carlo policy evaluation but it's not the only algorithm you can use and it's not always the best algorithm I mean this is a very small problem and we already needed to generate half a million episodes to get very smooth functions so what we'll consider here is something work into the dynamic programming which discussed last lecture the top two equations here are the same as what you've seen last lecture the first one is just the bellman equation that's the definition that's the definition of the value that we want to ask the second one is policy evaluation with dynamic programming which you've seen last lecture which basically just turns this first equation into an update and as we've discussed this will converge to the same values eventually but you need a model to do that you need to be able to roll out this one step now alternatively this is just an expectation that we can sample so a naive thing to do is just to sample this thing and then to plug that in as a new estimate that's typically not such a good idea because this will be noisy there might be noise in terms of the reward there might be noise in terms of the next state and therefore it's not immediately obvious and typically it won't be the case that this is a better estimate than the previous one it might might be just more noisy than the previous one so what's better is to average this target so essentially what we're doing is something very similar to before but instead of using the Montecarlo return we're going to use this one step and then bootstrapping on an estimate value as a stand-in for the actual value there's multiple ways to interpret it one way to interpret it is as an approximation a sample based version of the dynamic programming algorithm that we have right there a different way to interpret this is to see this as a return estimate where we've taken one real step and then we just replace the rest of the samples returned with this estimate that we already had at the next state but because we use a real sample it's a little bit better in a sense than if we didn't do that one step this is called temporal difference learning and the reason essentially is because you can interpret this whole part between your brackets you can interpret as an error between your current value and the targets and it's an error in the sense of a one step temporal difference so this is called a temporal difference error and then this is called temporal difference learning so to recap in approximate dynamic programming we used one step of the model and bootstrapped I put approximate here in brackets because you could apply these same ideas to construct the targets and then to have a parametric function that updates towards the target rather than the full tabular thing that we looked at at the previous slide when I move to Carlo we could just sample the full return we could use that and the idea of TD learning is basically to combine these two 
things we both sample but we also still bootstrap which means we use an estimate at the next state value now the first target here is an approximation of the real return because we bootstrap we're not guaranteed that the value at iteration K at the previous update essentially we're not guaranteed that this is going to be the same as the expected return from that point onwards that's the only approximation we are using the real model so for the rest there's no sampling error conversely in Monte Carlo there's no approximation error due to bootstrapping we're not using any estimates this is an unbiased sample for the actual value of that policy but we do get sampling error we just get a little bit of noise in some sense now TD learning basically gets both of these we're approximating in both of these ways however turns out there is less noise than a Monte Carlo and you don't need to reliance on the model that you're doing a dynamic programming and this tends to work quite well in practice so this is called using this as a target it's called temporal difference learning and this is a very central idea in reinforcement learning and basically use all over the place so it's important important to understand now this has a couple of properties as I said it's mobile free so you don't require any knowledge of the Markov decision process and it can also learn from incomplete episodes due to the bootstrapping this is something that may have not been immediately obvious but for the Monte Carlo return we needed to sample the whole trajectory we needed the whole return but if your episodes are really really long it might take a really long time for you to learn and in some specific cases you could even have a policy right now the just loops somewhere it like a robot that goes into a corner and bangs its head against the corner because it has a poor poorly learned policy so far and if the episode never really terminates then learning would never happen right as in temporal difference learning you could update after each in every transition which means you can sometimes learn much faster yes sorry yes sorry DM I should have been probably clearer about that earlier on an episode is basically a chunk of experience where at the end of the episode you reach the terminal state we see some examples in the previous lecture but maybe I wasn't as explicit about it as for instance the bukas the book has like a whole section on episodes in the bandit case each each interaction is one episode that that's just a terminology we use it's a very good question so in the grit worlds we've seen so far typically when you walk into a wall we say that the episode continues you just transition back to the same state you were but when you go to the terminal state we say the episode ends and you get transitioned back to whatever your start state was you could also have a start state distribution doesn't always have to be the same state you could also think about in in in practice a lot of experiments say for instance with if you think about a robot arm that tries to grasp something what people tend to do in practice is they put the robot arm in some situation and then it has to execute an episode in which it tries to grasp something and then whether successful or not the experimenter brings back the arm to its initial state and then we start a new episode so there's like a clear division there between one trajectory of experience which is kind of continuous and then there's a disconnect and there's a new episode what I said here 
is actually slightly more general just it doesn't just apply to episodes but it also means within long episode temporal difference when you can already learn and you can even learn when there's no episodes when there's just a continuing system because you can update from each and every transition okay so there's now kind of an obvious question how did this diode assist Monte Carlo method compared to temporal difference learning what are the properties of these two when should you use one when should you use the other so the goal again is still policy evaluation we just want to evaluate this policy for a certain given policy PI and one way to do that is to sample the return and to update a little bit towards that and the other way is to use a one step rollout where if now you know that this thing is the TD error the temporal difference error and update worse is one step target and to get a little bit of an intuition there's an example here from the book the situation is as follows somebody is driving home from work and the initial state is that this person is leaving the office the goal here is to estimate how long does it take to get home in this case we're not actually optimizing so we don't have to put the time we don't have to say for instance the time is negated so that you if you optimize time you actually minimize the time instead we're just going to predict how many minutes it takes and each time we're just going to measure the amount of minutes already elapsed and this will be a reward between each two states and then we could look at the full reward summed over the whole episode which starts at the office and ends when this person arrives home initially there are some prediction which is at this case 30 minutes maybe this is based on some past experience so far at the beginning of the initial state there hasn't been any reward so there's no there's basically no elapsed time any elapsed time is just the difference between the actual time that it took between each two states now the next state is when this person reaches the car and this has been five minutes which may not have been surprising but when reaching a car you also notice is that it rains and when it rains typically takes a little bit longer so the prediction is updated to it's still 35 minutes now even from the car so the total time total prediction is not 40 minutes from the beginning now a little bit later things went better than expected and we exit the highway the elapsed total time now has been 20 minutes so in some sense there's been a reward along the way a 15 and now we predict it's only more 15 more minutes because the highway went a bit faster than expected say which means the total time has gone down a bit to 35 but when exiting the highway this person finds itself behind the truck which is unexpected so even though 30 minutes have already elapsed it's still 10 more minutes expected to be before arriving home and then finally it even takes slightly longer than that because after 10 minutes we're entering the home street but we're not home yet and we expected three more minutes and this turns out to be correct this is just one sample trajectory where there's been some rewards along the way which in each of these cases are basically the differences in elapsed time in that first column those are the rewards and we see this prediction along the way updated a number of times because of events that happened now we can understand the difference between Monte Carlo and TD from this example by looking at what the 
updates they would do in the Monte Carlo case on the Left there's this actual outcome of 33 minutes and we would basically update the estimates from each state that we were in towards this now it should have been 30 43 minutes in total so for this we have to take into account what you elapsed time us or essentially we're only updating the values of each of these things towards the difference that we still need to get 2/3 of 43 but each of them gets updated in exactly that way towards that number so that the initial state for instance leaving the office gets updated swartz 43 now now if instead you were doing temporal differencing you get a different picture because in each state we would see the actual reward which is the elapsed time and then we would just update towards the elapsed time plus the value in the next state we're not discounting here for simplicity which in this case means we're updating like here on the right now it might not be immediately obvious which one is better but let me give you a personal example and a goat from last year I was going to one of these lecture halls and I had been there a number of times already but I thought I could take a shortcut so I went somewhere inside a UCL building that I had never been in before got thoroughly lost usually these interesting Maisy London buildings that you are probably familiar with and at some point I basically just gave up and I found my way back to the exit which was the way I came in so I lost a little bit of time there just reversing this maze but when I exited building I knew where I was again so I knew how far it was to get to the lecture hall now if I would have been doing Monte Carlo learning that would have meant that at the point of exiting the building I would have updated my estimate for how long it takes from the building to just this one sample of how long they didn't then take me to get to the lecture hall instead if I would have been doing temporal difference learning I could have bootstraps on this very good estimate I had on exiting so even let's let's assume this didn't happen let's assume that on my way to the lecture hall I may have been interrupted let's say I bump into somebody and I I talk to them for a while if I would have been using Monte Carlo learning this would have been incorporated into the only sample that I've ever used to update my values inside the building that have only been in once which is actually not a good thing to do in this case because it's just a noisy estimate it's much better in this case to boot from the value of the well-known state that you've been in many times and this is something that temporal difference methods can do so there's pros and cons but and there's more than there then and then there are on these slides will discuss more but as I said before one of the important differences is that temporal difference learning can learn before knowing the final outcome and it can also learn without knowing the final outcome in a fully continuing case so in temporal difference learning you would update after every step which means that even if in this if in the same episode you come back to the same state this might have been updated already you might act differently which is a definitely a benefit the other caveat all the way at the bottom that I wanted to just explicitly call out here so the monte carlo and he works for episodic terminating environments which is like this is the hard constraint in some sense but there's a softer version of this that it works better if you 
have maybe shorter episodes because it just means you get more data if you have very long episodes this means that you'll have very few data points in a sense this can also be understood as there being a bias-variance tradeoff where as in the example that I gave this one roll out from a state can be quite noisy and if you want to update towards that that might be the wrong thing to do and one way to just say that formally said it has a high variance this return now the temporal difference target which may or may not have been obvious is a biased estimate it's not it's not unbiased for the actual value because we're bootstrapping but the variance is much lower because there's only one action one transition that we're depending on to get the sample whereas in the monte carlo case we have all of these transitions and we're using all of them to get our sample and we can understand that by looking at a very simple problem in this case a random walk so the start state here states see in the middle and there's essentially essentially there's two actions go left or rights but you can also understand this as a Markov reward process without any actions where you just randomly transition left or right and then we can still talk the question what is the value of each state we're going to do this tabular so we'll just have a separate value for each of these states and we're going to update these the reward structure is very simple there's a zero reward basically everywhere except if you terminate on the right hand side then you get a reward of one let's say we initialize the values to all be point five at the beginning a half which is a fairly good estimate because on average they will be because half of the time you'll exit with a zero and half of the time you'll exit with the one but for the individual states of course it's wrong except for states see where it is actually you have turns out the actual values are actually I think fro from from from the left of the right is one over the number of states and then two over the number states figure from numerous states no that can't be right because you don't get a half so it must be three over six four see the initial values here are denoted here as a straight line 0 next to the straight line means this is after zero episodes then we've done one episode randomly and this is just so happens to be the case this is episode terminated on the left and what we see is that the value as updated with TV has updated and there's now a slight hinge in the line that points down all the way at left note that the rest of the line is still flat it's not that clear from this picture because there's other lines in here but the rest of the line that corresponds to this one is still flat we haven't updated any of the other states because we're doing TD learning and it will been bootstrapping so the value of state C got a reward of 0 and then went to state B Saye but the value of state B was also 1/2 so the value of 1/2 could update it towards 1/2 still a half so it didn't update there was question ok now we can continue and continue and at some point after 10 episodes in blue we'll see that the line is starting to shift and after a hundred episodes it's pretty close to the diagonal line which denotes the true values in this case it was run with a fixed step sighs and this means we'll never actually converge to the true line if you would decay the step size as you do when you average then it would actually go to all the way to the straight line but now we can do of course 
the same thing we can run this on the same problem we could run the Montecarlo algorithm and we did there is a different picture in the in the book I decided to run this one myself as well because I wanted to show you what happens when you tune the step size and also have a wider range of step sizes for both of these methods now their shades of color here I don't know how clear they are let seem to you relatively okay they go from lights which is a high step size to dark which is a low step size and you'll notice that in both of the case is the darkest line in black tends to be near the top and the yellow line also tends to be near the top which shows that intermediate step sizes tend to be better this is in general true as long as you have a wide enough range there tends to be a u-shaped form or if you want to optimize rather than minimize it's an inverse at you for what your best step size is in this case we didn't decay the step size we just kept the fixed step size for both of these methods yes yes a good question so how do we actually calculate that point there with the one yeah that's let's step through that it more explicitly so the update here is the TD update so we have a previous value which is 1/2 and we're going to update by adding step size times the targets minus this current value the target minus the value being the TD error and the step size here will speak to be 0.1 which means that we're actually moving exactly 10% towards zero here that's how much it dropped so it must have been 0.45 here because the target in this case will be your reward of zero plus the value of the next states but the value of the terminal state is always defined to be zero because you can never get any more rewards so the target here will be exactly zero if we would have exited at the other end it would have gone 10% up towards one which is well somewhere here I guess because there you would have gotten a reward of one and still the next state would have been worth nothing in later steps will actually update towards values that are more and more different so there will be these intermediate updates as well I'm not I didn't explicitly put the algorithm on the slides so the step size here is it this one it's also described in the book but essentially in this slide we just discuss a number of different step sizes including an orange the one at the bottom they're kind of going down quite quickly that's the step size of point one used for the previous figure and there's a couple of things I want to point out from this from this graph one is that the best TV methods do better than the most best Montecarlo methods for this specific setup but for each of these it holds that the dark dark one is at the top this is the lowest step size but it's very smooth and it tends to go down quite well this is just a step size that tends to be maybe a little bit too small for how long we've run it if we would run it longer eventually it will go down and on down and it would go to very good performance but if you only had 100 episodes in this case you should have used a bigger step size to get there faster but if your step size is too big for instance look at the yellow curve in both of these cases it stabilizes at a higher error because you keep an updating and we're not decaying the step size which means that we get a reducible noise in our updates it is and one thing to note here is that the noise in the Montecarlo returns is bigger than the noise in the temporal difference returns the variance here is higher we have 
a little bit of bias in a central difference case but the bias reduces over time because our values get more and more accurate which means that some follically it'll actually reach the true values both of these do but the temporal difference case the bias is only transient and the variance is lower which means that for certain step size is like 0.1 it just does a lot better it still doesn't all the way to zero the error for any of these step sizes because we still have the irreducible noise related to the stochastic updates for it for the error to go to completely to zero we would have to decay the step size all the way to zero slowly as well just to be completely clear what's on the curve here this is the root mean squared error over all of the states so if it would be zero that means we have all of the state values estimated exactly exactly right any questions now there's other differences between Monte Carlo and temporal difference rating and this one may have this one especially may not be immediately obvious the bias variance I think is fairly intuitive if you think about it the Monte Carlo return is just noisier so there's a higher variance but it's unbiased but there's also a difference that you can look at when when you consider what happens if you only have a limited number of experiences so in this case we sampled K episodes capital K episodes each of which may have lasted a number of steps so here left to right means we're going through time in each episode and then each step each row to the to the bottom will be another episode and the idea here is that we're going to look at what these algorithms do if we only have a limited set of experience and we repeatedly show the algorithms this experience all the way until they're basically done learning whenever much they can learn from this experience and then turns out the algorithms aren't equivalent even though both of them go to the same place if you would have infinite experience and you appropriately DKR step size if you have a finite amount of experience you learn all that you can using these specific methods they don't find the same answer and to understand that we'll look at a small example in which we have eight episodes and these episodes they're all only ever got the only only ever are in two different states States a or there's no discounting and turns out we only ever were in state a once the first episode was we were in state a we saw a reward of zero we went to state B then we saw a reward of zero and their net worth determination second episode is we started off in state B apparently there was a reward of 1 and it terminates and then there's a lot of similar episodes where sometimes the reward is one sometimes it's zero now the question is what's the value of a what's the value of B does anybody want to answer maybe let's start with the value of B who wants to make a guess yes not 0.75 is is a good estimate here we're just taking the sample average of all the rewards we've ever seen from stay P I think it's hard to do better than that and now the next question is what's the value of state a zero is one suggestion does anybody have a different suggestion I hear some whispers I don't hear any answers so there's two basically valid answers here in a sense one is to say zero because all the returns we've ever seen from state a have resulted in a total return of zero that's as I assumed that was the reasoning behind the answer of zero there's a different thing you could assume each time we were in state a we transition to 
state B but state B we've estimated to be worth 0.75 because we've just it's basically six or eight but if each of the episodes that ever were in state a always went to state B and we got a reward of zero along the way maybe this should be the same value so I put open seven five and it's not immediately clear which one of these is correct now there's a way to characterize what these two different things are and it turns out Montecarlo and temporal difference methods go to these two different answers the Montecarlo answer will be 0 because Montecarlo will just average all of the returns you've ever seen in this example the only return we've ever seen from say day was 0 so the Montecarlo estimate will be 0 the temporal difference estimate won't be because it turns out temporal difference learning in this batch setting where we have a limited amount of data but we replay that data over and over until you figure at all that we can it turns out the temporal difference learning then converges to the solution of the maximum likelihood Markov model which means it basically finds the solution of the empirical Markov decision process where all of the probability distributions are just the sample averages of the things that happened in this case specifically that means that the probability of ending up is state B from state a is estimated to be one because we've never ever seen an example where a state a didn't go to state B also the reward on that transition is estimated to be zero because that's the only reward we ever saw from A to B in that case that means that the value of state a must be the same as the value state B which is 0.75 so this is kind of interesting these two algorithms if you just run them indefinitely on this data they come come up with different answers now what does that mean in practice which one should you use well the temporal difference learning algorithm you can interpret to be able to exploit the sequentiality and the markov property of the opening environments it essentially builds this empirical model in a sense without explicitly building it but defines the same answer Monte Carlo doesn't exploit this property it basically just says I don't I don't care how this thing is wired up I'm just going to sample from each thing whatever happened in a case I'm not going to use the fact that I know if one state came after the other that these values are somehow related perhaps I'm just going to set each of them as if they're independent now this is a big potential benefit for temporal difference learning especially if you're actually in the Markov decision process later on we'll see that in some cases this is maybe violated a little bit for instance because you do function approximation so even if your state would be fully observable you can't actually tell but it's even clearly in the case where your your your state is partially observable so you can't observe the full environment state in that case maybe it's wrong to assume this Markov property and we'll see that sometimes it's better to go a little bit towards Monte Carlo I say a little bit worse because there's actually algorithms that go in between these which we'll cover in a later lecture so here's a depiction of these algorithms schematically so one way to understand the Montecarlo backup is that there's this big tree of possibilities that might that might happen and we're going to update this value somewhere as a root of a subtree of this maybe bigger tree we're going to update it towards the sample return all the way 
towards the end in the previous lecture let me do that one first we had dynamic programming which went one step but it looked at the full tree which requires you to have the model to be able to build all of these things now temporal difference learning does the intermediate version where you only take one step but we don't use a model so we only ever have this one step so we sample this but then we do bootstrap and we use a value estimate at each of these states this is akin to using heuristic values when you do search the using of these values along the way we call bootstrapping which involves updating towards an estimate which is not just an estimate because it's noisy but it's an estimate because it's a for instance a parametric function it's estimating the state value now Montecarlo does not bootstrap which is an advantage is in some cases and it is a disadvantage in others and dynamic programming and TTD both bootstrap in terms of the sampling multi-car obviously samples temporal difference learning doesn't sample MTD as set before bootstraps and samples and these are the main distinctions between these methods and you can depict that a little bit like this where exhaustive search is all the way here at the top right we haven't really covered that much but it would be if you don't do the dynamic programming thing but you just build your fool tree if you could for a problem and you search all the weight rolls all the leaves and then you you you back up everything now dynamic dynamic programming doesn't do that it takes one step but it does use the fool model and the other dimension is whether you sample or not so exhaustive search is often not used instead typically we use Monte Carlo methods these days because we're interested in problems with huge search trees in which case you just can't do the exhaustive search for instance the game of Go is a good example of this where there's this huge branching factor in you can't do exhaustive search similarly that in dynamic programming you could view temporal difference learning as the sampled version of dynamic programming or you could view it as the bootstrapping version of Monte Carlo and all of these are valid and turns out there's actually also algorithms in the intermediate here and as a set we'll discuss them later but we'll leave that for future lectures okay so now I want to move towards model free control so what we've discussed so far was all estimation we're doing prediction we're estimating the state values but of course we want to eventually use these things to get a better policy to make you aware or as a warning depending on how you want to interpret it some of the material that we're covering right now is also what you'll need to understand in order to do the second reinforcement learning assignments so that might be useful to know but first we're going to do a refresher which is something we've just talked about last lecture which is the idea of policy iteration the slider says generalize policy iteration is what we what we kind of talks about which is the idea of doing this interleaved process of estimating the value of a policy and then improving that policy it's called generalized policy duration rather than just policy iteration when you basically approximate either or both of these steps so we don't necessarily fully evaluate a policy and we don't necessarily fully improve it but if you do a little bit of both or a little bit of either of these it may be fully the other one then it's a case of generalized policy 
iteration and the idea there here was that if you follow this process and you keep on improving your policy eventually you'll end up with the optimal policy in the foody tabular case okay so this is just a refresher from what we discussed last time what we're going to apply this idea of policy iteration on the in the setting that we're in right now and first we're going to discuss the multi Carlow case so what we'll do right now is we'll take Monte Carlo as a policy evaluation algorithm and we're going to use that to also improve our policy in case you just came in you haven't really missed anything yes don't worry so remember we just reduce in Monte Carlo we have a return in expectation this is equal to the true value so that we can use this to evaluate a policy and then we can improve our policy however there's a small immediate bump that we hit because what we did before is we want to greenify but we were using one step roll out of your model which we don't have here for CV there's also fairly easy fix if we just estimate action values rather than state values we can just maximize immediately and it's much easier we don't need them all so this points to something that we'll see quite often is that in is that when you do control people much more typically estimate state action values than they do state values because it's much easier to get a policy out of it you just maximize this de the action values so there's no obvious thing to do which is we do use Monte Carlo policy evaluation to estimate the action values we don't need to go all the way because we do generalize policy to iteration we don't need to fully estimate it but we just improve our estimates a little bit with policy evaluation and then maybe we greed if I but that's probably not a good idea because then we're not exploring and this is something that we've discussed in the second lecture at depth you need to balance exploration and exploitation and this is especially true when you do Montecarlo methods because we don't have the full model so if you never go anywhere you won't learn about it you need to explore to learn about the world for to me again there's a fairly easy solution maybe we just explore a little bit for instance with epsilon greedy turns out the generalized policy iteration idea still applies you don't have to go fully greedy it's enough to be greedy or with respect to your current values then you used to be now if you have an epsilon greedy algorithm you estimate its policy and you get a new value out of that if you then become epsilon greedy even with the same epsilon towards this new values it is still Apollo's me policy improvement step and you can still show that this is a valid thing to do there's a short derivation in the book that shows you that this will converge and then that sends to epsilon optimal policies if you keep your optional fixed now if we want to become truly optimal we need to decay the exploration eventually and the acronym Billy is quite often used in a literature which is short for greedy in the limits within with infinite exploration and what this essentially means formally in the tabular case is that we're going to explore in such a way that we're going to try each action in every state infinite times in the indefinite future doesn't say anything about a ratio she might try certain actions way more often than others but in the long run you have to really try everything infinitely long because we're going to average for sampling so you can't have like complete guarantees of 
optimality without seeing everything indefinitely often infinitely often the this is the infinite exploration part the greedy in a limit means that eventually you're going to be greedy and one a simple example of an algorithm that does that is to do epsilon greedy exploration with an epsilon that decays for instance with 1 over K where K is the number of episodes you've seen you could also use 1 over T or something like that if you don't want to count episodes but rather counting steps and this turns out to be sufficient to get you these properties and this means that algorithms that for instance the Montecarlo policy evaluation thing will eventually converts to the optimal values so the algorithm is then very simple in some sense because we're doing everything fully tabular we're just sample a in episode we're just going to do fully episodic case we're doing Monte Carlo so everything breaks up into episodes also only applies in episodes we're going to increment the counter for each of the actions that we've seen and we're going to update the value for that action towards the return for simplicity I'm not talking about what happens when you are in the same state multiple times in an episode there's more about that in the book for simplicity you could just assume maybe each day's an action in a trajectory is unique then you don't have any issues you just average you bump it to most ones in that episode and then we improve the policy by decaying the epsilon maybe a little bit 1 over K and then we pick a new epsilon greedy policy with respect to our current Q value estimates after this episode now there's a theorem that if you use this the epsilon choice here guarantees that it's CLE greedy in the limit with infinite expiration then the Montecarlo control algorithm depicted here will converge in the limits to the optimal action values which means that if you have those you can easily pick the optimal policy by just acting greedily with respect to those action values this is quite cool we can get the optimal solution but maybe we want to so there's maybe you select catch here in the sense of the normal Monte Carlo catch that this can take a while maybe you need many episodes maybe the episodes are long maybe there's a lot of noise so maybe we want to use temporal difference learning again it has a lower variance you can learn online you can learn from incomplete sequences you could also learn if there really are no episodes if there's just this one continuing stream of data life so a natural idea is to use temporal difference learning instead of Monte Carlo for control which is just to apply temporal difference learning to this queue estimate and maybe if you use epsilon greedy whenever whenever you are in a certain stage you just check your current estimates and you're absolutely agree with respect to that which you might update your estimate every time step there's a natural extension of the TD algorithm that we already saw for state values which just applies exactly the same idea to state action pairs so we have a state action pair at the top we have a transition we see a reward in the next states and then we also consider the next action according to our current policy whatever that is and this gives us an update that looks a lot like the app that we had before for state values in fact it's exactly the same except that we replaced each occurrence of the state with the state action pair this is called because of delay the way these letters spell out that word sarsa state action rewards 
This is quite cool — we can get the optimal solution. But there's maybe a slight catch here, the usual Monte Carlo catch: this can take a while. Maybe you need many episodes, maybe the episodes are long, maybe there's a lot of noise. So maybe we want to use temporal-difference learning again: it has lower variance, you can learn online, you can learn from incomplete sequences, and you could even learn if there really are no episodes, if there's just one continuing stream of data, of life. A natural idea is therefore to use temporal-difference learning instead of Monte Carlo for control, which just means applying temporal-difference learning to this Q estimate. If you use epsilon-greedy, then whenever you're in a certain state you check your current estimates and act epsilon-greedily with respect to them, and you might update your estimates on every time step. There's a natural extension of the TD algorithm we already saw for state values, which applies exactly the same idea to state-action pairs: we have a state-action pair at the top, we have a transition, we see a reward and the next state, and then we also consider the next action according to our current policy, whatever that is. This gives us an update that looks a lot like the one we had before for state values — in fact it's exactly the same, except that every occurrence of the state is replaced by a state-action pair. It's called Sarsa because of the way these letters spell out that word: state, action, reward, state, action. Yes — so let's clarify epsilon-greedy again. The question was how this explores in different directions if you use something like epsilon-greedy. Epsilon-greedy is a very basic exploration method, and just to recap what it does: it picks the greedy action with probability 1 minus epsilon, and it picks uniformly at random across all of the other actions with probability epsilon. So it will indefinitely try everything from every state, but it does so in a way that is sometimes called jittering: it doesn't directedly go somewhere, it just tries an action and then maybe tries the opposite action on the next step. That can happen; in some sense this is the simplest exploration algorithm you could consider. Okay, so we have the Sarsa algorithm now, with a step size — we make small steps because we're sampling — and then we can do generalised policy iteration again, where we use Sarsa as a TD method to improve our estimates a little bit on each time step. We're not doing it on an episode-by-episode basis any more, we're doing it on a time-step-by-time-step basis, and we just continue to explore with an epsilon-greedy policy. That means we're doing epsilon-greedy policy improvement: each time the values change, we immediately adapt our policy to be epsilon-greedy with respect to those values, which is an improvement step over the policy we were using before. This is the full algorithm. We initialise some Q values — in this case they're capitals, because you could also interpret them as random variables if they're stored in a table; that's the convention used in the book. I mostly use the convention that Q values are lower case, because it's a function even if it's a table, but in the tabular case you could basically use either. The algorithm works by, for each episode, initialising a state from some start-state distribution — this is basically part of the MDP, which picks a start state for you — then selecting an action according to the policy derived from your current estimates, for instance epsilon-greedy, and then repeatedly updating the values: take the action, observe the reward and the next state, already choose a next action in that next state, and then update. Finally you assign the state and the action — S becomes S prime and A becomes A prime — to make the loop well-defined, and then you repeat. So the action is chosen here but only actually executed later: in a sense we already picked it, we commit to it now.
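A minimal sketch of the tabular Sarsa loop just described, under the same assumed environment interface as the earlier sketch; the epsilon-greedy helper and the fixed epsilon are illustrative choices.

```python
import random
from collections import defaultdict

def sarsa(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa: on-policy TD control with epsilon-greedy behaviour."""
    Q = defaultdict(float)                           # Q[(s, a)], zero-initialised

    def epsilon_greedy(state):
        actions = env.actions(state)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(state)               # choose A from S
        done = False
        while not done:
            next_state, reward, done = env.step(action)                 # take A, observe R, S'
            next_action = None if done else epsilon_greedy(next_state)  # choose A' from S'
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)]) # move towards R + gamma * Q(S', A')
            state, action = next_state, next_action                     # S <- S', A <- A'
    return Q
```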
Now, what this algorithm does is learn on-policy, and this is an important distinction we're going to make right now. On-policy learning means we're learning the value of the current policy. This is the typical case in policy iteration, where you want to estimate the value of the current policy and then, basically in a separate step, improve that policy. In the case of Sarsa these steps are interleaved at a very fine grain, on a step-by-step basis, but on each step we're still learning about a certain policy. In fact you can — and people have, many times — use Sarsa to just do policy evaluation: you have a certain policy, you act according to it, but for some reason you're interested in the value of that policy, and maybe specifically in the action values rather than the state values, not necessarily to do control. You can do that, and then you would learn the value of that policy. The opposite of this — the inverse, in some sense — is off-policy learning. Off-policy learning means learning about any policy other than the one you're following. So there's new notation here: we're still using pi to denote the policy we're learning about, but there's a new symbol b, which is our behaviour policy, and which may or may not be the same as the target policy pi. I'll get back to that, but before I do, let's go back to the dynamic programming ideas from the previous lecture. Sorry — there's a slight mistake on the slide, which I'll fix before I upload it: the pi here and the star here should really be the iteration index, because these are the updates, not the definitions. The first one is just policy evaluation with dynamic programming. The second one is what we call value iteration, which, as we discussed in the previous lecture, is policy iteration where you greedify each time after doing exactly one update to all of the state values. The bottom two are just the action-value versions of the first two — in this case I did denote them correctly, with the bootstrapping on the value at the previous iteration. Now, there are analogous TD algorithms. The first one — where again this should be a t, and again the action-value case is correct — we already discussed: TD learning; and I discussed Sarsa. By the way, TD learning is often used to refer to all of these, but people also use TD to refer specifically to the top one, where we just sample states; we have a step size to be able to smooth out the noise, and we don't need a model. But we could also build a sample-based version of the one at the bottom here, which comes from value iteration and is trying to do something else: the first two are evaluating a policy — this one corresponds to TD learning, this one corresponds to Sarsa — but what does this one correspond to? It's an algorithm called Q-learning. Before I talk more about Q-learning, note that there were four equations — here's the fourth one — and there's no trivial analogous sampled version of that one. Can someone say why? I'll go back to the previous slide, because that might make it clearer: the only one we don't have a sampled version of is this one. What's different about it compared to the other ones? Yes — you can't sample it, because the max is on the outside of the expectation. That's the main difficulty. Obviously there are ways to approximate it, but it's less trivial; that's why I said there's no trivial way to extend it. All the other ones are expectations of something we can directly sample, so we can just sample them; this one is harder because there is a max outside of the expectation — in fact, if you sampled the thing on the inside, you couldn't take the max over actions any more, because you'd only have it for one action.
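To keep the correspondence explicit, the three sampled updates just discussed can be written out as follows (this is my reconstruction of the slide's notation):

$$
\begin{aligned}
v(S_t) &\leftarrow v(S_t) + \alpha_t\big(R_{t+1} + \gamma\, v(S_{t+1}) - v(S_t)\big) && \text{(TD learning)}\\
q(S_t, A_t) &\leftarrow q(S_t, A_t) + \alpha_t\big(R_{t+1} + \gamma\, q(S_{t+1}, A_{t+1}) - q(S_t, A_t)\big) && \text{(Sarsa)}\\
q(S_t, A_t) &\leftarrow q(S_t, A_t) + \alpha_t\big(R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') - q(S_t, A_t)\big) && \text{(Q-learning)}
\end{aligned}
$$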
Okay. Now, as I said, off-policy learning is about evaluating some target policy while following a different behaviour policy, and this is important because sometimes we want to learn from data that already exists, but we want to learn what would happen if we executed a different policy. For instance, we might have data from observing, say, humans or other agents: they generate trajectories of data, but we might not necessarily be interested in evaluating what they did — we might be interested in what would happen if we had done something different, and that's possible if you do off-policy learning. You can also reuse experience from old policies: if you store these experiences, you can replay them and still learn from them, provided you can learn off-policy, because your current policy might be different. The thing we're going to talk about next is that you can learn about the optimal policy while performing an exploratory policy, and in addition you can actually learn about multiple policies, if you want, while following one. Now, Q-learning can be understood as estimating the value of the greedy policy, because of this max operator here. Note that in the value-iteration case this thing was conditioned on the state and the action, S_t and A_t, which means the reward and the next state are already conditional on that action — so it doesn't really matter which policy you're following in state S_t, because you'll just update the appropriate action value anyway. But at the next state we need to consider a policy, and in this case we consider the greedy policy, which means this will learn about the greedy policy even if you don't follow it. That's very nice, because it means Q-learning will converge to the optimal state-action value function as long as we basically explore indefinitely — but we no longer need GLIE, we no longer have to become greedy in the limit. In fact, in some sense we don't need to explore at all: if we just have a data set in which all states and actions have been tried infinitely often — maybe generated beforehand, while we sample from it — it doesn't really matter how it was generated; as long as everything is in there often enough, Q-learning can find the optimal action value function. So in some sense we're decoupling the thing we're learning about, which is the optimal value, from the thing we're doing, which is the behaviour. Now, here's a practical example of what that means, also from the book. There's a simple grid world set up as follows: you start in the state S at the bottom left, and you take your typical grid-world actions that move you around, but whenever you step into this region denoted 'the cliff', your episode terminates, you start a new episode in state S again, and you get a reward of minus 100 — it hurts, you don't want to do that. On each of the other steps you get a reward of minus 1, which means you do want your episode to end — you just don't want it to end in the cliff, you want it to end at the goal, because then the hurting stops. Now, there's an optimal policy here, which is to step up once, then walk all the way along the edge of the cliff, and then go down into the goal, and this is indeed what Q-learning will learn: it will learn values such that if you are greedy with respect to them, you follow exactly that path. But now let's assume we're not actually following that greedy policy; instead we're following an exploratory policy, an epsilon-greedy policy with a certain epsilon — I think in this case it was 0.1. In that case Q-learning will still learn that this is the best thing to do, and with probability 1 minus epsilon on each step it will try to move along the cliff edge, but on each step along the edge there is an epsilon probability of selecting a random action, and one of those actions is down, which means it might just fall off the cliff.
The algorithm is, in some sense, blissfully unaware of this — it doesn't particularly care, because its goal is to find the value of the greedy policy. Whenever you select a non-greedy action it doesn't particularly mind: it will update the value of that action and note 'oh, that's a bad idea', but it's estimating the values as if you weren't doing that. Conversely, Sarsa is an on-policy algorithm, and it's actually estimating the value of the policy you're really following, including the epsilon exploration. It turns out that if you run that algorithm and at the end look at the greedy policy with respect to the action values Sarsa found, it goes up a little further and then walks all the way towards the goal before going down again, basically leaving a safety buffer between itself and the cliff. This is because the learning algorithm is aware of the exploration that's happening while it's learning: the action values capture the fact that you might take an exploratory action — say down — at some time step, and therefore the action values near the cliff will be lower, so the greedy policy learned by Sarsa in the end walks further away from the cliff. Sometimes this is a good thing, sometimes it's a bad thing: sometimes you want to find the policy that traverses the very narrow path, sometimes you want the policy that's more robust to the exploration that might be happening. But it's a clear distinction. The graph at the bottom shows the reward per episode while you're exploring, while you're following the epsilon-greedy policy, and Q-learning is notably worse in that case. However, if you evaluate the policy found at the end — taking the action values at the end of learning and following the greedy policy with respect to them — Q-learning is better, because it found the shorter path. So it depends what you're after: whether you want to find the optimal policy, or a policy that's robust to the exploration you're using. Note also that in some cases Q-learning might not reach the goal very often here, because it tends to fall into the cliff a lot, which might also hurt learning speed: it might simply take very long to get anywhere, because the algorithm doesn't take the safe route. So these things can also affect the learning dynamics. Okay, we're going to shift gears again, so feel free to stop me if you have questions about Q-learning and Sarsa. Also, as always, if you have questions later, please let us know, for instance on Moodle — others might have the same questions, and it's very helpful for me to get them because it tells me what is and isn't clear, and maybe I can be clearer next time around. I want to talk about an issue with classical Q-learning, and in doing so I want to make a little more explicit what Q-learning is doing: we're bootstrapping on the maximum action value in the next state. I pulled this term out of the Q-learning update and just expanded it — I didn't do anything else, this is an equality — to spell out what it actually means. What it means is that we're evaluating, with our current estimates, the greedy policy in the state at time t plus 1: in that state, we're taking the maximally valued action according to those same estimates.
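Written out, the expansion being referred to is simply the identity

$$
\max_{a} q_t(S_{t+1}, a) \;=\; q_t\Big(S_{t+1},\, \operatorname*{arg\,max}_{a}\, q_t(S_{t+1}, a)\Big),
$$

which makes explicit that the same estimates $q_t$ are used both to select the action (the argmax) and to evaluate it.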
That is the policy we're considering — our target policy, not necessarily our behaviour policy. But it means we're using the same values Q_t, the same estimates, both to select and to evaluate an action. These values are approximate: they're noisy because we sampled to learn them, we're also bootstrapping, and we might be doing function approximation later on — there are many reasons why they might be a little off. And if they're approximate, then in general we are more likely to select overestimated values with this argmax than underestimated ones. If we're more likely to select an overestimated value, and we then use the same values to evaluate that action, it will look good. One way to think about this: abstract away states, actions and everything else, and just think about having ten different estimates of the means of ten different Gaussian distributions. Some of these estimates will be low, some will be high; say the actual mean is zero in all cases. If you just have finite sample averages for each of them, some will be a little below zero and some a little above zero, but if you take the max, it will typically be above zero, even though the actual mean is zero in every case. That's what's happening here, and it causes an upward bias. Now, there's a way around that, which is to decouple the selection of the action from its evaluation. One way to do that is to store two action-value functions, say Q and Q prime. If you do that, we can use one of them — say Q — to select an action, and the other one, Q prime, to evaluate it. What this is essentially doing is saying: we're using the policy that is greedy with respect to one Q function, Q, and we're evaluating it with the other Q function, Q prime. Or we can do the reverse, where we pick the action according to the other one — but each time we select according to one and evaluate with the other. Then, to make sure these really are distinct estimates of the same thing, we randomly pick one of them whenever we do an update: we only update one of the action-value functions for each experience, so they have disjoint sets of experiences they're learning from. This is very akin to cross-validation, where we split the data into two folds, and the overestimation bias is akin to the problem cross-validation tries to solve: the over-optimistic estimate you get if you validate on your training set. That's a good question — if you do double Q-learning, do you need triple Q-learning, or many-fold Q-learning? I've investigated this in the past; I looked at things like five-fold and ten-fold Q-learning. That was a couple of years ago, so I forget the exact results — I think somebody else also followed up on this with a different paper — and there may be a trade-off there. There are also different ways to do it, because if you have many folds, do you use many to select and one to evaluate, or the other way around, one to select and many to evaluate? It's not immediately clear which is better, so there are a couple of design choices. In short: you can definitely do that; whether it's better or worse is unclear, and it may also depend on the domain. So — can you do something similar where you just have two Q-learning algorithms, each in a separate simulation?
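Here is a sketch of the double Q-learning update just described: two tables, a coin flip to decide which one to update, with the updated table selecting the action and the other table evaluating it. The function signature and the dictionary-based tables are my own illustrative choices.

```python
import random
from collections import defaultdict

def double_q_update(Q1, Q2, state, action, reward, next_state, done, actions,
                    alpha=0.1, gamma=1.0):
    """One double Q-learning update on two tabular estimates Q1 and Q2."""
    # Flip a coin to decide which table to update, so the two tables are
    # trained on (roughly) disjoint experience.
    if random.random() < 0.5:
        Q_upd, Q_eval = Q1, Q2
    else:
        Q_upd, Q_eval = Q2, Q1
    if done:
        target = reward
    else:
        # Select the action with the table being updated...
        best = max(actions, key=lambda a: Q_upd[(next_state, a)])
        # ...but evaluate that action with the other table.
        target = reward + gamma * Q_eval[(next_state, best)]
    Q_upd[(state, action)] += alpha * (target - Q_upd[(state, action)])

# Two zero-initialised tables; for acting, you can combine them, e.g. by being
# epsilon-greedy with respect to Q1[(s, a)] + Q2[(s, a)].
Q1, Q2 = defaultdict(float), defaultdict(float)
```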
Is that the question? The answer is no: you have to have this interleaving, where the two estimates know about each other. If you just ran Q-learning in two separate simulations, you would get two overestimated estimates; you need to do it in this coupled way, where one takes the max and the other evaluates it. Good questions, thanks. By the way, one thing to note: we're splitting up the experience to learn these two action-value functions, but that doesn't mean we're necessarily less efficient in terms of how good our policy is when we want to act, because for acting we can just combine them. At the bottom there I said you can add them together — that's fine for acting; if you want a good prediction you should of course average them rather than add them, but if you're only interested in picking an action according to them, you can just add them. So the behaviour here can be different from what we're estimating: we're estimating two different policies, one greedy with respect to one action-value function and one greedy with respect to the other, while we act according to a combination of the two — and that's perfectly fine, because we're doing off-policy learning. Now, does this work, is it a good idea? Here's an example — I guess I could have shown just one of these; I used to have these slides in the other order, so there's a bit of redundancy now. This is a roulette example: there's a single state and a lot of actions that basically lead back to that state. The setting is that the agent is at a roulette table and there are many different gambling actions — roughly 170 — and in actuality all of these gambling actions have an expected reward that is slightly negative, but very noisy. Each time, you can continue betting; for simplicity, let's say you can never go bankrupt, but there is also an action that just quits — leaving the table — which terminates your episode. That is the optimal action to take, because you'd stop gambling and stop losing money. Now, what happens if you run Q-learning? The overestimation is large: if you do synchronous Q-learning updates on all of the actions at the same time, and you do a hundred thousand of these sweeps — so on the order of seventeen million individual updates in total, in a tabular setting with just one state — there's still an overestimate of more than 20 dollars, even though we're betting a single dollar on each step, which means the expected return is actually something like minus five cents. So more than twenty is way off, unrealistically far off: you very rarely get more than 20 dollars from betting single dollars. Of course this is the long-term return — I think the discount was 0.9 or 0.99 or something like that, I don't remember — but still, it's unrealistically high, and it's also positive, which is wrong, because the actual optimal policy is to leave the table, whereas this algorithm estimates that it's much, much better to stay at the table and keep on betting. Conversely, if you use the double Q-learning algorithm from the previous slide, you find that it very nicely estimates something very close to the actual value, and it also learns to leave the table. You can make this more extreme by even paying the agent, say, 10 dollars to leave the table: Q-learning would still keep on gambling, while double Q-learning very quickly takes the money and runs. This is a slightly contrived example, because there's only one state and it's deliberately very noisy, just to show the effect.
So you could ask: does this actually happen a lot in practice? One thing we did: there was a version of Q-learning that was used for the initial results on Atari with DQN. This got scores on the games that were quite good, and people were fairly excited about it. But it turns out that if you just plug double Q-learning into that whole system — this whole agent with lots of bells and whistles, where you only replace the Q-learning update inside it with double Q-learning — you get the blue scores here, compared to the red scores that DQN got. And indeed — I'm not showing it here — you could show that in many of the games there were very clear, unrealistic overestimations of the values. So this does happen in practice. Interestingly, the Atari games this was run on are deterministic, so in that case the overestimations didn't come from noise in the environment: they came from approximation errors and from the noise in the policy that updates these values — the values are still approximate, and taking a max over them still causes overestimation. The differences here were much bigger than I expected them to be, by the way. So why are there some games on which it does worse? A lot of this is just randomness: these scores go up and down a little if you run the experiment again, and these experiments used to take quite a bit of compute and time, so we couldn't run that many repetitions. But there's also a slightly more precise answer: these algorithms really do have different properties, and in some cases the overestimations might actually be helpful in terms of the policy they make you follow — they might steer you towards places that just happen to be good for that specific game. So even though overestimation is bad in general, that doesn't mean it's always bad; it could be good, and maybe that's what happens here on, say, tennis. By the way, we see this much more often, especially in deep reinforcement learning, where deep networks are combined with reinforcement learning: what very often happens when we have an algorithmic improvement is that many, many games get better and a few get worse. This is generally because these games are fairly specific, and some games will have worked well for other reasons than the algorithm being good — they may sometimes have exploited a flaw in the algorithm to still end up with better policies. So in some sense the way to look at these results is not to focus too much on the score of each individual game, but to look more at the general trend. Yes — very good question: why do the overestimations stick, why don't we unlearn them? Actually, we do unlearn them: Q-learning, in this tabular setup, is guaranteed to find the optimal Q values. What I didn't say is how quickly. This line that seems to be plateauing here is still going down; this was run with a decaying step size, and eventually it would find the optimal solution. But if I've given it that many updates already and it's only this far along, that's probably a problem. So the answer is: yes, Q-learning is theoretically sound and it will find the optimal policy; there are just ways to improve on it in practice. By the way, double Q-learning is guaranteed, under the same conditions as Q-learning, to also find the optimal policy — in this case it just gets there much faster.
Yes — so the question is essentially: what is the convergence rate of Q-learning, and how can you reason about it? There has definitely been some work on this, mostly for the tabular case, in which people have shown how quickly Q-learning converges. It depends on many things, as you might expect, but the most important ones are your behaviour policy, the structure of the MDP — whether the MDP itself is connected, in the sense that you can easily cover all of it — and also, very importantly, the step size you pick. Because of the bootstrapping, compared to the flat averaging we were doing in Monte Carlo you need to do something else: the thing we're using to estimate the optimal value is changing, for two reasons — one is the values we're bootstrapping on, and the other is the policy we're following, which is more or less covered by the exploration. You can take account of all of these things and derive rates at which these algorithms converge, and that's already quite complex for the tabular case; for the deep RL case especially, basically nobody knows. Very good question, thanks — by the way, I'm very happy to give concrete pointers to papers for people who are interested in this sort of thing. Okay, so now we can quickly still cover the final topic of this lecture, which is still in the context of off-policy learning: we want to learn about a different policy than the one we're following. I gave you an example of an algorithm that does that — Q-learning estimates the value of the current greedy policy while following maybe something else — but maybe we want to do this more generally. Another way to think about what I'm going to explain next is as a different route to arrive at the Q-learning algorithm, though we'll actually also find a more general algorithm along the way. So here's a very generic statement: we want to estimate a certain quantity, where x is random — I probably should have made it a capital — and is sampled according to some distribution d. We can just write this out; you could turn it into an integral if you prefer, but here I made it a sum, so there's a finite number of elements x, each occurring with probability d(x). The expectation can then simply be written out as this weighted sum of f(x) over each x, weighted by the probability that it occurs. What we're actually interested in is something else: we're sampling from a different distribution, d prime, and we still want to estimate this expectation, but using samples from the other distribution. For this we can use something called importance sampling, which relates the two distributions. One way to see it is by simply multiplying and dividing by d prime of x, and noting that the result is again an expectation — but now an expectation with respect to the distribution d prime — so we can write it in expectation notation again, except the fraction remains. Make sure you follow that step.
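In symbols, the identity just walked through is

$$
\mathbb{E}_{x \sim d}\!\left[f(x)\right] \;=\; \sum_x d(x)\, f(x) \;=\; \sum_x d'(x)\, \frac{d(x)}{d'(x)}\, f(x) \;=\; \mathbb{E}_{x \sim d'}\!\left[\frac{d(x)}{d'(x)}\, f(x)\right],
$$

which is valid whenever $d'(x) > 0$ for every $x$ with $d(x) > 0$ — the condition discussed next.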
What this means, concretely — let me apply it to the reinforcement learning case, plugging in some familiar quantities. We're trying to estimate the expected reward — say just the one-step reward for simplicity, as in a bandit — under a certain target policy pi. We can write that out as a weighted sum of the expected rewards given the state and each action — these you could compute from the distribution of the rewards if you had the MDP, but this is just the definition: the expected reward conditioned on the state and action, weighted by the probability under pi of selecting that action a. We can do the same trick: multiply and divide by the probability of the behaviour policy selecting that action, which again gives an expectation, so we can write it out as an expectation in which we multiply the reward by the probability of selecting the action under the target policy divided by the probability of selecting it under the behaviour. This means we can sample the thing at the bottom and get an unbiased estimate of the thing at the top. So that's what we'll do: we follow a certain behaviour b, we sample this quantity — target policy divided by behaviour policy, times the reward — and it is an unbiased sample of the expected reward under the target policy. Note, as a technicality, but an important one, that b of course can't be zero for this to be valid, which has an intuitive explanation: you can't learn about behaviour that you never do. If you want to learn about a target policy that selects certain actions, you have to have at least a non-zero probability of selecting those actions yourself; then everything is well defined. A different way to say that is that the support of the behaviour distribution needs to cover at least the support of the target distribution. Now we can take this exact same idea and apply it to the sequential case. There's more that could be said about this, but I'll skip to the conclusion, which is that you have to multiply all of these ratios together, for all of the actions you took along the trajectory: say we had a trajectory up to time capital T, then this product goes up to capital T. As before, this is for a full episode — we're doing Monte Carlo — and we're just multiplying together all of these ratios, and this turns out to weight the return in exactly the right way to make it an unbiased estimate of the return under your target policy. So we can just update towards that, using our standard Monte Carlo algorithm: we update towards this target, and there's no bias — it's an unbiased estimate of the true value of the target policy. However, it can dramatically increase the variance. A very simple way to see that: assume our target policy is a deterministic policy that deterministically selects a certain action in each state, and our behaviour policy is more random — it explores. If your episodes are long, that means one of these pi terms is very likely to be zero, which means your whole return gets replaced with zero. There will be a few episodes in which the return is non-zero, but the probability of selecting exactly all of those actions will have been smaller than one at every step: if your behaviour is epsilon-greedy, the b at the bottom is at most roughly 1 minus epsilon, which means that in this product we keep dividing one by 1 minus epsilon, a quantity larger than one — and that's the best case; in some cases it will be epsilon, so we divide by epsilon, which is even larger — which means this whole product can become very big.
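Written out (my notation, with $\rho$ for the product of ratios), the full-episode importance-sampled Monte Carlo update just described is

$$
\rho_{t:T-1} \;=\; \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}, \qquad
v(S_t) \;\leftarrow\; v(S_t) + \alpha\big(\rho_{t:T-1}\, G_t - v(S_t)\big).
$$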
What is essentially happening here is that we're down-weighting trajectories the target policy would never take — all the way to zero in this case — and up-weighting the trajectories that the target policy would take, which may be very rare as full trajectories. It turns out we're re-weighting them in exactly such a way that on average we get the right answer, but if there's only a handful of episodes that ever follow the full target policy, this is going to be very noisy, especially when your episodes are long. Another way to say that: it has very high variance. Is that clear? So essentially what I'm saying is that maybe bias is not the only thing to care about: even though this is unbiased, it might not be ideal. There's a now maybe obvious thing you could do, which is to not use the full return; instead, let's use TD. In that case, if we estimate state values, we only need to re-weight with this single importance-sampling ratio, because we only need to correct for the probability of selecting this one action under the behaviour policy rather than the target policy. This has much lower variance, because we don't have the big product of terms that can each be larger than one — we only have one of them. It also means you can learn even when the policies only agree for one step, rather than needing trajectories in which you happen to pick the same action as the target policy on each and every step: whenever you have a transition within an episode on which the actions match, you can already learn from it. So this is often a much more efficient algorithm. Now, you can apply the same idea to action values, but it turns out to be even simpler, because for the one-step case we don't need to importance sample at all. (Importance sampling is still important to wrap your head around, because it comes back, essentially, when you do multiple steps in some cases.) For the one-step case, we can just pick the action in the next state according to the behaviour policy, but rather than using that sampled action, we re-weight the action values in the next state using the target policy. One way to understand this quantity: assume for a moment that Q is a good estimate of the action values of the target policy. If that were the case, then this weighted sum according to the target policy would also be a good estimate of the state value of that policy, and then Q would remain a good estimate for the target policy. So in some sense this is sound — that was the more intuitive argument — and you can show that it is also exactly the right thing to do. This algorithm is called expected Sarsa, because you can also interpret it as the expected update that Sarsa — which was just an on-policy TD algorithm — would do, if you take the expectation, with respect to the target policy, over the action selected in the next state. Sarsa and expected Sarsa have the same noise in terms of the next state and the reward, but we're already conditioning on the current state and action, so we don't need to correct for the policy's choice of action A_t: we only ever update the value of that action when we actually select it. And it turns out expected Sarsa has exactly the same bias as Sarsa: the bias only comes from the fact that we have an approximate Q — the typical TD bias.
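A minimal sketch of the expected Sarsa update described here; `target_policy_probs(s, a)`, returning the probability the target policy assigns to action `a` in state `s`, is a hypothetical helper, and the rest of the interface mirrors the earlier sketches.

```python
def expected_sarsa_update(Q, state, action, reward, next_state, done, actions,
                          target_policy_probs, alpha=0.1, gamma=1.0):
    """One expected Sarsa update: bootstrap on the expectation of Q under the
    target policy in the next state, so no importance-sampling ratio is needed."""
    if done:
        target = reward
    else:
        # Re-weight the next-state action values by the target policy's probabilities.
        expected_q = sum(target_policy_probs(next_state, a) * Q[(next_state, a)]
                         for a in actions)
        target = reward + gamma * expected_q
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```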
Described that way it's a fully on-policy algorithm, as long as the target policy matches your behaviour policy, but it's a little more general than that: if your behaviour policy is different from your target policy — when b is not equal to pi — we can call this a generalised Q-learning. A special case of this is when the target policy is deterministic, so the sum picks out a single action, and more specifically, if it is greedy, you get exactly Q-learning, because of the max. So this generalises, in some sense, both Sarsa — which is the sampling-based version of this, and fully on-policy as long as your behaviour and your target are the same — and Q-learning, where we're estimating a different policy, in this case the greedy policy. This slide just takes those steps towards Q-learning and makes them explicit: we pick our target policy to be greedy, and then the expected update — expected Sarsa — just becomes Q-learning, as I said on the previous slide. And again, with Q-learning you can use whatever behaviour policy you want, and it will still estimate the value of the target policy, which in this case is the greedy policy with respect to your current values, as discussed before. So now we came from importance sampling and got Q-learning back, whereas previously we just sampled a value-iteration update from dynamic programming and also got Q-learning. It's the same algorithm in both cases, but you can arrive at it in different ways: one basically comes from the Monte Carlo side, the other from the dynamic programming side. Okay, so we have a little bit of time left for questions, if people have them — please feel free to ask, because if you do have a question, you're very likely not the only one with it. Before I end, I just wanted to mention that there will be a new reinforcement learning assignment, which will come out over the weekend — or Monday, I think it says Monday in the schedule — and then you'll have a couple of weeks to do all of these assignments, including the reading week. It will be a little while before we get back to the reinforcement learning lectures: next week there will be two deep learning lectures, then I think the reading week, and the week after that we'll return and talk about policy gradients and deep RL and everything. Thank you.

Today we will be talking about model-free prediction and control. I'll be covering quite a lot of material, and I will also get back to some of it in later lectures, especially when we consider function approximation — and specifically, of course, we'll talk about deep neural networks at some point, but not yet during this lecture. The main background material is Sutton and Barto, chapters five and six. Just to recap the setting — this is the same slide you've seen a couple of times — we're interested in the science of making decisions, especially in a sequential setting where we can interact with the world repeatedly, and to do that we could learn a policy, a value function, or a model. In this lecture we'll mostly be talking about value functions, similar to the previous lecture; in a later lecture we'll talk more about, for instance, how to learn policies directly. The last lecture was about planning with dynamic programming: there we assumed we can observe and interact with the Markov decision process, but we don't really need to sample, because we have access to the full model, so we can just plan.
I say 'just plan' — of course there are ways of doing that that are more or less efficient, and there's a wide literature on the topic, but it is somewhat of a limitation to assume that you have access to the true model. So in this lecture we'll be talking about methods that can learn without needing access to the true model, and can therefore learn directly from interaction with the environment. In particular, we'll cover two distinct but related cases. One is model-free prediction, where the goal is to estimate values. This is called prediction because these values, as you'll remember, are expected returns, and returns are about the future, so a value has the semantics of being about the future — that's why we call it prediction. We'll also talk about model-free control, although we'll limit ourselves to value-based methods for this lecture; in later lectures we'll talk about methods that don't necessarily store values explicitly. In model-free control the goal is to optimise the value function and the policy, again model-free, so in an unknown Markov decision process. What we won't cover yet, but will come in later lectures, is learning the policy directly, and also how to deal with continuous Markov decision processes and deep reinforcement learning — those last two are related. For this lecture we mostly consider the tabular case, so we can consider basically any state value or state-action value to be stored somewhere in an array, which we can update directly. That's not a strong limitation of the methods; it's just for clarity that we consider that case extensively first, and it's also what's done in the Sutton and Barto book. Don't worry, we won't lean on this too heavily, and we'll later extend all of it to the case of arbitrary function approximation, including deep neural networks. Okay, so the idea is to move towards sampling, because we want to get rid of the assumption that we have a true model, and therefore we want to learn from experience samples. We call the direct sampling of complete episode trajectories Monte Carlo. Monte Carlo is a model-free method: we just sample, we need no knowledge of the Markov decision process. We already saw a simple case of this when we discussed bandits in the second lecture, because there was no Markov decision process there — or if there was, it was a very simple one with only one state — but we were already doing something very close to the methods we'll discuss here: we were already sampling, we weren't assuming a model was given, we just sampled rewards and averaged them, and this gave us good estimates of the action value of each action. There are two ways you could write that down. The top one basically says we define an action value Q_t, an estimate for each action, as the sample average of the rewards. The notation is a bit verbose, but the indicator function just means we sum over all time steps while only counting the time steps on which you took this specific action. A simpler way to think about it: just look at all the rewards you've seen for this action and average them. The justification for this being a good thing to do is that this sample average will, due to the central limit theorem, approach the actual average reward, which is the quantity we're interested in.
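Written out — using the reward-after-action indexing adopted in this lecture, which is my reconstruction of the slide — the sample-average action value is

$$
q_t(a) \;=\; \frac{\sum_{k<t} \mathbb{I}(A_k = a)\, R_{k+1}}{\sum_{k<t} \mathbb{I}(A_k = a)},
$$

where $\mathbb{I}(\cdot)$ is the indicator function that picks out the time steps on which action $a$ was taken.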
Equivalently, we can write it as at the bottom, where we've basically turned it into a step-size algorithm: we get a new reward and we just update our estimate a little bit towards that reward. If you pick the step size specifically to be one over the number of times you've selected this specific action, then the two forms are exactly equivalent, but this already points to a more general setting, where we could also think about having, say, a constant step size to deal with non-stationarity, or about the function-approximation cases we'll see in later lectures. One thing to note is that I've changed notation slightly from the bandit lecture, and I mention this just to prevent confusion: conventionally, in the Markov decision process literature, we denote the reward as arriving after the action, at the next time step, because we consider the reward and the next state to arrive together after you've taken the action. So we normally say you're in a state S_t, you take an action A_t, and then you observe a reward R_{t+1} and a state S_{t+1}. When we first discussed bandits we had an action at time t and a reward at the same time t, which is also a common convention in the literature — so there's a slight offset there, and unfortunately both notations are common, so it's good to be aware of that. Here we use the MDP formulation, because we'll be talking about the full sequential case, and the bandit is just the special case where there's only one state. Now, you can already generalise this slightly — as somewhat of a teaser of where we're going, and also to show that these are far more general methods than they might seem at first — by considering bandit problems with a certain context. What we're essentially doing is not yet rolling in the sequentiality, but we are rolling in the idea that there's a certain context in which you take your action. We still consider, at first, episodes that end after exactly one step: you're in a state, you take an action, you get a reward, but your action doesn't affect the next state — in fact, the next state may just be sampled from some independent distribution — and there, again, your goal is to pick an action, and so on. We still want to estimate the expected reward, but it's now conditioned on the state and the action. The reason to put this here is to show that this is still, in some sense, a fairly simple algorithm in the more general case. Even if your state space is really big — for instance a continuous space, maybe pixels or some continuous sensory or motor stream that you observe — you could still think about doing something like this, but you would then more naturally minimise a loss, where you want this function to match the rewards. You have this instantaneous loss down there, you can consider the total loss, for instance summed over all samples you've ever seen, and then you just want to minimise that — and that's essentially just regression. So one way to estimate without using a model is to sample rewards and then estimate by regression; the sample average is just the simplest case of regression, where we assume we can store these things exactly in a table, but the idea is more general. This is a small step into function-approximation land; I will now mostly go back to the tabular case for the rest of the lecture.
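In symbols, the incremental form and the regression view just described are, in my reconstruction of the slide's notation,

$$
q_{t+1}(A_t) \;=\; q_t(A_t) + \alpha_t\big(R_{t+1} - q_t(A_t)\big), \qquad
\alpha_t = \frac{1}{N_t(A_t)} \;\Rightarrow\; \text{exact sample average},
$$

and, for the contextual case, an instantaneous squared-error loss of the form

$$
L_t(q) \;=\; \tfrac{1}{2}\big(q(S_t, A_t) - R_{t+1}\big)^2 .
$$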
We'll get back to the function-approximation setting in much more depth in later lectures; instead, let's roll back in the sequentiality that we discussed in the previous lecture on dynamic programming. We can think of an easy algorithm, in some sense, for policy evaluation. The goal here, just to be clear, is policy evaluation: we're interested in estimating the value of a certain policy — later on I'll talk about how to optimise, but for now we're just doing policy evaluation. One way to do that is to sample a whole trajectory with that policy and get a return. Note that normally I write this return trailing off into the future without committing to a certain end point, but in order to get the full return it needs to terminate at some point, so I've denoted here a time step capital T, somewhere beyond the small t, at which the episode ends. This may be random — which is why it's a capital — though in some cases it's deterministic; in the bandit case, for instance, it's always exactly the next step. We need full returns, so we sample a whole trajectory, a whole episode, by executing the policy, and then we can do the same thing we did for the bandits: we just average. If we have a discrete MDP with a limited number of states that we can observe, we can again store this in a table and average per state — and potentially per action, if you want action values rather than state values; in the more general case you could again use regression. This is called Monte Carlo policy evaluation, because we use Monte Carlo samples of the return, using the full trajectories, and we use them to do policy evaluation. Is that clear so far? Again, stop me whenever. Now, to give you an example that this works, in some sense, consider a simple blackjack setting with 200 states, which we'll just represent separately in a big table — well, not that big. The state is composed of how many points you have right now — if you have fewer than 12 points you just keep getting cards until you reach 12, so you'll always be in a situation with 12 to 21 points — and the goal is to have more points than the dealer. The way this works with playing cards, if you don't know blackjack: all the face cards are worth 10, the ace is worth either 1 or 11 and you can pick whichever is most useful to you, and the number cards are just worth their number. You get dealt these cards from a deck, so you can't see what's coming up, and the dealer shows only one of its cards, so the dealer shows you a little bit of what it has, but not exactly — that's the other part of your state, what the dealer shows, which can be an ace or anything from two to ten. That gives you a hundred states, and then we separately encode one additional bit of information: whether you have a usable ace, because that isn't captured by the number from twelve to twenty-one. A usable ace means you're currently using it to represent eleven, but you could still choose to count it as one later on, if you so desire, which can be useful.
Now, this is just the game, which is by the way not that important — if you missed any of the details or you're confused about it, don't worry, we're just going to use it as an example. There are just two actions. You can stick, which means you keep your current score and it's the dealer's turn: the dealer deals itself cards according to some policy, and at the end you see whether you've won or not. Alternatively you can draw, which means you get another card, but the risk is that you go beyond 21, at which point you lose. That's denoted at the end there: the reward for draw is minus one if your sum of cards goes beyond 21, and then the episode ends; otherwise the action is worth zero. The reward along the way is zero because we're essentially aiming for the win — we encode each step during the episode as being worth zero, because eventually you'll have to stick if you haven't already terminated by going beyond 21. When you stick, the dealer does its thing, and if you won — if your sum of cards is higher than the dealer's — you get plus one; the same applies if the dealer goes beyond 21, in which case you're also considered to have won. If you have the same number of points we just define it to be zero — you could also define the tie-break so that the dealer wins, in which case you'd have a minus one there, but here we define it to be zero — and you get minus one if the dealer ends with more points than you. Now, to me it's not immediately obvious what the values of a certain policy are, but you can define a policy and just check by running Monte Carlo, and that's what was done — this is from the book, it's not what I did. What this shows is the whole state space: the two flat axes are how many points you have and how many points the dealer shows, and it's split into two planes depending on whether you have a usable ace or not. So there are in total 200 states in the column on the left, and the same 200 states again in the column on the right; the difference is that more learning episodes were used for the plots on the right, and you see that at some point the estimates become very stable, because we're just averaging — by the law of large numbers the variance of the estimates goes down, and at some point you get a fairly reliable estimate of the quality of your policy. This example is in the book, so you can go through it again at leisure if you want; the point is just to show that you can find these solutions, that it works, and that the solutions can have some interesting structure that might not be immediately obvious just by looking at the problem. So that was an example of Monte Carlo policy evaluation, but it's not the only algorithm you can use, and it's not always the best one — this is a very small problem and we already needed to generate half a million episodes to get very smooth value functions. So what we'll consider now is something that harks back to the dynamic programming we discussed last lecture. The top two equations here are the same as what you saw last lecture: the first one is just the Bellman equation — the definition of the value that we want to estimate.
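As a reminder, the two equations being referred to are, as I recall them from the previous lecture's notation, the Bellman equation and the dynamic programming policy evaluation update:

$$
v_\pi(s) \;=\; \mathbb{E}\big[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \,\big|\, S_t = s,\, A_t \sim \pi(\cdot \mid s)\big], \qquad
v_{k+1}(s) \;=\; \mathbb{E}\big[R_{t+1} + \gamma\, v_k(S_{t+1}) \,\big|\, S_t = s,\, A_t \sim \pi(\cdot \mid s)\big].
$$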
The second one is policy evaluation with dynamic programming, which you saw last lecture, and which basically turns the first equation into an update. As we discussed, this converges to the true values eventually, but you need a model to do it: you need to be able to roll out this one step. Alternatively, this is just an expectation that we can sample. A naive thing to do would be to sample it and plug the sample in as the new estimate. That's typically not such a good idea, because the sample will be noisy — there might be noise in the reward, there might be noise in the next state — so it's not immediately obvious, and typically it won't be the case, that this is a better estimate than the previous one; it might just be noisier. What's better is to average this target. So essentially we do something very similar to before, but instead of using the Monte Carlo return we use this one step, bootstrapping on an estimated value as a stand-in for the actual value. There are multiple ways to interpret this. One is as an approximate, sample-based version of the dynamic programming algorithm we have right there. A different way is to see it as a return estimate, where we take one real step and then replace the rest of the sampled return with the estimate we already had at the next state; because we use one real sample step, it's a little better, in a sense, than not taking that step at all. This is called temporal-difference learning, and the reason, essentially, is that you can interpret this whole part between the brackets as an error between your current value and the target — an error in the sense of a one-step temporal difference. So that is called the temporal-difference error, and this is called temporal-difference learning. To recap: in (approximate) dynamic programming we use one step of the model and then bootstrap — I put 'approximate' in brackets because you could apply the same ideas to construct the targets for a parametric function that updates towards them, rather than the fully tabular thing we looked at on the previous slide. In Monte Carlo we just sample the full return and use that. The idea of TD learning is basically to combine the two: we sample, but we also still bootstrap, which means we use an estimate of the value at the next state. The first target here is an approximation of the real return because we bootstrap: we're not guaranteed that the value at iteration k, from the previous update, is the same as the expected return from that point onwards — that's the only approximation, since we're using the real model, so there's no sampling error. Conversely, in Monte Carlo there's no approximation error due to bootstrapping — we're not using any estimates; it's an unbiased sample of the actual value of the policy — but we do get sampling error, a bit of noise in some sense. TD learning has both kinds of approximation, but it turns out there is less noise than in Monte Carlo and you don't need the reliance on a model that you have in dynamic programming, and this tends to work quite well in practice. So using this as the target is called temporal-difference learning, and it is a very central idea in reinforcement learning, used basically all over the place, so it's important to understand.
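To make the update concrete, here is a minimal sketch of tabular TD(0) policy evaluation, under the same assumed `env` interface as the sketches earlier in the document; `policy(state)` is a hypothetical callable that samples an action from the policy being evaluated.

```python
from collections import defaultdict

def td0_evaluation(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation: move v(S) towards R + gamma * v(S')."""
    V = defaultdict(float)                            # state values, zero-initialised
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)                    # sample A from the policy being evaluated
            next_state, reward, done = env.step(action)
            target = reward if done else reward + gamma * V[next_state]
            V[state] += alpha * (target - V[state])   # one-step temporal-difference update
            state = next_state
    return V
```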
Now, this has a couple of properties. As I said, it's model-free: you don't require any knowledge of the Markov decision process. And it can learn from incomplete episodes, thanks to the bootstrapping. That may not have been immediately obvious, but for the Monte Carlo return we needed to sample the whole trajectory — we needed the whole return — and if your episodes are really long, it might take a really long time before you can learn. In some specific cases you could even have a policy that currently just loops somewhere — like a robot that drives into a corner and keeps banging its head against it because its policy so far is poorly learned — and if the episode never really terminates, learning would never happen; whereas with temporal-difference learning you can update after each and every transition, which means you can sometimes learn much faster. Yes — sorry, I should probably have been clearer about that earlier: an episode is basically a chunk of experience at the end of which you reach a terminal state. We saw some examples in the previous lecture, though maybe I wasn't as explicit about it as, for instance, the book, which has a whole section on episodes. In the bandit case, each interaction is one episode — that's just the terminology we use. Very good question. So in the grid worlds we've seen so far, typically when you walk into a wall we say the episode continues — you just transition back to the same state you were in — but when you reach the terminal state we say the episode ends, and you get transitioned back to whatever your start state was; you could also have a start-state distribution, it doesn't always have to be the same state. You can also think about it in practice: in a lot of experiments — say a robot arm that tries to grasp something — what people tend to do is put the robot arm in some situation, have it execute an episode in which it tries to grasp the object, and then, whether it was successful or not, the experimenter brings the arm back to its initial position and a new episode starts. So there's a clear division: one trajectory of experience, which is kind of continuous, then a disconnect, then a new episode. What I said here is actually slightly more general, though: it doesn't just apply across episodes — within a long episode, temporal-difference learning can already learn, and you can even learn when there are no episodes at all, when there's just a continuing system, because you can update from each and every transition. Okay, so there's now a rather obvious question: how does this Monte Carlo method compare to temporal-difference learning? What are the properties of the two — when should you use one, and when the other? The goal is still policy evaluation: we just want to evaluate a certain given policy pi. One way is to sample the return and update a little bit towards it; the other is to use a one-step rollout — where, as you now know, this quantity is the TD error, the temporal-difference error — and update towards the one-step target. To get a little bit of intuition, here's an example from the book. The situation is as follows: somebody is driving home from work, and the initial state is that this person is leaving the office. The goal is to estimate how long it takes to get home.
it takes to get home. In this case we are not optimizing anything: we do not, for instance, negate the time so that maximizing return minimizes travel time. We are just predicting how many minutes the journey takes. At each stage we measure the minutes elapsed; the elapsed time between two consecutive states is the reward, and the total reward summed over the whole episode, from leaving the office to arriving home, is what we want to predict. Initially the prediction is 30 minutes, perhaps based on past experience; at the start no time has elapsed yet. The next state is reaching the car: five minutes have elapsed, which is not surprising, but on reaching the car the person notices it is raining, and rain typically makes the drive longer, so the prediction from the car is updated to 35 more minutes, making the total predicted time 40 minutes. A little later things go better than expected: on exiting the highway the total elapsed time is 20 minutes, so the reward along the way was 15, and only 15 more minutes are predicted because the highway was faster than expected, bringing the total down to 35. But after exiting the highway the driver ends up behind a truck, which was unexpected, so even though 30 minutes have elapsed, 10 more minutes are still expected before arriving home. Finally it takes slightly longer than that: after 40 minutes the driver enters the home street, expects three more minutes, and that turns out to be correct. This is one sampled trajectory, with rewards along the way given by the differences in elapsed time in the first column, and the prediction was revised a number of times because of events along the way.

We can understand the difference between Monte Carlo and TD by looking at the updates each would make. In the Monte Carlo case, on the left, the actual outcome was 43 minutes in total, and we would update the estimate from every state visited towards that outcome; more precisely, taking the already elapsed time into account, each state's value is updated towards the time that still remained from that state until the total reached 43. The initial state, leaving the office, is therefore updated towards 43. If instead you do temporal difference learning you get a different picture: in each state we see the actual reward, the elapsed time, and update towards that reward plus the value of the next state (no discounting here, for simplicity), as shown on the right. It may not be immediately obvious which is better, so let me give a personal anecdote. Last year I was walking to one of these lecture halls, which I had visited a number of times already.
I thought I could take a shortcut, so I went into a UCL building I had never entered before, one of those mazy London buildings you are probably familiar with, and got thoroughly lost. At some point I gave up and found my way back to the exit, which was the way I had come in, so I lost some time just reversing the maze, but when I exited the building I knew where I was again and how far it was to the lecture hall. If I had been doing Monte Carlo learning, then at the moment of exiting the building I would have updated my estimate of how long it takes from inside the building purely on the basis of the single sample of how long it then actually took me to reach the lecture hall. If instead I had been doing temporal difference learning, I could have bootstrapped on the very good estimate I already had at the exit. And suppose something unrelated had happened on the way, say I bumped into somebody and talked for a while: with Monte Carlo, that interruption would have been baked into the only sample I would ever use to update the values of the states inside the building, which I had visited only once. That is not a good thing to do here, because it is a noisy estimate; it is much better to bootstrap from the value of the well-known state I had been in many times, and that is something temporal difference methods can do.

So there are pros and cons, and more of them than fit on these slides; we will discuss more. As I said, one important difference is that temporal difference learning can learn before knowing the final outcome, and it can also learn when there is no final outcome at all, in a fully continuing case. In temporal difference learning you update after every step, which means that if you return to the same state within the same episode, its value may already have been updated and you might act differently, which is definitely a benefit. The other caveat, all the way at the bottom of the slide, is that Monte Carlo only works for episodic, terminating environments; that is the hard constraint, and the softer version is that it works better with shorter episodes, simply because you get more data points. This can also be understood as a bias–variance trade-off: as in my example, a single rollout from a state can be quite noisy, so updating all the way towards it may be the wrong thing to do; formally, the return has high variance. The temporal difference target, which may or may not have been obvious, is a biased estimate: it is not unbiased for the actual value, because we bootstrap, but its variance is much lower, since it depends on only one action and one transition, whereas the Monte Carlo return depends on every transition in the episode.

We can see this in a very simple problem, a random walk. The start state is state C, in the middle, and there are essentially two actions, go left or go right, but you can also view this as a Markov reward process
without any actions, in which you simply transition left or right at random. We can still ask what the value of each state is. We will do this in a tabular way, keeping a separate value for every state and updating those values. The reward structure is very simple: the reward is zero everywhere, except that terminating on the right-hand side gives a reward of one. Say we initialize all the values to 0.5. That is a reasonable guess on average, because half the time you exit with a zero and half the time with a one, but for the individual states it is of course wrong; the true values turn out to be 1/6, 2/6, 3/6, 4/6 and 5/6 from left to right, so the initialization happens to be exactly right for state C, whose true value is 3/6 = 1/2.

The initial values are shown as the straight line labelled 0, meaning after zero episodes. Then we run one episode, which in this case happened to terminate on the left, and we see that the TD-updated value now has a slight dip at the leftmost state. The rest of the line is still flat; that is not so clear in the picture because other lines overlap it, but none of the other states were updated. That is because TD learning bootstraps: state C received a reward of 0 and then moved to state B, say, but the value of state B was also 1/2, so the value 1/2 was updated towards 1/2 and did not change. We can keep going, and after 10 episodes (in blue) the line starts to shift, and after 100 episodes it is pretty close to the diagonal line, which denotes the true values. This was run with a fixed step size, so it will never converge exactly to the true line; if you decayed the step size, as you do when you average, it would go all the way there.

We can of course run Monte Carlo on the same problem. There is a different figure in the book; I ran this one myself because I wanted to show what happens when you tune the step size, over a wider range of step sizes, for both methods. The shades go from light, a high step size, to dark, a low step size. In both plots the darkest, black line tends to be near the top and the yellow line also tends to be near the top, which shows that intermediate step sizes tend to be best. That is generally true: over a wide enough range there is a U-shaped curve (or an inverted U, if you are maximizing rather than minimizing) relating performance to step size. The step size was kept fixed, not decayed, for both methods here.

[Question: how do we actually calculate that first updated point?] Good question; let's step through it
explicitly. The update is the TD update: we take the previous value, which is 1/2, and add the step size times the target minus the current value, the target minus the value being the TD error. The step size here happens to be 0.1, which means the value moves exactly 10% of the way towards zero, and that is how much it dropped: it must be 0.45 now, because the target is the reward of zero plus the value of the next state, and the value of the terminal state is defined to be zero, since no more reward can follow, so the target is exactly zero. If we had exited at the other end, the value would have moved 10% of the way up towards one, because there the reward is one and the next, terminal state is again worth nothing. On later episodes we update towards values that are more and more different, so there will be intermediate updates as well. I did not put the algorithm explicitly on the slides, but it is described in the book; this slide shows a number of different step sizes, including, in orange near the bottom and decreasing quite quickly, the step size of 0.1 used for the previous figure.

There are a couple of things I want to point out from this graph. One is that the best TD methods do better than the best Monte Carlo methods on this specific setup. For both methods the darkest line, the lowest step size, sits at the top: it is very smooth and decreasing steadily, but that step size is a bit too small for how long we ran the experiment; run longer, it would eventually reach very good performance, but with only 100 episodes you should use a bigger step size to get there faster. If the step size is too big, however, as with the yellow curve in both plots, the error stabilizes at a higher level: because we keep updating with an undecayed step size, there is irreducible noise in the updates. Note also that the noise in the Monte Carlo returns is bigger than the noise in the temporal difference targets; the variance is higher. There is some bias in the temporal difference case, but it shrinks over time as the values become more accurate, so asymptotically both methods reach the true values; for TD the bias is only transient and the variance is lower, which is why for certain step sizes, such as 0.1, it simply does a lot better. The error does not go all the way to zero for any fixed step size, because of the irreducible noise from the stochastic updates; for the error to vanish completely we would also have to slowly decay the step size to zero. To be completely clear about what is plotted: it is the root mean squared error over all the states, so zero would mean every state value is estimated exactly right. Any questions?
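As a rough sketch of the random walk experiment just described (the five states A–E; the helper names are my own, and to reproduce the figure one would sweep the step size and average the error over many independent runs):

```python
import random

TRUE_VALUES = {s: (i + 1) / 6 for i, s in enumerate("ABCDE")}   # 1/6, 2/6, ..., 5/6

def td_episode(V, alpha):
    """One episode of online TD(0) on the random walk, starting from C."""
    pos = 2
    while True:
        s = "ABCDE"[pos]
        pos += random.choice([-1, 1])
        terminal = pos < 0 or pos > 4
        reward = 1.0 if pos > 4 else 0.0                 # +1 only when exiting on the right
        next_v = 0.0 if terminal else V["ABCDE"[pos]]    # terminal states are worth 0
        V[s] += alpha * (reward + next_v - V[s])         # undiscounted TD(0) update
        if terminal:
            return

def mc_episode(V, alpha):
    """One episode of constant-step-size Monte Carlo on the same walk."""
    pos, visited = 2, []
    while 0 <= pos <= 4:
        visited.append("ABCDE"[pos])
        pos += random.choice([-1, 1])
    G = 1.0 if pos > 4 else 0.0                          # the return is just the final reward
    for s in visited:
        V[s] += alpha * (G - V[s])

def rms_error(V):
    return (sum((V[s] - TRUE_VALUES[s]) ** 2 for s in "ABCDE") / 5) ** 0.5

for update in (td_episode, mc_episode):
    V = {s: 0.5 for s in "ABCDE"}                        # initial estimates of 1/2
    for _ in range(100):
        update(V, alpha=0.1)
    print(update.__name__, rms_error(V))
```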
There are other differences between Monte Carlo and temporal difference learning, and this one especially may not be immediately obvious. The bias–variance point is fairly intuitive if you think about it: the Monte Carlo return is simply noisier, so it has higher variance, but it is unbiased. There is, however, also a difference when you consider what happens with only a limited amount of experience. Suppose we sampled K episodes, each lasting some number of steps; left to right is time within an episode, and each row is another episode. The idea is to look at what these algorithms do if we only have this limited set of experience and we repeatedly show it to them until they have learned everything they can from it. It turns out the algorithms are not equivalent: even though both converge to the same values given infinite experience and an appropriately decaying step size, with a finite amount of experience, learned from exhaustively, they do not find the same answer.

To understand this, consider a small example with eight episodes and only two states, A and B, with no discounting. We were in state A only once: in the first episode we started in A, got a reward of 0, moved to B, got another reward of 0, and the episode terminated. The second episode started in B, received a reward of 1, and terminated, and the remaining episodes are similar, starting in B with a reward that is sometimes 1 and sometimes 0. Now, what is the value of A, and what is the value of B? Let's start with B: 0.75 is a good estimate, the sample average of all the rewards we have ever seen from B, and it is hard to do better than that. Next, what is the value of A? Zero is one suggestion; does anybody have a different one? There are essentially two defensible answers. One is zero, because every return we have ever seen from A totalled zero; I assume that was the reasoning behind that answer. The other reasoning is: every time we were in A we transitioned to B with a reward of 0 along the way, and we estimate B to be worth 0.75 (six out of eight), so perhaps A should be worth 0.75 as well. It is not immediately clear which is correct.

There is a way to characterize these two answers, and it turns out Monte Carlo and temporal difference methods converge to exactly these two different values. The Monte Carlo answer is 0, because Monte Carlo simply averages all the returns observed from a state, and the only return ever observed from A was 0. The temporal difference estimate is different: in this batch setting, where we have limited data but replay it over and over until we have learned all we can, TD learning converges to the solution of the maximum-likelihood Markov model, the empirical Markov decision
process, in which all the probability distributions are the sample averages of what actually happened. In this case that means the probability of ending up in state B from state A is estimated to be one, because we never saw A go anywhere else, and the reward on that transition is estimated to be zero, because that is the only reward we ever saw from A to B. Under that model, the value of state A must equal the value of state B, which is 0.75. So this is quite interesting: run indefinitely on the same data, these two algorithms come up with different answers.

What does that mean in practice, and which should you use? Temporal difference learning can be interpreted as exploiting the sequentiality and the Markov property of the environment: it effectively builds this empirical model, without constructing it explicitly, and finds the same answer that model would give. Monte Carlo does not exploit that property; it essentially says, I do not care how the states are wired together, I will just use whatever returns I sampled from each of them, and I will not use the fact that one state followed another to relate their values. This is a big potential benefit for temporal difference learning, especially when you really are in a Markov decision process. Later we will see cases where that assumption is violated to some degree, for instance because of function approximation (so that even a fully observable state cannot be told apart perfectly), and more clearly when the state is partially observable, so you cannot observe the full environment state. In those cases assuming the Markov property may be wrong, and it can be better to move a little towards Monte Carlo. I say "a little", because there are algorithms that sit in between the two, which we will cover in a later lecture.
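To make the batch example concrete, here is a small sketch (the episode encoding is my own) that replays the eight A/B episodes over and over until batch TD(0) and batch Monte Carlo settle on their respective answers of roughly 0.75 and 0 for state A:

```python
# Eight episodes as lists of (state, reward, next_state); None marks termination.
episodes = ([[("A", 0.0, "B"), ("B", 0.0, None)]]
            + [[("B", 1.0, None)]] * 6
            + [[("B", 0.0, None)]])

alpha = 0.01
V_td = {"A": 0.0, "B": 0.0}
V_mc = {"A": 0.0, "B": 0.0}

for _ in range(20000):                         # replay the same batch over and over
    for episode in episodes:
        G = sum(r for _, r, _ in episode)      # undiscounted return of the whole episode
        for s, r, s_next in episode:
            # Batch TD(0): bootstrap on the current estimate of the next state (0 if terminal).
            target = r + (0.0 if s_next is None else V_td[s_next])
            V_td[s] += alpha * (target - V_td[s])
            # Batch Monte Carlo: update towards the return from this state onwards.
            V_mc[s] += alpha * (G - V_mc[s])
            G -= r                             # remaining return for the next state

print(V_td)   # roughly {'A': 0.75, 'B': 0.75}  -- the maximum-likelihood MDP answer
print(V_mc)   # roughly {'A': 0.0,  'B': 0.75}  -- the average-of-observed-returns answer
```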
Here is a schematic depiction of these algorithms. One way to understand the Monte Carlo backup is that there is a big tree of possibilities, and we update the value at the root of a subtree towards one sampled return, followed all the way to the end of the episode. Dynamic programming, from the previous lecture, goes only one step deep but considers the full width of the tree, which requires a model to enumerate all those possibilities. Temporal difference learning is the intermediate version: we take one step and do not use a model, so we only have that single sampled transition, but then we bootstrap, using a value estimate at each of the next states, which is akin to using heuristic values in search. Using those estimated values along the way is what we call bootstrapping: updating towards an estimate, an estimate not merely in the sense of being noisy but in the sense of being, for instance, a parametric function approximating the state value. Monte Carlo does not bootstrap, which is an advantage in some cases and a disadvantage in others; dynamic programming and TD both bootstrap. In terms of sampling, Monte Carlo obviously samples, dynamic programming does not, and TD, as said, both samples and bootstraps. These are the main distinctions between the methods.

You can depict this along two dimensions, with exhaustive search all the way at the top right. We have not really covered exhaustive search, but it would mean not doing the dynamic programming trick and instead building the full tree, if you could, searching all the way to the leaves and backing everything up. Dynamic programming does not do that: it takes one step, but it uses the full model. The other dimension is whether you sample or not. Exhaustive search is rarely used; these days we are typically interested in problems with huge search trees, where it is simply infeasible. The game of Go is a good example, with its enormous branching factor. Instead, Monte Carlo methods are used. Similarly, you can view temporal difference learning either as the sampled version of dynamic programming or as the bootstrapping version of Monte Carlo; both views are valid, and there are also algorithms in the intermediate region, which, as I said, we will leave for future lectures.

Okay, now I want to move on to model-free control. Everything so far was estimation: we were doing prediction, estimating state values, but eventually we want to use these estimates to obtain a better policy. To make you aware, or as a warning, depending on how you want to take it: some of the material we are covering now is also what you will need to understand for the second reinforcement learning assignment. First, a refresher on something from last lecture: policy iteration, or, as the slide says, generalized policy iteration. This is the interleaved process of estimating the value of a policy and then improving that policy. It is called generalized policy iteration, rather than just policy iteration, when either or both of these steps are approximated: we do not necessarily fully evaluate the policy, and we do not necessarily fully improve it, but doing a little of each (or a little of one and all of the other) still counts, and the idea is that if you keep following this process, you eventually end up with the optimal policy in the fully tabular case.

Now we apply this idea of policy iteration to the setting we are in, starting with the Monte Carlo case: we take Monte Carlo as the policy evaluation algorithm and use it to improve the policy. Remember that the Monte Carlo return is, in expectation, equal to the true value, so we can use it to evaluate a policy and then improve that policy. However, we immediately hit a small bump: previously, to greedify, we used a one-step rollout of the model, which we do not have here. Fortunately there is a fairly easy fix: if we estimate action values rather than state values, we can maximize over them directly, which is much easier, and we
do not need the model at all. This points to something we will see quite often: when doing control, people much more typically estimate state-action values than state values, because it is much easier to extract a policy, you just maximize over the action values. So the obvious thing to do is to use Monte Carlo policy evaluation to estimate the action values. We do not need to evaluate all the way, because we are doing generalized policy iteration; we just improve our estimates a little with policy evaluation and then perhaps greedify. But fully greedifying is probably not a good idea, because then we would not explore, and as we discussed at length in the second lecture, you need to balance exploration and exploitation. This is especially true for Monte Carlo methods, because we do not have the full model: if you never go somewhere, you will never learn about it, so you must explore to learn about the world. Again there is a fairly easy fix: explore a little, for instance with epsilon-greedy. The generalized policy iteration idea still applies: you do not have to become fully greedy, it is enough to be greedier with respect to your current values than you used to be. If you have an epsilon-greedy policy, estimate its values, and then become epsilon-greedy (even with the same epsilon) with respect to those new values, that is still a policy improvement step, and you can show it is a valid thing to do; there is a short derivation in the book showing that this converges, to epsilon-optimal policies if you keep epsilon fixed.

If we want to become truly optimal, we eventually need to decay the exploration. The acronym GLIE is often used in the literature, short for "greedy in the limit with infinite exploration". Formally, in the tabular case, it means two things. The infinite-exploration part is that we explore in such a way that every action in every state is tried infinitely often in the indefinite future; it says nothing about the ratios, you may try some actions far more often than others, but in the long run everything must be tried infinitely often, because we are averaging samples and cannot have real guarantees of optimality without seeing everything infinitely often. The greedy-in-the-limit part means that eventually the policy becomes greedy. A simple example of an algorithm with these properties is epsilon-greedy exploration with an epsilon that decays, for instance as 1/k where k is the number of episodes seen so far (you could also use something like 1/t if you prefer counting steps rather than episodes). This turns out to be sufficient, and it means that, for instance, Monte Carlo control with such exploration eventually converges to the optimal values.

The algorithm is then very simple, since everything is fully tabular and episodic (we are doing Monte Carlo, so everything breaks up into episodes, and this only applies to episodic problems): sample an episode, increment the counter for each state-action pair encountered, and update the value of that pair towards the return. For simplicity I am not discussing what
happens when you visit the same state-action pair multiple times within an episode; there is more about that in the book. For simplicity you can assume every state-action pair in a trajectory is unique, so there is no issue: you just average, bumping each count at most once per episode. We then improve the policy by decaying epsilon a little, say to 1/k, and picking a new epsilon-greedy policy with respect to the current action-value estimates after the episode. There is a theorem: if the choice of epsilon guarantees GLIE, greedy in the limit with infinite exploration, then the Monte Carlo control algorithm depicted here converges in the limit to the optimal action values, from which you can read off the optimal policy by acting greedily. That is quite cool: we can get the optimal solution. But there is the usual Monte Carlo catch that this can take a while: you may need many episodes, the episodes may be long, and there may be a lot of noise. So perhaps we want to use temporal difference learning instead: it has lower variance, you can learn online, you can learn from incomplete sequences, and you can even learn when there are no episodes at all, just one continuing stream of experience.

The natural idea is therefore to use temporal difference learning instead of Monte Carlo for control: apply TD learning to the action-value estimates, behave epsilon-greedily with respect to the current estimates whenever you are in a state, and update the estimates on every time step. There is a natural extension of the TD algorithm we saw for state values that applies exactly the same idea to state-action pairs: we have a state-action pair, we see a transition with a reward and a next state, and we also consider the next action chosen by our current policy, whatever it is. The resulting update looks just like the one we had for state values, except that every occurrence of a state is replaced by a state-action pair. It is called Sarsa, because the quantities involved spell out state, action, reward, state, action.

[Question: how does this explore in different directions if you use epsilon-greedy?] Let me clarify epsilon-greedy again. It is a very basic exploration method: it picks the greedy action with probability 1 − ε, and with probability ε it picks uniformly at random among the other actions. So it will keep trying everything from every state indefinitely, but in a way that is sometimes called jittering: it does not head anywhere in a directed fashion, it just tries an action and may try the opposite action on the next step. It is perhaps the simplest exploration algorithm you could consider.

So now we have the Sarsa update, with a step size so that we take small steps, because we are sampling. We can then do generalized policy iteration again, using Sarsa as the TD method to improve our estimates a little on each time step; we are no longer working episode by episode but time step by time step, and we keep exploring, for instance with an epsilon-greedy policy.
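A minimal sketch of epsilon-greedy action selection as just described, over a list of action values (note that some formulations instead draw uniformly over all actions, including the greedy one, with probability ε):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Greedy with probability 1 - epsilon, otherwise uniform over the non-greedy actions."""
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])
    if len(q_values) == 1 or random.random() >= epsilon:
        return greedy
    return random.choice([a for a in range(len(q_values)) if a != greedy])
```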
Being epsilon-greedy with respect to the values each time they change means we are doing an epsilon-greedy policy improvement step: whenever the values change, we immediately adapt the policy to be epsilon-greedy with respect to them, which is an improvement over the policy we were using before. The full algorithm is as follows. Initialize some Q values; in the book they are written with capital letters, because you can interpret the stored table entries as random variables, whereas I mostly use lowercase because it is a function even when it is a table, but in the tabular case either convention works. For each episode, initialize a state from some start-state distribution (this is just the part of the MDP that picks a start state for you), then select an action according to the policy derived from the current estimates, for instance epsilon-greedy. Then repeatedly: take the action, observe a reward and the next state, already choose the next action in that next state, update the value of the current state-action pair, and finally assign S ← S′ and A ← A′ so that the loop is well defined. So the next action is chosen before it is actually executed: we commit to it at the time of the update.

This algorithm learns on-policy, and that is an important distinction to make now. On-policy learning means learning the value of the current policy. That is the typical case in policy iteration, where you estimate the value of the current policy and then, in a separate step, improve it; in Sarsa these steps are interleaved at a very fine grain, step by step, but on each step we are still learning about one particular policy. In fact you can, and people often do, use Sarsa purely for policy evaluation: follow a fixed policy because, for some reason, you are interested in its value, perhaps specifically its action values rather than its state values, without doing control at all, and Sarsa will learn the value of that policy. The counterpart of this is off-policy learning: learning about any policy other than the one you are following. There is some new notation here: we still use π for the policy we are learning about, the target policy, and we now use b for the behaviour policy, which may or may not be the same as the target policy π. I will get back to that shortly.
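Before moving on, a minimal sketch of the Sarsa loop just described, reusing the `epsilon_greedy` helper above and the same hypothetical gym-style environment interface as before:

```python
from collections import defaultdict

def sarsa(env, num_episodes, num_actions, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(lambda: [0.0] * num_actions)          # tabular action values
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q[state], epsilon)        # choose A in S from the current policy
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q[next_state], epsilon)   # already commit to A'
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action       # S <- S', A <- A'
    return Q
```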
Before I do, let us go back to the dynamic programming updates from the previous lecture. (There is a slight mistake on the slide, which I will fix before uploading it: the π and the ∗ in the first two should be k's, since these are the updates rather than the definitions.) The first is policy evaluation with dynamic programming. The second is what we call value iteration which, as discussed last lecture, is policy iteration where you greedify after doing exactly one update to all the state values. The bottom two are the action-value versions of the first two, there denoted correctly, bootstrapping on the values from the previous iteration.

There are analogous TD algorithms (where, again, the first one should have a t subscript; the action-value one is written correctly), and we have already discussed two of them. The first corresponds to TD learning; by the way, "TD learning" is sometimes used to refer to all of these methods, and sometimes specifically to this state-value one, where we sample the states, use a step size to smooth out the noise, and need no model. The second corresponds to Sarsa. We can also build a sample-based version of the update at the bottom, value iteration, which is trying to do something different: the first two evaluate a policy, whereas this one leads to an algorithm called Q-learning. Before saying more about Q-learning, note that there were four equations, and there is no trivial sampled analogue of the fourth one. Can someone say why? Looking back at the previous slide makes it clearer: you cannot sample it, because the max is on the outside of the expectation. That is the main difficulty; there are ways to approximate it, but it is not trivial, which is why I said there is no trivial way to extend it. All the other updates are directly expectations of something we can sample; this one is harder, because if you sample the quantity inside, you only have it for the one action you took, so you can no longer take the max over actions.

Now, as I said, off-policy learning is about evaluating some target policy while following a different behaviour policy, and this is important for several reasons. Sometimes we want to learn from data that already exists: we might observe, say, humans or other agents generating trajectories, and rather than evaluating what they did, we may want to learn what would have happened if we had done something different, which is possible with off-policy learning. We can also reuse experience from old policies: if we store experiences, we can replay them and still learn from them, even though the current policy is different. The case we discuss next is learning about the optimal policy while following an exploratory policy, and in addition we can learn about multiple policies at once while following a single one. Q-learning can now be understood as estimating the value of the greedy policy, because of the max operator. Note that the update is conditioned on the state and action S_t and A_t, so the reward and next state are already conditional on that action; it does not really matter which policy you follow in state S_t, because you simply update the value of whichever action you took. It is at the next state that a policy must be considered, and Q-learning considers the greedy one.
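For comparison, a sketch of the Q-learning update under the same assumptions as the Sarsa sketch; the only change is that the target bootstraps on the maximum action value in the next state, regardless of which action the (epsilon-greedy) behaviour actually takes next:

```python
from collections import defaultdict

def q_learning(env, num_episodes, num_actions, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(lambda: [0.0] * num_actions)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(Q[state], epsilon)    # behaviour: epsilon-greedy
            next_state, reward, done = env.step(action)
            # Target policy: greedy in the next state, whatever we actually do next.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```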
This means Q-learning learns about the greedy policy even if you never follow it, which is very nice: Q-learning converges to the optimal state-action value function as long as we keep exploring everything indefinitely, but we no longer need GLIE, we never have to become greedy in the limit. In fact, in some sense we do not need to explore at all: if we just have a data set in which every state and action has been tried infinitely often, perhaps generated beforehand and merely sampled from, then it does not matter how it was generated; as long as everything appears often enough, Q-learning can find the optimal action values. In some sense we are decoupling the thing we are learning about, the optimal values, from the thing we are doing, the behaviour.

Here is a practical example of what that means, also from the book: a simple grid world that works as follows. You start in the state S at the bottom left and move around as in a typical grid world, but whenever you step into the region marked "cliff" the episode terminates, you restart in S, and you receive a reward of −100, because falling hurts and you do not want that. Every other step gives a reward of −1, so you do want the episode to end, just not in the cliff: you want it to end at the goal, because then the hurting stops. The optimal policy is to step up once, walk straight along the edge of the cliff, and then step down into the goal, and this is what Q-learning learns: acting greedily with respect to the values it finds follows exactly that path. But suppose we are not following that greedy policy; instead we follow an exploratory, epsilon-greedy policy with, I think, ε = 0.1 in this case. Q-learning will still learn that walking along the edge is the best thing to do, and with probability 1 − ε it will try to move along the cliff on each step, but on every step along the edge there is an ε probability of a random action, one of which is "down", so it may simply fall off the cliff. The algorithm is, in a sense, blissfully unaware: its goal is to find the value of the greedy policy, so whenever a non-greedy action is selected it does not particularly care, it updates the value of that action, notes that it was a bad idea, and keeps estimating the values as if such actions were never taken. Sarsa, by contrast, is an on-policy algorithm: it estimates the value of the policy you are actually following, including the epsilon exploration. If you run Sarsa and then, at the end, look at the greedy policy with respect to the action values it found, that policy goes up a bit further and walks towards the goal along a path that leaves a safety buffer between itself and the cliff. The learning algorithm is aware of the exploration happening while it learns: the action values capture the fact that a random downward action might be taken at some step, and therefore the values near the cliff are lower.
The greedy policy learned by Sarsa therefore ends up walking further away from the cliff. Sometimes that is a good thing, sometimes a bad thing: sometimes you want the policy that traverses the narrow optimal path, sometimes you want a policy that is more robust to the exploration that may still be happening. The graph at the bottom shows the reward per episode while exploring, that is, while following the epsilon-greedy policy, and there Q-learning is notably worse. However, if you evaluate the policies found at the end, by taking the final action values and following the greedy policy with respect to them, Q-learning is better, because it found the shorter path. So it depends on what you are after: the optimal policy, or a policy that is robust to the exploration you use. Note also that Q-learning may reach the goal less often during learning, because it keeps falling into the cliff, which can hurt learning speed in some cases, since it may take a long time to get anywhere when the algorithm does not take the safe route; these things can affect the learning dynamics too.

We are going to shift topic again, so feel free to stop me if you have questions about Q-learning and Sarsa. As always, if you have questions later, please ask, for instance on Moodle; others may have the same questions, and it also helps me understand what is and is not clear, so I can be clearer next time.

I want to talk about an issue with classical Q-learning, and in doing so make more explicit what Q-learning is doing: we bootstrap on the maximum action value in the next state. I have pulled that term out of the Q-learning update and simply expanded it; this is an equality, nothing else has changed, it just spells out what the term means. What it means is that in the state at time t + 1 we evaluate, with our current estimates, the greedy policy: we take the maximally valued action according to those same estimates in that next state. That is the policy we are considering, our target policy, not necessarily our behaviour policy. But it also means we are using the same values Q_t, the same estimates, both to select and to evaluate an action. Those values are approximate: they are noisy because we learned them from samples, we are bootstrapping, and later we may be doing function approximation; there are many reasons why they can be a little off. If they are approximate, then in general the argmax is more likely to select overestimated values than underestimated ones, and if we then use the same values to evaluate the selected action, it will look good. One way to think about this, abstracting away states and actions entirely: suppose you have ten different estimates of ten Gaussian distributions whose true means are all zero. With finite sample averages, some estimates will be a little below zero and some a little above, but if you take the maximum of the estimates it will typically be above zero, even though every true mean is zero. That is exactly what happens here, and it causes an upward bias.
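A tiny numerical illustration of that effect, with no reinforcement learning involved: ten quantities whose true values are all zero, each estimated by a small sample average; the maximum of the estimates is clearly positive on average, even though the maximum of the true values is zero.

```python
import random

def average_max_of_estimates(num_estimates=10, samples_each=5, trials=10000):
    total = 0.0
    for _ in range(trials):
        # Each true value is 0; we only ever see a noisy sample mean of a few draws.
        estimates = [sum(random.gauss(0.0, 1.0) for _ in range(samples_each)) / samples_each
                     for _ in range(num_estimates)]
        total += max(estimates)            # the quantity a max-based bootstrap would use
    return total / trials

print(average_max_of_estimates())          # clearly positive (roughly 0.6-0.7 here), not 0
```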
There is a way around this, which is to decouple the selection of the action from its evaluation. One way to do that is to store two action-value functions, say Q and Q′. We can then use one of them, say Q, to select an action and the other, Q′, to evaluate it: we take the policy that is greedy with respect to Q and evaluate it with Q′, or we do the reverse, picking the policy according to the other one; either way, we select with one and evaluate with the other. To make sure these really are distinct estimates of the same thing, whenever we do an update we randomly pick which of the two action-value functions to update, so they learn from disjoint sets of experience. This is very much like cross-validation, where the data is split into two folds; the overestimation bias is akin to the over-optimistic estimate you get if you validate on your own training set, which is exactly the problem validation is meant to solve.

[Question: do you need to go further, triple Q-learning, many-fold Q-learning?] Good question. I investigated this in the past, looking at things like 5-fold and 10-fold Q-learning; that was a few years ago so I forget the exact results, and I think somebody else also followed up with a different paper. There may be a trade-off, and there are several design choices: with many folds, do you use many to select and one to evaluate, or one to select and many to evaluate? It is not immediately clear which is better. So in short: you can certainly do it; whether it is better is unclear, and it probably also depends on the domain. [Question: can you instead just run two Q-learning algorithms in separate simulations?] No, you need the interleaving where each one uses the other: two independent Q-learners would each produce overestimated estimates; you need the structure where one takes the argmax and the other evaluates it.

One more thing to note: we are splitting the experience to learn two sets of action values, but that does not necessarily make us less efficient when it comes to acting. For acting we can simply combine them, as noted at the bottom of the slide, by adding them together; that is fine for selecting actions (for a good prediction you should of course average rather than add, but for picking an action the sum is enough). So the behaviour can differ from the two things we are estimating: we estimate two policies, one greedy with respect to each action-value function, while acting according to a combination of the two, which is perfectly fine, because we are doing off-policy learning.
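A sketch of tabular double Q-learning under the same assumed interface: two tables, a coin flip decides which one is updated, the updated table selects the argmax in the next state, the other table evaluates it, and behaviour is epsilon-greedy on the sum of the two tables.

```python
import random
from collections import defaultdict

def double_q_learning(env, num_episodes, num_actions, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q1 = defaultdict(lambda: [0.0] * num_actions)
    Q2 = defaultdict(lambda: [0.0] * num_actions)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            combined = [a + b for a, b in zip(Q1[state], Q2[state])]   # fine for acting
            action = epsilon_greedy(combined, epsilon)
            next_state, reward, done = env.step(action)
            # Randomly pick which table to update; it selects, the other evaluates.
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            best = max(range(num_actions), key=lambda a: A[next_state][a])
            target = reward + (0.0 if done else gamma * B[next_state][best])
            A[state][action] += alpha * (target - A[state][action])
            state = next_state
    return Q1, Q2
```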
Does this work, and is it a good idea? Here is an example (I used to show these plots on separate slides, so there is a bit of redundancy). The setting is roulette: there is a single state, and a lot of actions, roughly 170 different gambling actions, that all lead back to that same state, because the agent is standing at a roulette table. In reality every one of these gambling actions has a slightly negative expected reward, but with a lot of noise, and you can keep betting; for simplicity there is no bankroll state, so you can never go bankrupt, but there is also an action that leaves the table, terminating the episode, and this is the optimal action, because it stops you from losing money. What happens if you run Q-learning? The overestimation is so large that after a hundred thousand synchronous Q-learning updates over all the actions (millions of individual tabular updates in total, for a problem with a single state) the value of being at the table is still estimated at more than 20 dollars, even though we bet a single dollar at a time and the expected return of a bet is roughly minus five cents. More than 20 dollars is unrealistically far off: you very rarely get more than 20 dollars from betting a single dollar. Of course this is the long-term discounted return, with a discount of something like 0.9 or 0.99, I do not remember exactly, but it is still far too high, and crucially it is positive, which is wrong: the optimal policy is to leave the table, yet this algorithm estimates that it is much better to stay and keep betting. Conversely, if you use the double Q-learning algorithm from the previous slide, it estimates something very close to the actual value and also learns to leave the table. You can make it more extreme by paying the agent, say, 10 dollars to leave: Q-learning would still keep gambling, while double Q-learning very quickly takes the money and runs. This is a slightly contrived example, with only one state and deliberately large noise, just to make the point.

So you could ask whether this actually happens in practice. One thing we did: a version of Q-learning was used for the initial DQN results on Atari, which achieved scores good enough that people were quite excited. It turns out that if you plug double Q-learning into that same system, the whole agent with all its bells and whistles, replacing only the Q-learning update inside, you get the blue scores here compared to DQN's red scores. And, although not shown here, you can show that in many of the games there were clear, unrealistic overestimations of the values, so this does happen in practice. Interestingly, the Atari games used here are deterministic, so the overestimation did not come from noise in the environment: it came from approximation error and from the noise in the policy that generates the updates; the values are still approximate, and taking a max over them still causes overestimation. The differences here were much
bigger than I expected them to be, by the way. So why does it do worse on some games? A lot of this is just randomness, though there is perhaps a slightly more precise answer as well. Some of these results would move up and down a little if you reran them, and at the time these experiments took quite a lot of compute and time, so we could not run many repetitions. The other explanation is that the algorithms genuinely have different properties, and in some games the overestimation can actually be helpful in terms of the policy it makes you follow: it may steer you towards places that happen to be good in that specific game. So even though overestimation is bad in general, that does not mean it is always bad; perhaps that is what happens on, say, tennis. More generally, and we see this a lot in deep reinforcement learning, where deep networks are combined with reinforcement learning: when an algorithmic improvement is real, many, many games get better, and the games that get worse usually do so for game-specific reasons, because they happened to work well for reasons other than the algorithm being good, or because they exploited a flaw in the algorithm that happened to yield a better policy. So the way to read these plots is not to stare at the score on each individual game, but to look at the general trend.

[Question: why do the overestimations stick; why are they not unlearned?] They are unlearned, actually: in this tabular setting Q-learning is guaranteed to find the optimal action values. What I did not say is how quickly. The line that seems to be plateauing is still going down; this was run with a decaying step size, and eventually it would find the optimal solution. But if after this many updates it is still this far off, that is a problem in practice. So yes, Q-learning is theoretically sound and will find the optimal policy; there are simply ways to improve on it in practice. By the way, double Q-learning is guaranteed, under the same conditions as Q-learning, to find the optimal policy as well; here it just gets there much faster.

[Question: what is the convergence rate of Q-learning, and how can you reason about it?] There has definitely been work on this, mostly for the tabular case, where people have derived how quickly Q-learning converges. It depends on many things, as you might expect, but the most important are the behaviour policy, the structure of the MDP (whether it is well connected, so that you can easily cover all of it), and, very importantly, the step size you pick: because of the bootstrapping you cannot use the flat averaging we did for Monte Carlo, since the quantity we use to estimate the optimal value keeps changing, both through the values we bootstrap on and through the policy we follow, which is tied to the exploration. You can take all of this into account and derive rates at which these methods converge, and that is already
quite complex for the tabular case; for the deep RL case, basically nobody knows. Very good questions, by the way, and I am happy to give concrete pointers to papers to anyone interested in this sort of thing.

Okay, we can still quickly cover the final topic of this lecture, which is again in the context of off-policy learning: we want to learn about a policy different from the one we are following. I gave you one example of an algorithm that does this: Q-learning estimates the value of the current greedy policy while perhaps following something else. But we may want to do this more generally, and another way to think about what comes next is as a different route to deriving the Q-learning algorithm, one that also yields a more general family of algorithms.

Start with a very generic statement: we want to estimate the expectation of f(X), where X is random (it should probably have been a capital letter on the slide) and is sampled from some distribution d. We can write this out; you could turn it into an integral if you prefer, but here it is a sum over a finite number of elements x, each with probability d(x) of occurring, so the expectation is simply the sum of f(x) weighted by d(x). What we actually face is something else: we sample from a different distribution d′, and we still want to estimate the same expectation using samples from that other distribution. For this we can use importance sampling, which relates the two distributions. One way to see it is to multiply and divide by d′(x), and to note that the result is again an expectation, but now with respect to d′, with the ratio d(x)/d′(x) remaining inside; make sure you follow that step.

Now apply this directly to the reinforcement learning case; it is exactly the same, with familiar quantities plugged in. Say we are estimating the expected reward, just the one-step reward for simplicity, as in a bandit: we want its expectation under a certain target policy π. We can write that out as the sum over actions of the expected reward given the state and action (which you could compute from the reward distribution if you had the MDP), weighted by the probability π assigns to that action. We do the same trick: multiply and divide by the probability of the behaviour policy selecting that action, which again gives an expectation, now under the behaviour policy, of the reward multiplied by the probability of the action under the target policy divided by its probability under the behaviour policy. This means we can sample the expression at the bottom and obtain an unbiased estimate of the expression at the top. So that is what we do: we follow some behaviour b, sample the quantity (target probability divided by behaviour probability) times the reward, and that is an unbiased sample of the expected reward under the target policy.
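Written out in display form (this merely restates the two steps above; R is the one-step reward and A the sampled action, with everything conditioned on the state s):

```latex
\mathbb{E}_{X \sim d}[f(X)]
  = \sum_x d(x)\, f(x)
  = \sum_x d'(x)\, \frac{d(x)}{d'(x)}\, f(x)
  = \mathbb{E}_{X \sim d'}\!\left[\frac{d(X)}{d'(X)}\, f(X)\right],
\qquad\text{so}\qquad
\mathbb{E}_{\pi}\!\left[R \mid s\right]
  = \mathbb{E}_{b}\!\left[\frac{\pi(A \mid s)}{b(A \mid s)}\, R \;\middle|\; s\right].
```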
Now note, as a technicality, but an important one, that b of course can't be zero for this to be valid, which has an intuitive explanation: you can't learn about behavior that you never do. If you want to learn about a target policy that selects certain actions, you have to have at least a nonzero probability of selecting those actions yourself, and then everything is well-defined. A different way to say that is that the support of the behavior distribution needs to cover at least the support of the target distribution.

We can take this exact same idea and apply it to the sequential case. There is more that can be said about this, but I'm going to skip to the conclusion, which is that you have to multiply all of these ratios together for all of the actions that you took along the trajectory. Say we have a trajectory up to time T; then the product goes up to T as well. So this is for a full episode, we're doing Monte Carlo: we multiply together all of the ratios π(A_t|S_t)/b(A_t|S_t), and this turns out to weight the return exactly appropriately to make it an unbiased estimate of the return under your target policy. Then we can simply update towards that target using our standard Monte Carlo algorithm. There is no bias: this is an unbiased estimate of the true value of the target policy. However, it can dramatically increase variance.

What is a very simple way to see that? Assume that our target policy is deterministic, selecting a single fixed action in each state, and that our behavior policy is more random, exploring for instance. If your episodes are long, then one of these π terms is very likely to be zero, which means your whole return gets replaced with a zero. There will be a few episodes in which the return is nonzero, but the probability of selecting exactly all of those actions was smaller than one at every step. If your behavior is, for instance, epsilon-greedy, then the b in the denominator is at most 1 − ε, so in a product of many such terms we repeatedly divide by 1 − ε, a quantity less than one, and that is the best case; in some cases it will be ε, so we divide by ε, which inflates the product even more. That means the full product can become very large.

What is essentially happening is that we're down-weighting trajectories the target policy would never generate, all the way to zero in this case, and up-weighting the trajectories the target policy would generate, which may be very rare as full trajectories. We're re-weighting them in exactly such a way that on average we get the right answer, but if only a handful of episodes ever follow the full target policy, the estimate is going to be very noisy when episodes are long. Another way to say that: it has very high variance. Is that clear? So essentially what I'm saying is that bias is not the only thing to care about; even though this estimator is unbiased, it might not be ideal.

There is a now perhaps obvious thing you could do, which is to not use the full return. Instead, let's use TD. In that case, if we estimate state values, we only need to re-weight with a single importance-sampling ratio, because we only need to correct for the probability of selecting this one action under the behavior policy rather than the target policy. This has much lower variance, because we no longer have a long product of ratios.
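As a rough sketch of what these two off-policy state-value updates could look like in code (the function names, episode format, and the choice to weight the whole TD error by the single ratio are my own, not taken from the lecture): the Monte Carlo version multiplies one ratio per step of the episode, so a single zero or large ratio affects the whole return, while the TD version only ever uses the ratio for the one action just taken.

```python
def off_policy_mc_update(V, episode, pi, b, alpha=0.1, gamma=1.0):
    """Monte Carlo with full-trajectory importance sampling.

    episode: list of (state, action, reward) tuples for one finished episode.
    pi(a, s), b(a, s): action probabilities under the target / behavior policy.
    """
    G, rho = 0.0, 1.0
    for (s, a, r) in reversed(episode):
        G = r + gamma * G                 # discounted return from the start state
        rho *= pi(a, s) / b(a, s)         # one ratio per step; this product is what blows up the variance
    s0 = episode[0][0]
    V[s0] += alpha * (rho * G - V[s0])    # rho * G is unbiased for the return under pi


def off_policy_td_update(V, s, a, r, s_next, pi, b, alpha=0.1, gamma=1.0):
    """One-step TD with a single importance-sampling ratio: much lower variance."""
    rho = pi(a, s) / b(a, s)              # only the ratio for the action actually taken
    V[s] += alpha * rho * (r + gamma * V[s_next] - V[s])
```

In the deterministic-target example from the lecture, `rho` in the Monte Carlo update collapses to zero as soon as the behavior deviates once, whereas the TD update can still learn from every individual transition where the actions agree.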
Each of those ratios can be larger than one, but we only ever have one of them. It also means you can learn even if the policies are only similar for one step, rather than needing trajectories in which you just happen to pick the same action as the target policy on each and every step. So whenever there is a transition within an episode in which the actions match, you can already start learning from it. This is often a much more efficient algorithm.

Now, you can apply the same idea to action values, and it turns out to be even simpler, because for the one-step case we don't need to importance sample at all. The importance sampling is still worth wrapping your head around, because it comes back when you do multiple steps in some cases, but for one step we can simply let the behavior policy pick an action, and then, rather than ignoring the mismatch, we re-weight the action values in the next state using the target policy.

One way to understand this quantity: assume for a moment that Q is a good estimate of the action values of the target policy. If that were the case, then a weighted sum of those action values according to the target policy would also be a good estimate of the state value of that policy, and after the update Q would remain a good estimate for the target policy. So in some sense this is sound, and in fact you can show that it is exactly the right thing to do; that was just the more intuitive argument.

This algorithm is called expected sarsa, because you can also interpret it as the expected update that sarsa does, which was an on-policy TD algorithm, if you take the expectation, with respect to the target policy, over the action selected in the next state. Because we're already conditioning on the current state and action, sarsa and expected sarsa have the same noise in terms of the next state and the reward, but we don't need to correct for the policy at action A_t: we're already conditioned on that action and only ever update its value when we select it.

It turns out that expected sarsa has essentially the same bias as sarsa; the bias only comes from the fact that Q is an approximation, the typical TD bias. It's a fully on-policy algorithm as long as the target policy matches your behavior policy, but it's a little more general than that: if your behavior policy is different from your target policy, so b is not equal to π, we can call this generalized Q-learning. A special case is when the target policy is deterministic, so that it picks out a single action, and more specifically, if it's greedy, you get exactly Q-learning. So this generalizes, in some sense, both sarsa, which is the sampling-based version and fully on-policy as long as behavior and target are the same, and Q-learning, where we estimate a different policy, in this case the greedy one.

This slide just makes those steps towards Q-learning explicit: we pick the target policy to be greedy, and then the expected update, the expected sarsa update, simply becomes Q-learning, as I said on the previous slide. And again, with Q-learning you can use whatever behavior policy you want and it will still estimate the value of the target policy.
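Here is a minimal sketch of the expected sarsa update and its greedy special case (the function names and the tabular layout of Q are mine, not from the lecture):

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, target_probs, alpha=0.1, gamma=0.99):
    """Expected sarsa / 'generalized Q-learning'.

    Q: 2D array of action values indexed as Q[state, action].
    target_probs: vector of target-policy probabilities pi(. | s_next).
    The bootstrap term is the expectation of Q(s_next, .) under the target
    policy, so no importance-sampling ratio is needed for this one-step update.
    """
    expected_q = np.dot(target_probs, Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Special case: a greedy (deterministic) target policy puts all its weight
    on argmax_a Q(s_next, a), so the expectation collapses to the max."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```

If `target_probs` is the behavior policy itself, this is the on-policy expected sarsa described above; if it puts probability one on the greedy action, the two updates coincide.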
In this case the target policy is the greedy policy with respect to your current values, as discussed before. So now we started from importance sampling and got Q-learning back out, whereas previously we sampled a value-iteration update from dynamic programming and also got Q-learning. It's the same algorithm in both cases, but you can arrive at it in different ways: one route comes from the Monte Carlo side, the other from the dynamic programming side.

Okay, we have a little bit of time left for questions if people have them. Please feel free to ask, because if you do have a question you're very likely not the only one with that question. Before I end, I just wanted to mention that there will be a new reinforcement learning assignment, which will come out over the weekend, or Monday, I think it says Monday in the schedule, and then you'll have a couple of weeks to do all of these assignments, including the reading week. It will be a little while before we get back to the reinforcement learning lectures: next week there will be two deep learning lectures, then I think there is reading week, and the week after that we'll return and talk about policy gradients and deep RL and everything. Thank you.