RL Course by David Silver - Lecture 5 - Model Free Control

The Behavior Policy and its Role in Reinforcement Learning

In reinforcement learning, an agent interacts with an environment in order to maximize cumulative reward. The behavior policy is the policy the agent actually follows while gathering experience: it determines the action taken at each time step, and it can be a deterministic or a stochastic mapping from states to actions. In on-policy methods the behavior policy is also the policy being evaluated and improved; in off-policy methods, such as Q-learning, it is distinct from the target policy the agent is trying to learn.
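As a concrete illustration, a deterministic policy can be stored as a direct state-to-action mapping, while a stochastic policy stores a distribution over actions for each state. The following is a minimal sketch; the state and action names are hypothetical:

```python
import random

# Hypothetical toy states and actions, for illustration only.
ACTIONS = ["left", "right"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}

def act(policy, state):
    """Sample an action from either kind of policy."""
    choice = policy[state]
    if isinstance(choice, dict):  # stochastic case: sample from the distribution
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs)[0]
    return choice  # deterministic case
```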

The Behavior Policy and its Update

In most control algorithms the behavior policy is updated as learning proceeds, following the pattern of generalized policy iteration: the current policy is evaluated, for example with Monte Carlo or temporal-difference estimates of the action-value function Q, and the policy is then improved by acting greedily (or epsilon-greedily) with respect to those Q values. Working with Q(s, a) rather than the state-value function V(s) is what makes the improvement step model-free: picking a better action only requires a maximization over stored action values, not a one-step look-ahead through a model of the environment. Temporal-difference methods additionally use bootstrapping, meaning that each estimate is updated toward a target that itself contains the estimated value of the next state-action pair.
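A minimal sketch of the epsilon-greedy improvement step over a tabular Q function (the dictionary layout of Q and the value of epsilon are assumptions for illustration):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick a greedy action w.r.t. Q with probability 1 - epsilon, a random one otherwise.

    Q is assumed to be a dict mapping (state, action) pairs to value estimates.
    """
    if random.random() < epsilon:
        return random.choice(actions)  # explore: uniformly random action
    # exploit: the action with the highest current Q estimate defines the improved policy
    return max(actions, key=lambda a: Q[(state, a)])
```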

The Q-Learning Algorithm

One of the best-known reinforcement learning algorithms is Q-learning, an off-policy method. Its target policy is greedy with respect to the current action-value estimates, meaning it always chooses the action with the highest estimated return in the current state. The behavior policy, typically epsilon-greedy with respect to the same Q values, keeps exploring the environment so that all state-action pairs continue to be visited.

The Q-Learning Algorithm: A Special Case of Off-Policy Control

In Q-learning, both the behavior policy and the target policy are improved at every step. The target policy is made greedy with respect to the action-value function Q, so it chooses the action with the highest estimated return in each state. The behavior policy is made epsilon-greedy with respect to the same Q values, which allows continued exploration while still making progress towards the optimal policy.

The Q-Learning Update: SARSA-Max

The Q-learning update is sometimes called the SARSA-max update: the value of the current state-action pair is moved toward the observed reward plus the discounted maximum of the Q values over all possible actions in the successor state. At each step the agent selects an action with its behavior policy, observes the reward and the next state, and then bootstraps from the best action available in that next state rather than from the action it will actually take next.
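A minimal sketch of this update for a tabular Q stored as a dict keyed by (state, action); the step size alpha and discount gamma are assumed hyperparameters:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning (SARSA-max) backup for the transition (s, a, r, s_next)."""
    # Bootstrap from the best action in the successor state (the greedy target policy),
    # regardless of which action the behavior policy will take there.
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```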

Relationships between Dynamic Programming and Reinforcement Learning

Reinforcement learning algorithms can often be related to the dynamic programming algorithms used in optimal control. The Bellman expectation equation characterizes the value function of a fixed policy and is the basis of policy evaluation: in dynamic programming it is applied iteratively to compute the policy's value function, which is then used to improve the policy. The Bellman optimality equation, by contrast, characterizes the optimal value function directly.
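For reference, the Bellman expectation equation for the action-value function of a policy $\pi$ can be written as

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma \, Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right],$$

where the expectation is over the environment's one-step dynamics and the policy's choice of the next action.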

The Q Value Function and Dynamic Programming

The Bellman expectation equation for the Q value function can be used in two ways. With a model of the environment, dynamic programming applies the full expectation to evaluate the current policy; without a model, we instead sample transitions from the environment. Sampling this equation gives rise to temporal-difference (TD) control algorithms such as SARSA.
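A minimal sketch of the corresponding SARSA update, which bootstraps from the action the behavior policy actually chooses next (same assumed Q-table layout and hyperparameters as above):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA backup for the transition (S, A, R, S', A')."""
    # On-policy: the target uses the next action actually selected by the behavior policy.
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```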

Generalized Policy Iteration and Bellman Expectations

In generalized policy iteration we use the Bellman expectation equation, or sampled returns, to evaluate the Q values of the current policy, and we alternate that evaluation with epsilon-greedy policy improvement. Provided exploration is managed properly, for example with a GLIE schedule that decays epsilon towards zero, this alternation converges to the optimal policy.
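A minimal sketch of such a loop in the Monte Carlo case, using the GLIE schedule epsilon = 1/k. The env.reset()/env.step() interface is a hypothetical placeholder, assumed to return (next_state, reward, done):

```python
from collections import defaultdict
import random

def glie_mc_control(env, actions, num_episodes, gamma=1.0):
    """GLIE Monte Carlo control: evaluate with episode returns, improve epsilon-greedily."""
    Q = defaultdict(float)   # action-value estimates, keyed by (state, action)
    N = defaultdict(int)     # visit counts for incremental mean updates

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k    # GLIE schedule: exploration decays towards zero

        # Generate one episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Policy evaluation: update Q towards the observed returns (incremental mean).
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
        # Policy improvement is implicit: the next episode acts epsilon-greedily
        # with respect to the freshly updated Q values.
    return Q
```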

Bellman Optimality Equations and TD Learning

Each TD control algorithm can be viewed as a sampled version of a Bellman equation from dynamic programming: SARSA samples the Bellman expectation equation, while Q-learning samples the Bellman optimality equation, which relates the optimal value of a state-action pair to the rewards and optimal values of its successor states. Replacing the full expectations with samples drawn from experience yields algorithms, such as Q-learning, that converge to the optimal action-value function.
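For reference, the Bellman optimality equation for the action-value function is

$$Q^{*}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^{*}(S_{t+1}, a') \mid S_t = s, A_t = a \right],$$

and the Q-learning update shown earlier is a sampled, incremental version of this equation.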

Function Approximation: Scaling Up Reinforcement Learning

As reinforcement learning problems grow in size and complexity, tabular methods that store a separate value for every state-action pair no longer scale. One approach is to use function approximation, such as a linear combination of features or a neural network, to represent the value function or action-value function compactly and to generalize between similar states. These approximations make it possible to apply the methods in this lecture to much larger, more practical problems.
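A minimal sketch of one such approximator, assuming a hand-crafted feature function phi(s, a) that returns a NumPy vector; the action-value function is a linear function of the features and is trained with a sampled, SARSA-style gradient step:

```python
import numpy as np

def q_hat(w, phi, state, action):
    """Approximate action value as a linear function of features."""
    return np.dot(w, phi(state, action))

def linear_sarsa_update(w, phi, s, a, r, s_next, a_next, alpha=0.01, gamma=0.99):
    """One gradient step on the weight vector w using a sampled TD target."""
    td_target = r + gamma * q_hat(w, phi, s_next, a_next)
    td_error = td_target - q_hat(w, phi, s, a)
    # For a linear approximator, the gradient of q_hat with respect to w is phi(s, a).
    return w + alpha * td_error * phi(s, a)
```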

The relationship between dynamic programming and reinforcement learning algorithms is complex but fascinating. By understanding these relationships, we can create more efficient and effective algorithms for solving challenging problems in this field.

"WEBVTTKind: captionsLanguage: enin some sense you know everything in the course up to this point has been leading to this lecture okay we're going to finally find out how if you drop your robot or agent into some unknown environment and you don't tell it anything about how that environment Works how can it figure out the right thing to do how can it maximize its reward in that environment how can it figure out the actions that lead to the most possible future reward um without telling it anything in advance about how this environment Works um so that's the goal of this class um model free control and after we've seen this class in the subsequent lectures we'll look at how to scale up how to get to more realistic large scale domains but the the core techniques are essentially building on what we saw in the last lecture and bringing those into a control setting now so roughly the proceedings for today we'll have a brief introduction and then what we're going to look at is two different paradigms for doing control on policy and off policy um which we'll understand the distinction between those um essentially on policy means learning sort of on the job um whilst you're following the behavior that you're learning about and off policy means learning whilst following someone else's behavior um and so we'll start with on policy which is a simpler case and again just like last lecture we're going to follow the format of starting with Monte Carlo control before we move on to temporal difference learning to get more efficient lower variance techniques um so how are we going to do that well um last lecture we basically built up the toolkit that we're going to need um last lecture we saw how to do model free prediction how to evaluate a given policy if you drop your robot into a given mdp and you ask it how much reward will it get if it follows a particular Behavior last time we saw how to do that using Monti Carlo evaluation and and temporal difference learning um and that just helped us to estimate the value function of some unknown mdp and now that basic toolkit is what we're going to use to now do control methods where we want to not just estimate the value function but actually optimize it find V Star find qar find the best possible U Behavior find the most reward you can extract from this MVP but we're going use the same tools and we're just going to iterate them and essentially use what we had before kind of as an inner loop for getting to the best possible behavior um so just to get you back in the flavor I know we've had a um a week off reading week so so I think it's helpful just to get um a reminder of why this is interesting you know why should we care why should we care about this particular flavor why why this problem of unknown mdp reinforcement learning and the answer is that in most interesting problems and I put up just a few interesting problems here that that you know fall under this umbrella so you know controlling an elevator getting some car to park itself um trying to automatically steer some ship or control a bioreactor or um navigate a helicopter or do the logistics for an airplane or play RoboCop soccer or control a bot in Quake or manage a financial portfolio or learn how to fold proteins or figure out how to make a robot walk by itself or to play a complicated game um all of these problems um have mdps there are mdp descriptions of these problems in each case there's some complicated environment in which this problem can be described where there are some underlying 
Dynamics and some environment that has some state that um tells you how it's going to move around in the state but either um that thing is unknown to us we don't know what that environment is and so we have to resort to model free control to figure out how to do this um or the mdp might be known but it just might be so complicated um that it's actually too big to use except to sample it and if you can only sample it you're essentially back to the same case as model fre control where where you really the only way you can access this model is by just trying things trial an error and see what happens because you can never do even one step of integrating over some vastly complicated model might be too expensive um you know the the real world's an example where you know the the underlying dynamics of the real world are extremely complicated even if we knew them you wouldn't really want to work at the atomic level and you know integrate over the the Dynamics of every atom moving around you would just take samples see what happens and work with that so model free control helps us to solve problems like these so that's why this lecture is interesting I hope okay so we're going to look at this basic um dichotomy between on policy learning and off policy learning so let's just understand what that means now so on policy learning I like to call it learning on the job so it's like you've got some policy you follow that policy and while you're following that policy you learn about it um so basically you actually a sampling experience by sampling actions from some policy pi and at the same time you're trying to evaluate that same policy that's what we mean by on policy planning that the actions that you take um determine the policy that you're trying to evaluate but that's not the only choice we can also do off policy learning which you can kind of think of as looking over someone's shoulder so you might have um you know some robot which is watching some other robot walk around and it's trying to figure out the optimal Behavior by just observing some other robot or maybe even a human behave so so maybe a human gives some example demonstrations um maybe some teleoperated arm and the human says this is how you should move this arm around um and now just from the trajectories that have been sampled from the human behavior um we could actually do off policy learning and we could learn to evaluate um the robot could start to learn for itself well what would happen if I did something else you know maybe there's some better Behavior and the robot can learn about that better behavior from samples of experience drawn from say the human behavior you don't have to learn about the behavior that you sample you can learn about other Behavior that's called off policy learning okay so we'll start with a simpler case on policy learning um and the main idea that we're going to use throughout this course is the same idea that we saw um a couple of um Leo when we're doing dynamic programming um and it's the basic framework of generalized policy iteration so just a quick refresh the slide so the main idea in um policy iteration is to alternate between two different processes where first of all we evaluate a policy um and then we improve that policy and we have this iterative process which I kind of um sketched up on this diagram here where you start with some value function and some policy and every step every time you go up um you basically evaluate your current policy to get you a new value function and every time you go 
down you act say greedily um with respect to that value function just greedy and so this alternating process we saw when we were dynamic programming actually does Converge on the optimal policy and the optimal value function um and it's just this cycle that can go round and round and round and the question is well what can we slot in for these two processes and what we're going to do today is we're going to vary the choices of what slots into these two pieces here um so to enable us to do model free control rather than in dynamic programming where everything was using full knowledge of the of the mdp model um so so far we've only seen in the dynamic programming examples of these where we our up arrows were some way to evaluate the policy and we saw things like iterative policy evaluation and the down arrows were things like greedy policy Improvement and now we're going to vary those two components and try other choices um so so let's make a first try okay let's say you know we want to do the simplest case let's start with Monte Carlo learning um so we're doing basically um Monte Carlo control that's this section we're in that chapter of the um of the lectures um and and so we're going to start with Monti Carlo control later we'll move on to TD control now if we're doing Monti Carlo control is it possible to just slip that in can we just swap in Monte Carlo policy evaluation instead of our dynamic programming that would be a way to evaluate the policy so remember Monte Carlo evaluation is just you just execute some trajectories and take the mean and that's your value of your of of your value function so what about doing that is our process for estimating the value of the policy what would happen if we just ran some trajectories Ed those trajectories to estimate v um and then applied greedy policy Improvement would that work any thoughts any problems with this idea do people understand what what's the suggestion is that we could take this General policy iteration framework and plug in a model free step for policy evaluation to just run some trajectories in the unknown environment see how well we do use that to estimate the value of the policy and then iterate by improving our policy um evaluate the new policy using by running some more trajectories evaluate our new Policy Act greedy with respect to that run to more trajectories Etc um and according to our um our knowledge of generalized policy um iteration this has to converge on the optimal policy so what's wrong can anyone think there's actually two problems with this okay good so if you're if when you're running your trajectories you follow your your policy then you're not going to EXP okay great so that's one of the two problems um we'll actually come to that one second so um but the point was which is correct is that there's an exploration issue um which is that if you act greedily all the time you don't guarantee that the trajectories that you're following explore the entire State space and so it might be that there's parts of the state space which are really interesting and actually have better um potential that you just never see okay there's one more problem with this any thoughts so that's actually the problem with the second step any problems with this first step with evaluating V in this way yeah you need to run a whole load of Trials to get a good measure of what the best best policy would be any given okay so so the point was that maybe it's slow to do this it might be um you might need a lot of episodes to actually make this 
work that's true and we'll improve on that by using temporal difference learning later um but let me move on so so so the issue is that we're trying to be model free actually um we want to be model free but how can you be model free um when you're using the value function V if you use the state value function and you only have the state value function and you want to act greedy with respect to your state value function you still need a model of the mdp to figure out how to act greedily with respect to that value function you only know the value of each state and so if I'm in some State and I want to know well which action is best I have to imagine what would happen for each possible action I have to roll forward the Dynamics one step to look at the value function my successor State we don't know those Dynamics we're trying to be model free okay so if you work with the state value function V you always need a model in or in order to do your policy Improvement because you need to have some one step of look ahead to get to your next state so the alternative is to use Q the alternative is to use action value functions action value functions enable us to do control in a model free setting and the reason is that we can do policy Improvement in an entirely model free way like if we have q and we want to do um greedy policy Improvement all we need to do is maximize over our Q values so we've got the Q values the Q values tells us from each state how good is it to take each different action now if we want to make a better policy what do we do well we just pick the action that maximizes our Q values that's the best action it's immediate you don't need a model to roll anything forward it's already there in our in the structure that we've we've created so by caching the values of our actions we um kind of reduce the burden of what we require from our from our our model we don't need to look at the model anymore we don't need to roll it Forward is that clear okay so so let's let's drop that in so what would that look like now so the proposal then is that we have now an algorithm which looks like this um where we start at with our a q value function now an action value function and some policy and every step we're going to do Monti Monti Carlo policy evaluation run a bunch of episodes um at the end of those we've got some estimate of the of our policy of Q Pi so we run a bunch of episodes we take the mean um from each state action pair we look at the mean return to estimate qsa um and that tells us how good that particular State action pair is taking that action from that state what is the mean return that you got do that for all states and actions that gives us a new value function we then act greedily with respect to our Q that's immediate we just take this argmax over our Q values that gives us a new policy we run that policy for a bunch of more um episodes um take the mean of all our state action pairs again and iterate um and again this results in um qar and piar or does it and the issue has already been raised by um someone in the audience um which is we still have one issue which is that we're acting greedily as our policy Improvement step and if you act greedily you can actually get stuck you can actually not see the states which you need in order to get get a correct estimate of the um of the value function so we're basically estimating the value of all state action pairs by Tri and error by running episodes but if we don't see certain States and actions then we won't evaluate them correctly 
and we won't select them because we haven't see how good they are um so that's the distinction now that we're running things we're not doing these full sweeps that we were seeing in dynamic programming U we're running things by actually interacting with the environment so we need to make sure that we continue to see everything so here's an example to make that a bit clearer okay um so this is something from the academic world where you can imagine there's some situation in life where you've got two doors and you're trying to pick the best door so in this cartoon one of them behind that door is tenure or one of them has tenure and one of them is flipping burgers at McDonald's um so so now we have to basically figure out by trial and error which door to go through so let's see say you get repeated trials and let's say the outcome of these doors is actually stochastic okay so you're just going to get some immediate reward this is called a the Bandit problem we'll see more examples of this in the subsequent lecture when we focus really on exploration but it's a bandit problem you've got two choices um door door one or door two um so what happens now if you open the left door um and you see a reward of zero okay so you get this reward of zero we going through the left door um so now think okay zero wasn't so great maybe I'll try the right door so you open the right door and you get a reward of plus one so if you're acting greedily there's only one logical thing to do which is to open the right door again let's say it gets a um a reward of plus three um so now your mean um after these steps you've seen 1 plus 1 1 plus three so your Monte Carlo estimate of the value of that door is now plus two the mean of those two um episodes that you've seen so plus two is clearly better than zero so if you're acting greedy you would pick it again um maybe you get a reward of um plus two again um and now you've got a mean of two um so you'll open it again um and then maybe your mean is just going to fluctuate a little bit around two let's say you always get a reward between say one and three um you'll keep on opening this door forever and the problem is that actually we don't really know what the value of this door is we've only tried it once and so you need to carry on exploring everything to make sure that you understand the value the true value of all of your options if you stop exploring certain actions you can end up making incorrect decisions getting stuck in some local um incorrect Optimum your beliefs can be incorrect and you need to carry on exploring to make sure that you really understand the values correctly okay so to people see how you can get stuck in this situation and just um keep on Mis evaluating so how do we do that how do we make sure that we carry on um well there's going to be a whole lecture on this but we're going to start with the simplest possible way to guarantee um that you uh Vis continue to visit all states and all actions Ely often um and so this is the simplest idea for ensuring continual exploration it's actually surprisingly hard to do better than this simple idea um although there are lots of more very U much more sophisticated algorithms and the simplest idea is what's called Epsilon greedy exploration and it's very simple all it says is that you flip a coin um with probability Epsilon um you choose a random action completely random a uniform random action with probability 1 minus Epsilon you choose the greedy action okay so you flip a coin if it comes up heads you act 
completely randomly to make sure you see everything if it comes up taals you choose the best thing that you you you know so far you take your your AR Max over your Q values so it's just like our greedy um policy Improvement but we've softened it a little bit to make sure that with some small probability you try all the other actions um so this might seem naive but it has some nice properties it guarantees that you continue to explore everything and it guarantees that you actually improve your policy as well as we'll see shortly so this is just a fancy way of saying precisely that you know this is saying the probability is if you flip the coin um so this is this Epsilon over m is the probability of what happens if it comes up tals and you you explore um and this one minus Epsilon um is the probability of actually if he comes up heads pick the greedy thing but you might also pick the greedy action just by random as well if if this tail thing comes up which why I have a bit more probability Mass on this one but the idea is very simple you just either take the best action or you explore it random okay so one of the reasons that Epson greedy is is a nice idea is that it actually guarantees that we get a step of policy Improvement so remember for this generalized policy iteration idea we want to alternate between steps of improving our policy and steps of evaluating our policy and so what we can see is that Epsilon greedy actually is a policy Improvement like you you start off with one Epsilon greedy policy and now the question is if you evaluate that policy so you start off with pi that's your previous Epsilon greedy policy if you evaluate that to get your V Pi the question is can we be sure that by taking Epsilon greedy step like by acting Epsilon greedily with respect to V Pi can we be sure that we've really made a better policy um and the answer is yes um so a simple proof here which basically says that um if you look at so we're just going to do the same idea we saw in the dynamic programming lecture where we're just going to prove this over one step and then we're going to argue that that whole thing telescopes by unrolling it using the B equation so over one step we can see that the value of our normal of our previous policy of our original Exon greedy policy if we take one step of our new policy so we want to show so that this thing here taking one step by new policy is basically um better than than our original policy the new value of our original policy so this is just saying well this is just an expectation um so this is our Epsilon greedy policy um we're going to say for one step there's some probability that we're going to take each action um multiply by the value of that action this is just unrolling the expectation and we can split that into U the probability of taking the greedy action um plus the probability of taking all other every action so this is what happens if you choose to explore and this is what happens if you choose to act greedily um and now what we can say is that well this thing of where you choose to act greedily if you act greedy if you choose the max the max of all your Q values has to be um at least as as good as any weighted sum of all of your actions right the max is better than any weighted sum of your of your actions of your Q values um so we're going to choose one particular waiting that sums to one here um and we're just going to say that the Max has to be better than that weighted sum of your Q values and now you can just collect terms together and we 
see that we've got 1us Epsilon on oneus Epsilon so this term over here now just cancels with this term over here um and you're just left with an expectation over your Q values which is your original value function okay you can look at this afterwards but the main thing I want you to take away is this idea that Epsilon greedy very simple idea but it really does guarantee that you're making progress so you really do make progress you start off with some Epson greedy policy um you evaluate it you make a new Epson greedy policy and it's definitely better than what you started with or at least as good as what you started with okay and then for the final to telescope it um refer back to um the dynamic programming lecture that explained how how this telescoping argument works it's the same idea okay so that's the one slide of math go on question this um proof you've just shown us doesn't really tell us anything about the rate exporation really because I think if you did like a um completely greedy policy then that would actually you know prove that's better than anything not Tak into account so I'm not sure I the question so the question can we say anything about how how frequently you need to like what like the how Epsilon effects is it telling us anything about um in our real problem when we execute our real problem how much we should be exploring um no this doesn't really in itself tell us anything about how rapidly we should be exploring so um but in general um we'll see a little bit about that in a second that you need to choose a schedule for your Epsilon and you need to make sure that that decays to zero roughly but we'll see um so so let's let's see how to plug that in now so let's let's we've got we've got two ideas now we started with this generalized policy iteration framework and we basically we've changed each of these steps now we have these two processes so for the Pol policy evaluation we're now plugging in Monte Carlo as our as our way to evaluate our policies using Q using the action values so we've got this one process here where we just run some episodes using our current policy um we estimate the value of all states all actions from all states um just from what we've seen so for every state action pair we look at the mean return we've seen that's our evaluation procedure um as someone's pointed out that might be a little bit inefficient and we'll see how to improve that shortly but now we've also changed our policy Improvement procedure to use this Epsilon greedy policy Improvement where we basically have this soft uh this softening of the greedy idea so every time we go down we're not getting to the completely greedy policy we've got a little bit of exploration left and so the policy is always to St Astic along here now and that stochasticity in the policy ensures that we at some rate at least explore everything in the environment now as pointed out by the question there um the rate at which you actually see it might be very very slow still there might be some um states that are way off down some strange trajectory and you have to explore and explore again and explore again to be able to even see those States um so this doesn't actually give you very strong guarantees of how long it takes to explore all of those States it just says ASM totically at least this thing this thing now really will um finds the optimal policy at the end which should say Q star this should all be Q rather than V okay so let's try and make this a little bit more efficient so the first thing to noce 
um again we saw in the dynamic programming lecture that it's not necessary in these kind of policy iteration Frameworks it's not necessary to go all the way um to the top of this line every time it's not necessary to fully evaluate your policy um sometimes you can just spend a few steps to to evaluate your policy and you've already got enough information there to guide you to a much better policy without wasting many many more iterations on Gathering more information um so what would that look like in the context of Monty Carlo well let's take it to its extreme and say why not do this every single episode so we're going to run one episode going to make the robot do one episode collect all the steps along that episode update the Q values just for those steps so basically get one return and so for every state and action that we've taken along that return we're going to update the mean value just of those visited States and tried actions along that episode so one episode one sequence of updates to our returns um and then improve our policy straight away why wait why wait to get more episodes of information when you can already improve the policy so the idea is always to act greedily with respect to the freshest most recent estimate of the value function like if after one episode you can already update the value function to something slightly better then why continue using your old estimate of the value function to generate Behavior you may as well use your new updated value estimates to continue generating behavior and so we basically we Chang the sort of um the the time the rate at which we um operate in this diagram becomes more frequent the frequency increases we don't go as high up we just run one episode um and then we immediately improve the policy one episode immediately improved the policy run one more episode improve the policy is that clear okay so this question's already come up and it's a natural question which is how can we really guarantee um that we find the best possible policy like what we really desire is pie star we really want to know the best possible behavior in this environment um so to do that we have to kind of balance two different things we need to make sure that we continue exploring to make sure that we don't exclude things which are better but we also want to make sure that asymptotically that we get to a policy where we're not exploring at all anymore because we want the best possible policy and the best possible policy will not include this random behavior that is extremely unlikely to be optimal um so how do we balance those two factors um and so one idea for balancing those two factors is called this um Glee idea greedy in the limit with infinite exporation and the idea of Glee is basically to come up with any um schedule if you like for for exploration such that two conditions are met the first condition is that you continue to explore everything that you basically make sure that all states and all actions are visited infinitely often um so that's like guaranteeing that you just um that you never miss out on anything that that just to make sure that every little part of the state space will be seen um every action from every part of the state space will be tried um so for example Epsilon greedy has that property that eventually if you behave randomly and you you try all possible actions then every um every reachable state in the state space and every action from those reachable States will be tried okay um and the second property is that we want to make sure 
that the policy eventually becomes greedy and needs to become greedy because we need this thing to satisfy the Bellman equation and the Bellman equation is basically the bman optimality equation basically requires a Max in there not just some soft thing and we want to make sure that eventually we're acting greedly with respect to Q we're really taking the max of the arax of our Q values eventually and so one way to achieve this by no means the best way but but certainly one simple idea um is to choose an Epsilon greedy policy and just Decay Epsilon slowly toward zero um so for example if you choose a hyperbolic um schedule so each episode you basically on the Ki episode you set your Epsilon to say one over k um that satisfies these two properties that you will see everything infinitely often um but asymptotically you become closer and closer to to Optimal closer and closer to the greedy policy okay so that's one way to balance things so what does that look like in practice um well we can make an algorithm now um it's called this Glee Monte Carlo control um so this algorithm basically we start off by sampling episodes um so again we've just got our robot we throw it down we run one episode um that generates a trajectory of States actions rewards State action reward until we get to some terminal State um we sampled that from our current policy pi and then for each state in action that I visited so I got to this state and I picked this particular action for each of those States and actions we update um our action value and the way that we do that is just by counting how many times we've seen that state action pair um and doing an incremental update to the mean so this is just our incremental update to the mean which says that if we consider that particular State action pair this one over here this state and this action um how many times have I tried that um well let's just adjust um our previous me which was this we're going to update a little bit in the direction of the return that we just got and the amount that we have to adjust it to get a correct mean estimate is this one over n now what's interesting about this is we're not taking a mean um in the way that you think of a it's not like a statistical mean um over um some IID quantity now that the policy is actually changing over time we're iterating over our policy we're improving our policy but we're basically taking return returns from better and better policies into account so as we improve our policy we're Gathering more and more statistics our policy starts to get better and we collect all of those statistics together to get us one overall mean of how good it is and the Glee property basically ensures that over time that these statistics that we're collecting really sort of converge on getting the mean return for um this these policies sort of get more and more like the greedy policy and so we basically find out more and more information about the policy that we really care about that the past kind of gets dominated eventually by um by these new policies um and so what we're going to do now is we're going to iterate over this whole process so we're going to sample the Casi episode in this way um update our Q values and then improve our policy so this was the policy evaluation step now the policy Improvement step just looks like this where we can for example set Epsilon to its new value one K and now we're just going to act Epsilon greedily with respect to these new Q values and those Q values are going to be the same everywhere apart 
from the states and actions that I just visited so we only update those Q values along one trajectory but that's already enough to change the behavior that we we actually see there so in practice if you're actually writing this you know you don't need to explicitly store Pi you just store q but when you actually come to select actions you always just look at the freshest most recent q and you flip a coin and you either act um you either pick the maximum Q or you pick a random action um so this Pi is kind of implicit and all you need to remember is your one overk schedule that you're using to determine the bias of this coin that you're using and this algorithm actually finds the optimal action value function so this is our this is really you know this is our first full solution this is something which you throw it into any mdp and you just let loose with this thing um and it will find the right solution and it's considerably more efficient um to run it in this way um where you actually update the Q values every single episode you update your Q values and then immediately improve your policy rather than having to generate a whole batch of episodes thousands of episodes to just get one evaluation of que so this actually works much better in practice okay are people clear about this algorithm like the idea it's a good time to ask questions if you're not because we're just going to keep on adding to these ideas okay good I'll take silence to mean comprehension um okay question and I can't remember if you addressed this in an earlier lecture but is there any particular um idea behind the initialization of the values for Q or do you do random initialization okay so the question is does it matter how you initialize Q um and in terms of the theory you can start with anything it will still converge in practice of course it helps to have a good estimate of Q the closer you are to the optimal Q values the faster this thing will will um will converge um it's a bootstrapping uh sorry it's not bootstrapping like with TD but still you um would basically starting from these Q values and we're we're updating incrementally from um ah actually sorry sorry that's not true with this particular schedule let me let me back off that we'll see that for later versions where we have an incremental update um so for this particular algorithm it makes no difference because we're just taking the mean um and and so the very first time you see a new state action pair you will replace any Q value that you started with completely um because n will be one you completely replace your old Q value with the with that return that you saw right so for this particular algorithm it makes absolutely no difference what your initial Q values are in practice um and in the sub the rest of this lecture we'll see algorithm where you don't have a like this perfect mean instead you have some non-stationary estimator that has something like a constant Alpha here and for those algorithms it matters much more what Q values start with because you're incrementally moving on from that initial start value and then your start value the closer you are to Optimal the better so yeah sorry correct myself okay question then I'll move on you said it's more efficient to upate for every episode yes but um if you do it for many epis you to okay so so that's a great point so so I guess yeah I'm just focusing on serial computation for now um there are many possibilities for exploiting parallel computation and you're right that you would want to that parallel 
computation introduces constraints on what you can and can't do and um and you may want to it's almost I think the same is roughly still true with parallel computation that you still want to use the freshest possible Q values that you can um but with parallel computation um those might necessarily be a little more stale than the very latest episode to come in because you simply don't have access immediately to all of that parallel computation okay so let's just do a quick example so we had this multical example in a previous lecture um in um last lecture we just looked at evaluation we basically said okay if I send you into a casino and I tell you that you have to um stick um and let um basically unless you're always going to ask for a new card unless you're on 19 or 20 or 21 I think was the policy we had before um and then I want to know how much money would you win or lose in that casino so that's what we looked at before and we had a particular plot of the value function um and it had a particular shape to it and now what we're doing is we're actually running this um algorithm we saw on the previous slide um to generate the optimal value function so now this is the question you probably really want to know which is you know if I walk into a casino what policy should I play to get the maximum possible amount of of of reward um and again real casinos use slightly different rules don't try this um but I'm not responsible if you lose all your money um and and so this basically tells you um now we've run our Monte Carlo we've iterated through several process is where each episode we update our value function a little bit we actually store Q internally but just to make the plots um easy to look at we're now basically summing over all of the actions to estimate varar from from our Q so you basically have some Q um that generates some Epsilon greedy U policy and then you alternate backwards and fors by running these episodes just like we saw in the last algorithm and you end up with a policy which looks something like this U which says that if you don't have an ace in your hand um you should actually have this quite interesting looking policy is the optimal policy to follow where depending on what the dealer has here and what cards you've got you know it's this threshold uh somewhere between like 16 and 17 where if you've got 16 or less it's better to ask for a new card um but there's some situations where a a dealer has some kind of um quite difficult card to play with where there's a high probability the dealer will go bust in which case you should probably just stick anyway and just wait for the dealer to go bust Okay so you kind of vary your policy according to that if you've got an ace in your hand um of course you can afford to to ask for more cards more often because that Ace can either be a one or 11 and then this is what the value function looks like so it's just a continuation of that last example to show you that you know this this process really does end up with usable information in the sense of you know you get these nice um shapes um out of these value functions that really characterize exactly the optimal behavior in this in this domain and you really end up with something which which is the optimal policy there's no way to do better than this um in this particular game that we've described this is the optimal way to behave okay why why would you ever stick at 11 um you can't go bust you wouldn't it always says to to hit on the 11 when dealer is showing between four and six so 
this line here says if you're on 11 you always hit this here so the line is between 11 and 12 so so it's like a um a grid and each cell of this grid is indicating what to do so this shaded this this could be shaded in if you like sh you for 11 you should always um you should always hit but if you've got 12 and the dealer has a bad card there's some chance the dealer might go bust and so you should just stick on that 12 sometimes okay so we're now going to move on to Temple difference learning methods so just like the last lecture we kind of split things between Monte Carlo learning and then we moved on to TD learning and then we saw that there was a spectrum in between these two things we're going to follow the same process now but with control methods rather than evaluation methods we're really going to try and understand how can we actually gain efficiency by bootstrapping by using the same tricks that we saw last time um and gain a lot of the efficiency um we want to basically use TD why because it can be lower variance um because it's can be uh Run online um in continuing domains even if there's no ter you can still run this it can be run from incomplete sequences and what we'll see actually in this lecture is that there's an additional benefit using TD which is when you use off policy learning it becomes particularly important to use TV um and so the natural idea well let's try the obvious thing which is use the same generalized policy iteration strategy um but we we know that to do model free policy iteration we need to use q u we need to use Q so that we don't have to do any look ahead in our um to do our greedy or Epsilon greedy policy approvement and the idea is basically to use um now TD learning to estimate Q so let's just slot in TD learning um in place of Monte Carlo policy evaluation and continue to use Epsilon greedy policy Improvement and that's immediately going to give us an algorithm um that we can apply U which is probably the best known algorithm in in reinforcement learning um and the only difference we're going to do is because we're TD is can operate at this higher time scale so TD learning is something where you don't need to wait until the end of the episode to update your value function in fact you can update your value function after just one step that's one of the advantages it's online like you see one step of data you bootstrap you update your value function immediately and so again if we follow this General principle of always using the freshest most recent value function to to um pick your actions what we're going to do is we're going to increase the frequency of our policy Improvement to be every single time step we're going to improve our policy okay so the general idea is called Sasa um so why is it called Sasa well this diagram should illustrate it we're basically starting off uh in some State action pair that's this black dot a decision node we're basically choosing an action we're going to randomly sample up policy um we're going to see a reward R we going end up in some new state S Prime and we're going to our policy again to generate a prime so there's two steps of um of sampling sorry we're not sampling our policy yet so we start off with State here um we're asking a question about this state s and this particular action a we're going to sample from the environment to see what reward we end up in and what state we end up in here um and then we sample our own policy at that next state so we end up with s r s prime a prime or Sasa algorithm okay so s 
Sasa basically indicates a particular update pattern that we're going to use based on Sasa here um and so what do we do we're basically going to start with our Q values we're going to move our Q value a little bit in the direction of our TD Target reward plus our discounted Q value of the next State minus the Q value of where we started so again it's like I was in some State and I'm considering taking some action and what I want to know is um if I actually take that action look at the reward I got and then the value of the next action which I would take that gives me an estimate of the value of this policy and I'm going to use that estimate to update the value of the state action pair I started in okay that's the general idea this comes from the Bellman equation for Q U we've seen this hopefully before becoming familiar with these type of ideas so that's a sassa update okay um so now we're going to take these Sasa updates and we're going to plug them in to our generalized policy iteration framework so every single time step we're going to move up and down in this diagram every single time step so each time we take a step we're going to update our value function by applying one step of Sasa so for the state and action that we were just in we're only updating the value function for that one state action pair of that time step by applying this update so I was in this state action pair I ended up in some new state action pair I'm going do One update to my Q value for that single entry of my table um and now I've already changed my value function like something's different if I end up back in that state action pair again which I might in some loopy environment I want to make sure that I use the latest most interesting um most upto-date best information um that I've got from my my policy evaluation so every single time step we're going to to improve our policy as well by acting Epsilon greedily with respect to the latest value function so again what does this mean in practice it means that you know you're just thring Q in your memory you've just got this big table of of Q values for all states and all actions and every step when you actually come to take an action you just flip a coin you look at your Q values um and if it comes up heads then you look at your Q values and pick the best one if it comes up Tales you explore randomly um so the policy is implicitly represented by your Q values and every single step we're going to we're going to evaluate the latest action I took and update my policy just for that one state action pair okay so we really got this very rapid process of evaluating basa and improving the policy okay let's try and get to know this a bit better so what would an algorithm look like so I think it's actually useful in this case to just maybe step through some pseudo code it's really straightforward um so you can arbitrarily initialize this lookup table Q is just a lookup table in subsequent next lecture actually we'll see how to generalize Beyond these naive lookup table representations but for now it's just a lookup table every state every action has its own entry in this table we initialize this arbitrarily just say zero okay and now we're just going to repeat and every single step this outter Loop is just over episodes but really you know the inner loop is just every single step we're going to um take an action observe the reward observe the next state that we end up in um we're going to choose our action using our Epsilon greedy policy um and we're just going to apply this 
um Sasa update to that one step so update a little bit in the direction of my TD Target the reward plus the discounted value at the next data action pair um and then repeat so the next time we come around we'll already have a different policy because we're acting Epsilon greedily with the respect to our Q values if I end up back in that same state again for example in a loop I will already behave differently using the latest and freshest information stored in que okay yeah question s Isen ACC to as well s Dash is chosen Accord by the environment s Dash is the next state um so States sorry yes a so a prime um so the question was is a prime selected according to the same policy um so so yeah so a prime is selected using the your current policy and then you basically remember that and now when you come in you remember what your last a prime was that becomes your a at the next step you just kind of um remember your last a prime and that becomes your new a how does that the that you're choosing that I mean there's no reward coming from that last right the reward the that you're using there is from yeah um so the question is you know why does this even make sense as an as an update I guess um and so let me try and give some intuition and I think the right intuition to have is comes from from the Bellman equation that really what we're trying to do this is just a sample of the bman equation this this right hand side that we're moving towards this this thing here is Target um is a sample of the bman equation we're basically saying what we want to do is we want to consider an expectation over everything which happens in the environment over one step um which basically is the reward um plus whatever state you end up in next so that gives you your expectation of the environment and then this um and now we want to know what would happen under our own policy we want to know what would happen in under our own policy after that point we want the expectation under our own policy from that state onwards and that expectation under our own policy is precisely what we're trying to evaluate with Q so that's why it has to be evaluated we have to select actions using the policy that we're trying to evaluate this is an on policy algorithm fundamentally on policy uh we're selecting actions and we're evaluating that policy so Sasa is an on policy algorithm okay so you should be wondering well does this work um and just like Glee uh Monte Carlo um this version of Sasa actually will converge to the optimal policy and the only thing we require is again Glee policy so again you could choose your Epsilon greedy with this decaying schedule that would be a valid choice just to make sure that you continue to explore everything and that you eventually end up greedy um but you also need to be careful about the step size that you use so you have to make sure that your step size if you really want convergence you need to look at stochastic approximation Theory and you need to make sure that your step sizes basically follow these two conditions um and all this tells you is that the step sizes are sufficiently large that you can move your Q value as far as you want so you might need to move your Q value from whatever your initial estimate was to some very very large or very very low value if you're wrong and this thing basically tells you that um eventually the changes to your Q values become smaller and smaller and smaller the changes to your Q values have to vanish and become zero eventually otherwise you'll still have noise 
and jumping around in your policy so that's just St castic approximation Theory but really when you do all that if you do all the Machinery correctly this really converges in practice we don't worry about this and we sometimes don't worry about this either and Sasa typically Works anyway that's an empirical result um but this is the theory okay so let's have an example so this example actually is doing precisely that is throwing out these this Theory and just using a fixed step size um and a fixed exploration rate let's see what happens um so this is called a windy grid World um we're just trying to wander around from this start point to this goal um we can take um any of these kings move operations um so all the diagonal moves as well as NorthEast Southwest U and the only special thing about this grid world is that every time we take a step the wind pushes up up pushes upwards and pushes the agent upwards some number of grid cells which are indicated by this number at the bottom of the column there's no wind over these first three columns then you get pushed up one step when you move here two steps when you move up here one step here and zero here um so the optimal behavior is basically you has to take account of the fact you want to make as much progress as you can when there's no wind and and uh I think you end up kind of taking this route we'll see on the next slide something like that as the optimal for heer you have to come back around to get back to the goal would you want to just go down and then be blown up to the goal um that's not optimal um if you try to go down then what happens is um you end up being blown up one blown up one again blown up one again blown up two you end up overshooting the goal okay so um so you can try to do that but you don't hit the goal so you're better to get pushed up against the ceiling um come back around from the ceiling after that point and come back back to the goal again um but if you can go diagonally downwards diag to the right then those first three you can just go hor the one sorry the where you're blowing up one um so maybe this is a version that just uses the standard moves then I'd have to check that I think this might just use standard moves and that that wind blows you into a diagonal Direction you're right that would be two in that case so here is the diagram op policy so this plot basically shows what happens if we use the naive version of Sasa so it's just something to show you that you know this is the policy that it finds which I claim is the optimal policy you can check that offline um and and this plot here basically shows the the learning curve for running Sasa um so this is exactly the algorithm we just looked at um and what this is showing is actually um the number of time steps so this axis here is literally time step by time step and what we're looking at is how many episodes were actually completed in those number of time steps so you can kind of see here that the first episode took 2,000 time steps to complete it started off with no knowledge about how to do this and so it's basically just doing a random walk and that random walk is biased upwards so it kind of it's not even as good as a random walk so it takes a long time to actually stumble on the goal um but when it does and this is really typical of reinforcement learning and and these generalized policy iteration algorithms when it does the next very next episode it does much much faster and you get these exponential increases because once it learns some knowledge it 
But when it does — and this is really typical of reinforcement learning and these generalized policy iteration algorithms — the very next episode is much, much faster, and you get this exponential increase, because once it has learned some knowledge it can bootstrap from that knowledge and do better and better as it starts to learn things. It immediately learns faster and starts completing episodes at a much higher rate. Up here you can essentially look at the gradient of the curve, and the gradient tells you the path length it's taking — I think the optimal path is about 14 steps — so most of the time it's behaving close to optimally. It won't be completely optimal, because we're not decaying epsilon, so it still throws in the odd exploratory action, which is why the curve has that slightly bumpy look.

OK, so now we're going to take the step we took in the last class: we'll consider the spectrum between Monte Carlo and TD learning, using eligibility traces and λ variants of these algorithms, to see if we can again get the best of both worlds — get the unbiased behaviour you get from Monte Carlo, and also have a knob that controls the bias-variance trade-off. The way we'll do that is to start by considering n-step returns again. The one-step return looks at one step of real reward and then our Q value after one step; we could also use the two-step return, looking at two steps of reward and then our Q value after two steps; or we could do that for any number of steps, all the way to infinity, which is essentially Monte Carlo, where we look at the rewards all the way to the end of the episode and never bootstrap from our value function. So we define the n-step Q-return as the generalization of this to any n, and n-step SARSA just does the obvious thing: for each state-action pair, instead of updating towards the one-step estimate of our value function, we use the n-step return as our target. We take our estimate of the value of taking action a in state s and update it a little bit in the direction of the n-step target — the same incremental, non-stationary update as before — and this n-step target is essentially saying: I'm going to look at one, two, three steps of reward and then bootstrap from my Q value, which represents the value of the remainder of the trajectory, and use that as my estimate of the overall value of the trajectory from the state I was in and the action I took. We can do this for any n, and, just like in the last lecture, we'd like algorithms which are robust to the choice of n and which average over many different n — algorithms which get the best of all possible n.
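Written out, the n-step Q-return and the corresponding n-step SARSA update described above take the standard form (subscripts follow the usual convention; this is a reconstruction, not a quote from the slide):

```latex
q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n})

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{(n)} - Q(S_t, A_t) \right)
```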
Question: just as a reminder — the Q function represents the estimate of the whole trajectory from s using action a, right, not just one step? Yes, that's right. Let me be clear about that: all value functions, including V(s), are long-term estimates of the amount of reward you get. The definition of Q is the expected total reward you get if you start in state S_t, take action A_t, and then follow your policy for all subsequent actions. Even when we talk about these n-step returns, that's just our target — it's still an estimate of the complete return — but we're estimating it by looking at n steps of immediate reward plus something that summarizes the remaining reward all the way out to infinity, or to the end of the episode, all discounted appropriately at each step. OK, is everyone good with that?

Right, so let's make a λ version of this algorithm. This algorithm is called SARSA(λ), a very well-known algorithm. First let's consider just the update, before we talk about the full algorithm. The idea is to average over all of our n-step returns. We consider starting in a state-action pair: the environment pushes me to a state, I take some action, and I estimate my Q value there — and we weight that particular one-step return by (1 − λ). We also consider: starting in the same state-action pair, the environment picks a state, I pick an action, the environment picks a state, I pick an action, and we add up the rewards along the way plus the Q value at the end — and we weight that by (1 − λ)λ. Each additional step multiplies in another factor of λ, and the (1 − λ) is just there to normalize the whole sum to one. So we get a weighted sum of all our n-step returns, where the shorter n are weighted more and we progressively discount by another factor of λ as n increases. The main idea is to make this the average return: we had these n-step returns, and now we average over all n by weighting each n-step return by a factor of λ^(n−1), with the normalizing factor out front. This λ-return is just a way of averaging over all of the n-step returns that is computationally convenient and takes account of all different n, and it has a λ parameter we can tweak to make ourselves more or less far-sighted — to prefer the large n or the short n. Now we can plug this straight into our update: I start in a state-action pair, I consider a weighted sum that takes account of my reward plus my Q value after one step, my rewards plus my Q value after two steps, and so on, all the way up to n = ∞; I weigh all of those together to get q^λ, make that my target, and update my original Q value a little bit in the direction of that target. That's the forward-view idea of SARSA(λ).
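The λ-return and the forward-view SARSA(λ) update just described can be written, in the standard notation, as:

```latex
q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, q_t^{(n)}

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{\lambda} - Q(S_t, A_t) \right)
```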
We'll see an illustration in a couple of slides, but before that I want to spend one slide on the fact that this was the forward view. If you remember from the last lecture, there was a forward view and a backward view of TD(λ), and with eligibility traces we can come up with a backward view here too. So what's the problem with the approach so far? It's great in that it gives us a spectrum between Monte Carlo and TD algorithms: we now have a variant of SARSA that can look all the way out into the future — if we set λ to one it turns into Monte Carlo learning, if we set λ to zero it turns into the familiar SARSA we saw in the previous slides — so we have a control that lets us choose the bias-variance trade-off in between and average over all of these returns. The only problem is that we're looking forward in time: this isn't an online algorithm. It's not something where we can take one step, update our Q value, and immediately improve our policy, because we have to wait until the end of the episode to compute q^λ. We'd like to run things online, get the freshest possible updates immediately, and get the maximum amount of policy improvement by making an adjustment every single step — we don't want to have to wait until the end of the episode, and there might not even be an end; this might go on forever.

So, just like in the last class, we'll come up with an equivalent by building up eligibility traces, such that when we sum over those eligibility traces we end up with an update that is exactly equivalent to the forward view. The idea is very similar to the last class, except the eligibility trace is now over all state-action pairs: we have a new table, the eligibility of each state-action pair. Think of it as the thing telling you how much credit or blame to assign to every action you took from every state. You get to the end of your episode and receive a big carrot — which of the states and actions are responsible for that carrot? Your eligibility trace is your best estimate of who was responsible for receiving it. Again, we say that the states and actions you took most recently before the carrot, and the ones you took most frequently along the trajectory, are the ones that should be credited or blamed the most for the positive or negative reward you receive at the end. We can do this online: every time we actually visit a state-action pair — there's an indicator saying I'm actually in that state s and take that action a — we increment its eligibility trace by one; and then, every step, for all state-action pairs, even the ones we don't visit, we decay the traces a little bit.
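In symbols, the per-pair trace just described follows the standard recursion (increment the pair actually visited, decay everything by γλ each step):

```latex
E_0(s,a) = 0

E_t(s,a) = \gamma \lambda\, E_{t-1}(s,a) + \mathbf{1}\!\left(S_t = s,\; A_t = a\right)
```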
So you get this process where, if this is the state-action pair we're really considering and I keep coming back to it, every revisit bumps up its eligibility, and over time it decays until I get back there again, when it bumps up again. If I then receive my carrot, whatever the latest eligibility is, that's how much credit or blame I'll assign to that particular state and action. And what do we do once we have these eligibility traces? We update our Q values — for every state and every action — in proportion to the TD error: the surprising event where you got more or less reward than you expected. The credit or blame you assign is proportional to the eligibility trace. So the algorithm is: update the Q values in proportion to the TD error — the difference between what I thought was going to happen and what actually happened — multiplied by this credit-assignment choice, the eligibility.

Question: is there a way to apply this to very large state spaces, where you effectively never visit the same state twice and you're using function approximation for your value functions? Next class. In the next lecture we deal with function approximation for all of the algorithms we've seen so far, and the answer is yes — you can and should do that, because table lookup is typically naive and you can't solve large problems with it until next class, and then you'll be able to.

So SARSA(λ) looks something like this. We initialize the eligibility traces to zero at the beginning of each episode — you can't blame anything yet, you haven't seen anything. Then, for each step of the episode: we pick our action using our policy (we're again on-policy here, for example acting epsilon-greedily); we compute our TD error, the difference between the reward plus the discounted Q value of the state and action we ended up in and our previous estimate — the one-step error between what I thought the value was going to be and what I now think it is; we increment the eligibility trace for the state-action pair we actually visited; we decay the traces for all states and actions, not just the one we visited; and then we update every state-action pair — because anything might now be blamed or credited for what happens — in proportion to the TD error and its eligibility trace. And we just iterate that every single step.
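A minimal sketch of the tabular SARSA(λ) loop just described. The environment interface (env.reset(), env.step(), env.actions) and the epsilon_greedy helper are illustrative assumptions, not names from the lecture:

```python
import numpy as np
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon):
    """Pick a uniformly random action with probability epsilon, else the greedy action."""
    if np.random.rand() < epsilon:
        return actions[np.random.randint(len(actions))]
    values = [Q[(s, a)] for a in actions]
    return actions[int(np.argmax(values))]

def sarsa_lambda_episode(env, Q, alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1):
    """One episode of tabular SARSA(lambda), backward view.

    Q is a defaultdict(float) mapping (state, action) -> value; env is assumed to
    expose reset() -> state and step(action) -> (next_state, reward, done).
    """
    E = defaultdict(float)                       # eligibility trace per (s, a), reset each episode
    s = env.reset()
    a = epsilon_greedy(Q, s, env.actions, epsilon)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = epsilon_greedy(Q, s2, env.actions, epsilon)   # on-policy: next action from the same policy
        # TD error: reward plus discounted value of the next pair, minus the current estimate
        delta = r + gamma * Q[(s2, a2)] * (not done) - Q[(s, a)]
        E[(s, a)] += 1.0                         # bump the trace for the pair we actually visited
        # update *all* traced pairs in proportion to their eligibility, then decay every trace
        for key in list(E.keys()):
            Q[key] += alpha * delta * E[key]
            E[key] *= gamma * lam
        s, a = s2, a2
    return Q
```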
So what does that look like? Hopefully this gives some flavour of the difference between SARSA(λ) and standard SARSA. Imagine a grid world where you're trying to get from a start state to a goal — it could be the windy gridworld we just looked at — and you do a random walk to begin with and eventually end up at the goal. Say you start by initializing all of your state-action values to zero, and we indicate the value of each state-action pair by an arrow in the direction of that action, with the size of the arrow showing how big the Q value is.

What would the updates look like? If you just did one-step SARSA, you get to the end having thought your value function was zero everywhere, and — aha — you actually got a reward of one when you reached the goal. Surprise: now you need to make some updates to adjust the values along your trajectory towards what actually happened. But in SARSA, at every step apart from the final one nothing changes: here I thought the value was zero, and after this step I also thought the value was zero, because we initialized all the Q values to zero, so nothing changes. The only place you get a change is at the very final step, where you thought going north was worth zero and then you got your carrot — so that entry is updated, you get a nice positive increment, and you now think going north from that cell is a good action. If you run another trajectory and end up coming back through that same cell, the information propagates backwards one further step: you thought the value of going west was zero, but you end up somewhere where you now estimate the value of going north to be high, so you increase the value of going west from the previous cell by some large amount. At the next episode it propagates backwards one more step, and so on — you need a lot of episodes, because you're only propagating the information back by one step per episode in SARSA(0).

In SARSA(λ) it looks quite different. The λ parameter determines how quickly — how far — the information propagates back through your trajectory. Consider the same trajectory: you build up eligibility all along it, so every state-action pair you visit has its trace increased; the earliest ones have decayed quite a bit towards zero, the more recent ones have decayed less. Now when you see the reward of one at the end, all of those state-action pairs are increased in proportion to their eligibility traces: when you see this surprising positive event, all of them get updated in the direction of what actually happened, so the information flows backwards, in just one episode, most of the way to the beginning of the episode. You get a much faster flow of information backwards through time by choosing this λ parameter. The idea of λ is that it defeats the tyranny of the time step: if you have many, many time steps, SARSA(0) is always susceptible to needing more and more steps to flow the information back, whereas with λ you can always pick it so that information flows back at the right rate, regardless of the granularity of your time steps.
So λ basically overcomes this tyranny of the time step. OK — any questions before we move on to the final section?

Question: [about why the earlier updates along the trajectory are smaller]. If we go back to the updates in the algorithm: the eligibility trace is incremented in all the states and actions you visit, but it's also decayed every time step, by a factor of λ and also by your discount factor γ. That's why the λ parameter sets how far back in time you look — this decay rate tells you how steeply you want the eligibility trace to decay going back along the trajectory. If you set λ to one, those arrows would all be of equal magnitude: the updates you make all the way along the trajectory would still be large, and that's exactly what happens in Monte Carlo — you run your trajectory, you see some eventual reward, and everyone gets updated in the direction of that reward, because everyone says, hey, I took this action and ended up getting something good, I should be updated. By tuning λ you determine how far back along the trajectory you look, and that controls your bias-variance trade-off: the further back you look, the more variance you have, because of all the random things that happen along the trajectory, but the less bias — you're less susceptible to the effects of bootstrapping.

Question: you said that for the backward view you don't need to wait until the end of the episode — so why does nothing happen here until the end? Great question. We are actually running an update every single step; it's just that the reward only occurs at the end of the episode, so you don't gain any information until you reach it. These arrows show the effective update you've made by the time you get to the end of the episode. But here's the nice thing: imagine you pass through this reward and then carry on, and get another reward, and then another. With SARSA(λ), nothing would happen, nothing would happen, nothing would happen — until you hit your first reward, and then immediately all of those state-action pairs would be updated, even though you haven't finished your episode; then you'd carry on until your next informative event, and everything would be updated again because of that event, and so on. The fact that nothing happens early isn't because we had to wait until the end — it's because we gained no information until that point: we thought the value was zero everywhere and we got zero. We are making updates at every single step; those updates just contain zero information until the first informative event. But you're right that this example doesn't make the online property of the algorithm clear.

There was one more question. [Question: over many episodes, are we basically taking the mean of these action values?] In SARSA we never take an explicit mean; we just use these incremental updates.
The updates we do are precisely the ones we saw in the algorithm: we update each Q value a little bit in the direction of the TD error — the unexpected reward we got — multiplied by the trace, the credit we're assigning to each of those states and actions. That is the process by which we estimate the mean; there's no additional step of taking a mean. It's an incremental way of estimating the mean — the expected value, if you like — and, furthermore, it's a non-stationary estimate. Actually, that's a point I should make more explicitly. If we look at the updates we make, we use a fixed step size α. Could you just take a mean by using 1/N, like we did in Monte Carlo, where you count the number of times you've visited that state-action pair and plug in 1/N(s,a) as your step size? That would give you an incremental way of estimating a mean, but the thing you're plugging into that mean is changing all the time, because you're bootstrapping — so it's not really the same as forming a true mean; you're forming a kind of non-stationary mean over all the different targets you've seen at all these different times. So I think the right answer to the question is: we are forming an estimate of the expected future reward all the way to the end of the episode. One way to estimate that is Monte Carlo learning, which explicitly estimates the mean. Another mechanism for estimating that expectation is bootstrapping: instead of computing the mean, we update a little bit in the direction of the errors we see — we look at the error and use it to inform what we think the previous value should have been — and that is the process by which we estimate the expectation.

And again, the equivalence between the backward and forward views is not obvious, but it can be proved — there was a proof in the last lecture, and that proof carries through, roughly, to the control case as well. A useful way to understand what's going on is that when we use these traces, we are implementing the forward-view algorithm in an incremental way: updating our Q values a little bit in a direction that mixes over our estimates at all these different time scales, bootstrapping from all the different n-step returns. Another way to look at it: go back to the gridworld picture and ask what the forward view would look like. Forget eligibility traces — in the forward view, starting at each point on this diagram, you really do wait until the end of the episode, and then, from each of these states, you consider the one-step return, which gives you no information, the two-step return, which gives you no information, the three-step return, again no information.
The only returns that give you any information are the longer ones that go all the way to the reward, or beyond. And the weight you put on that final, largest-n return is precisely the amount by which that state gets updated: if you're very close to the end, all of your weight goes on the final return, because the episode stops there; whereas if you're way back at the beginning, a lot of your weight goes on the shorter returns that don't contain any information yet. So you get the same effective result — you can see intuitively, just by looking at these diagrams, that you get the same picture from either the forward-view computation or the backward-view computation. But the backward view is computationally much nicer: we do it online, we keep just one step in memory, we do the update and move on; we don't need to look forward in time or wait until the end of our episodes, and it works in situations where you don't even have episodes.

Question: just one question about the size of storing the eligibility trace in the computer — is it a kind of matrix with two dimensions, one the number of states and the other the number of actions? Yes. In this lecture we're building tables for everything: Q is a table of size number of states times number of actions, and so are the eligibility trace and the count N(s,a) we used when we were doing Monte Carlo. You can store each of them as a matrix, if that's a helpful way to think of it. In the next class we're going to look at ways to approximate that potentially very large matrix by an arbitrary function approximator with a smaller number of parameters, for both Q and the eligibility traces.

OK, let's change gear a bit. Now we're going to talk about off-policy learning. Everything so far has considered on-policy learning, where the policy I'm following is the policy I'm learning about. But there are many cases where we want to evaluate some other policy while we're following a different one. Say we follow some behaviour policy, which we'll call μ — the policy we actually pick actions from in the environment — and what we want to do is evaluate some other policy π: compute v_π or q_π, or ultimately figure out the optimal way to behave in the environment. But the behaviour we observe is not drawn from the policy we're asking about. Why would we want to do that? Why do we care about off-policy learning? Here are four possible reasons you might care about this a lot. The first is that you might want to learn from observing other agents, such as humans — not just doing supervised learning to copy what the human did, but looking at the traces of behaviour, the episodes the human executed, and, just from that experience, figuring out how to do better — how to solve the MDP just from watching some human steer a boat around, and then saying: I know how you could have done better; you could have got more reward by following this other path. That's the goal: to observe some other policy and maybe figure out how to do better, or learn from one behaviour what the value of another behaviour would be.
A specific case — we used this, for example, in the Atari work I mentioned in the first lecture — is to reuse experience generated from old policies. Imagine we're doing generalized policy iteration, so we move through a whole sequence of policies: we start with some initial random policy π1, make it a little better to get π2, and eventually end up with better and better policies. What if we want to go back over the data generated using those old policies and use it more than once, instead of just discarding it, which is very data-inefficient? We'd like to look at it again and say: I remember doing this — I've seen this state before; last time I was here I took this other action and got this other reward — and reuse that data, a bit like batch methods, treating it as a big data set and figuring out the best thing to do from it. But what's the problem there? The problem is that when we do policy iteration we're trying to evaluate our current policy — we want to know how good it is — but we've generated behaviour using all kinds of different, older policies. Is it possible to make use of that additional data to get a better estimate of the value of the current policy? The answer is yes: with off-policy learning you can treat those old policies as just other behaviours that generated the data — think of them like the human, or like some other robot — and learn from all those other sources of experience how to do better.

Probably the best-known reason off-policy learning is used is the third point: exploration. We know there's a big issue in reinforcement learning of making sure you explore the state space effectively, while at the same time we want to learn about the optimal behaviour, which doesn't explore at all. So why not make use of off-policy learning: we can have an arbitrary exploratory policy that just explores around with as much exploration as we want, and at the same time learn about the optimal policy. That requires us to be off-policy — we want to learn how to behave optimally, something which is ultimately going to be deterministic, whilst following something completely random.

And, perhaps looking forward to grander visions, the most exciting reason to care about off-policy learning is that we might want to learn about multiple policies while following one. There might be many different behaviours I want to reason about: what will happen if I leave this room — how much reward will I get, will you be pleased or disappointed? I don't know; maybe I should figure that out whilst carrying on talking to you. I might also be wondering what would happen if I changed topic and went back three slides. There are all kinds of questions I might be asking about the future, each conditioned on following a different policy.
I only get this one stream of data that we call life, and I want to learn about all of these other policies from it — to figure out the answers to all kinds of conditional questions whilst following that one stream of experience. So these are all reasons to care about off-policy learning. It's a significant problem, and actually a thorny one for reinforcement learning, as we'll see next class. We're going to look at two mechanisms for dealing with it, the first of which is importance sampling.

The main idea of importance sampling is to take an expectation — say your expected future reward, which is just a sum over probabilities times the reward you got — and multiply and divide by some other distribution. The ratio you introduce lets you rewrite it as an expectation over the other distribution: you multiply and divide by the ratio between your old and new distributions, and that corrects for the change between them. We can apply importance sampling to Monte Carlo learning by importance-sampling along the entire trajectory: at every single step there was some action I took according to my behaviour policy, and some probability that action would have been taken under the policy I'm actually trying to learn about, and you have to multiply all of those ratios together across the whole trajectory. In general, the probability that the return you saw under your behaviour policy tells you anything at all about what would have happened under some completely different policy becomes vanishingly small, so this is extremely high variance, and in practice it's essentially useless. Monte Carlo learning is a really bad idea off-policy — over many steps your target policy and your behaviour policy just never match enough for it to be useful — so when you're working off-policy it becomes imperative to bootstrap, and you have to use TD learning. The simplest idea is then to importance-sample the TD target, but you only need to importance-sample over one step, because we're bootstrapping after one step. So all you have to do is update your value function a little bit in the direction of the TD target, where the target is weighted by the ratio between the probability of that action under the target policy and under the behaviour policy — we weight how much we trust the target by how much our target policy really matches the action that was actually taken in the environment.
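The two corrections just described, in the usual notation — the Monte Carlo version multiplies ratios along the whole trajectory, while the TD version corrects only a single step:

```latex
% Monte Carlo: correct the whole return with a product of ratios (very high variance)
G_t^{\pi/\mu} = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}\,
                \frac{\pi(A_{t+1} \mid S_{t+1})}{\mu(A_{t+1} \mid S_{t+1})} \cdots
                \frac{\pi(A_T \mid S_T)}{\mu(A_T \mid S_T)}\, G_t

% TD: importance-sample the one-step target only
V(S_t) \leftarrow V(S_t) + \alpha \left(
    \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}
    \left( R_{t+1} + \gamma V(S_{t+1}) \right) - V(S_t) \right)
```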
If the two policies don't match at all, you essentially either reject that target completely or put a huge weight on it — so you can still get high variance; importance sampling still increases the variance and it can still blow up — but it's much, much better than Monte Carlo.

Now, the idea that works best for off-policy learning is known as Q-learning. This is specific to TD(0) — to SARSA(0), for example. We're going to consider a specific case where we make use of our Q values, our action values, to do off-policy learning in a particularly efficient way that doesn't require importance sampling. The idea is very simple: we select our next action using our behaviour policy, but we also consider some alternative successor action that we might have taken had we been following our target policy. So A_{t+1} is the random variable representing the real action taken in the world at time step t+1, and A′ is our alternative — we're just imagining: if I'd taken this action under my target policy, what might the world have looked like? And now all we do is update the Q value for the state we started in and the action we actually took: we update it towards the reward plus the discounted value of the alternative action. When we bootstrap, we bootstrap from the value of the alternative action, because that's the thing that tells us how much value we would actually have got under our target policy. So it looks like this: we replace the SARSA update — the Q value for (S, A), the action I actually took, even if it isn't the one I would take now — and update it a little bit in the direction of the reward plus the discounted Q value, at the next state, of my alternative action under my target policy. That is what would have happened if I were really following the policy I care about, so this is a valid update: it's the Bellman equation again, completely valid, and the Bellman equation for Q doesn't require us to importance-sample — we're just sampling that Bellman equation. That's the idea of Q-learning.

Question: can you be clear which parts come from stored behaviour and which are online in that last equation? Nothing is stored. You sample A_{t+1} from your behaviour policy and you sample A′ from your target policy. Each time you get to a state, you consider the action you actually took — that generates the real behaviour, the trajectory you're actually following — but at every step you also consider some alternative action you might have taken, and that alternative action's Q value is the thing you bootstrap from.
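In symbols, the general off-policy update with an alternative bootstrap action, as described above:

```latex
% Real action from the behaviour policy, bootstrap action from the target policy
A_{t+1} \sim \mu(\cdot \mid S_t), \qquad A' \sim \pi(\cdot \mid S_{t+1})

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left(
    R_{t+1} + \gamma\, Q(S_{t+1}, A') - Q(S_t, A_t) \right)
```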
Now there's a special case of this, which is the well-known Q-learning algorithm. If you hear about "the Q-learning algorithm", it's essentially this special case, where the target policy is a greedy policy — the case where we're trying to learn about greedy behaviour while following some exploratory behaviour. The special thing about Q-learning is that both the behaviour and the target policies can improve: we allow improvement steps to both of them. At every step we make the target policy greedy with respect to our value function — we're asking about this deterministic thing that acts as aggressively as possible, learning about what we currently consider to be the best behaviour given our Q values — while the behaviour policy is epsilon-greedy, so it roughly follows sensible actions that get us to good parts of the state space, but explores a little as well to make sure we reach new things. When we plug this into the previous slide, the target we had was R plus γ times the Q value of our alternative action; the alternative action is now the argmax, the greedy action — and the Q value of the argmax of your Q values is just the max of your Q values. So what Q-learning does is update a little bit in the direction of the maximum Q value you could take after one step. I call these SARSA-max updates, because you use S, A, R, S′, and then the max over the A′: we're maxing over the successor actions and updating our Q values a little bit in the direction of the best possible next Q value after one step. This should be familiar from the Bellman optimality equation — it's basically a sampled Bellman optimality equation — and this algorithm again converges to the optimal action-value function. It's a standard idea in reinforcement learning.
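A minimal sketch of one tabular Q-learning ("SARSA-max") update, assuming the Q table is a 2-D NumPy array indexed by integer state and action; the commented-out episode loop, env interface, and epsilon_greedy helper are illustrative assumptions, not from the lecture:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s2, done, alpha=0.1, gamma=0.99):
    """One Q-learning update: the behaviour policy chose action `a` in state `s`,
    but the target is greedy, so we bootstrap from the best Q value in the next state."""
    target = r + gamma * np.max(Q[s2]) * (not done)   # max over successor actions
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Sketch of how it would sit inside an episode loop:
# s = env.reset()
# done = False
# while not done:
#     a = epsilon_greedy(Q, s, epsilon)    # behaviour policy: epsilon-greedy
#     s2, r, done = env.step(a)
#     Q = q_learning_step(Q, s, a, r, s2, done)
#     s = s2
```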
So, just to wrap up, I want to bring out the relationships between the full-width updates we had in dynamic programming and what goes on in temporal-difference learning. In one column we have the dynamic programming algorithms we looked at in a previous lecture; in the other column are the sample-based TD algorithms from the last lecture and this one. If you use the Bellman expectation equation for the state-value function V, you can either use dynamic programming to evaluate your current policy, or you can take just one sample of that backup — sampling from the environment and from your policy — and that gives you TD learning. These algorithms are just samples of the dynamic programming ones: wherever you have an expectation, you sample it, and the targets for your TD learning algorithms are samples of the right-hand side of the Bellman equation. When we use the Bellman expectation equation for q_π, we can evaluate our Q values and plug that into generalized policy iteration to make our policies better and better — either with a policy iteration framework in dynamic programming, or by sampling, which gives us the SARSA algorithm. And finally we can use the Bellman optimality equation for q*: we can either plug that into value iteration, the dynamic programming method, or sample the expectations, which gives us the Q-learning algorithm. So you can think of TD as samples of the Bellman expectation equations or the Bellman optimality equations — a one-step sample of what would happen if you were doing dynamic programming. That's just to tie things together. OK, thanks everyone — next week we'll talk about function approximation: how to scale up and make these things work in larger and more practical examples. Cheers, guys.

In some sense, everything in the course up to this point has been leading to this lecture. We're finally going to find out: if you drop your robot or agent into some unknown environment, and you don't tell it anything about how that environment works, how can it figure out the right thing to do? How can it maximize its reward in that environment — figure out the actions that lead to the most possible future reward — without being told anything in advance about how the environment works? That's the goal of this class: model-free control. After this class, in the subsequent lectures, we'll look at how to scale up — how to get to more realistic, large-scale domains — but the core techniques essentially build on what we saw in the last lecture, brought into a control setting.

Roughly, the proceedings for today: we'll have a brief introduction, and then we'll look at two different paradigms for doing control — on-policy and off-policy — and we'll understand the distinction between them. Essentially, on-policy means learning on the job, whilst you're following the behaviour that you're learning about, and off-policy means learning whilst following someone else's behaviour. We'll start with on-policy, which is the simpler case, and, just like last lecture, we'll follow the format of starting with Monte Carlo control before moving on to temporal-difference learning to get more efficient, lower-variance techniques.

How are we going to do that? In the last lecture we built up the toolkit we need: we saw how to do model-free prediction — how to evaluate a given policy. If you drop your robot into a given MDP and ask how much reward it will get if it follows a particular behaviour, we saw how to answer that using Monte Carlo evaluation and temporal-difference learning, which helped us estimate the value function of an unknown MDP. That basic toolkit is what we'll now use for control, where we don't just want to estimate the value function but actually optimize it — find v*, find q*, find the best possible behaviour and the most reward you can extract from this MDP. We're going to use the same tools, iterate them, and essentially use what we had before as an inner loop for getting to the best possible behaviour.

Just to get you back in the flavour — I know we've had a week off for reading week — it's helpful to be reminded why this is interesting. Why should we care about this particular problem: model-free reinforcement learning in an unknown MDP?
The answer is that in most interesting problems — and I've put up just a few that fall under this umbrella: controlling an elevator, getting a car to park itself, automatically steering a ship, controlling a bioreactor, navigating a helicopter, doing the logistics for an aeroplane, playing RoboCup soccer, controlling a bot in Quake, managing a financial portfolio, learning how to fold proteins, figuring out how to make a robot walk by itself, or playing a complicated game — there is an MDP description of the problem. In each case there's some complicated environment in which the problem can be described, with some underlying dynamics and some state that determines how it moves around. But either that environment is unknown to us, so we have to resort to model-free control to figure out what to do, or the MDP might be known but so complicated that it's too big to use except by sampling it. And if you can only sample it, you're essentially back in the same situation as model-free control: the only way to access the model is to try things, trial and error, and see what happens, because even one step of integrating over some vastly complicated model might be too expensive. The real world is an example: its underlying dynamics are extremely complicated, and even if we knew them, you wouldn't want to work at the atomic level and integrate over the dynamics of every atom moving around — you'd just take samples, see what happens, and work with that. So model-free control helps us to solve problems like these. That's why this lecture is interesting, I hope.

OK, so we're going to look at this basic dichotomy between on-policy learning and off-policy learning; let's understand what that means. On-policy learning I like to call "learning on the job": you've got some policy, you follow it, and while you're following it you learn about it. You're sampling experience by sampling actions from some policy π, and at the same time you're trying to evaluate that same policy — that's what we mean by on-policy learning: the actions you take determine the policy you're trying to evaluate. But that's not the only choice. We can also do off-policy learning, which you can think of as "looking over someone's shoulder". You might have some robot watching another robot walk around and trying to figure out the optimal behaviour just by observing it — or maybe even by observing a human. Maybe a human gives some example demonstrations, say with a teleoperated arm, showing how to move the arm around, and now, just from the trajectories sampled from the human's behaviour, the robot could do off-policy learning: it could learn to evaluate for itself what would happen if it did something else — maybe there's some better behaviour, and the robot can learn about that better behaviour from samples of experience drawn from the human's behaviour. You don't have to learn about the behaviour you sample; you can learn about other behaviour. That's called off-policy learning.
OK, so we'll start with the simpler case: on-policy learning. The main idea we're going to use throughout is the same one we saw a couple of lectures ago when we were doing dynamic programming: the basic framework of generalized policy iteration. A quick refresher slide: the main idea in policy iteration is to alternate between two processes — first we evaluate a policy, then we improve it — in an iterative loop, sketched in this diagram. You start with some value function and some policy; every time you go up, you evaluate your current policy to get a new value function, and every time you go down, you act, say, greedily with respect to that value function. We saw when we were doing dynamic programming that this alternating process converges to the optimal policy and the optimal value function — it's a cycle that just goes round and round. The question is: what can we slot in for those two processes? What we're going to do today is vary the choices of what goes into those two pieces, so that we can do model-free control, rather than dynamic programming, where everything used full knowledge of the MDP model. So far we've only seen the dynamic programming versions, where the up arrows were some way to evaluate the policy, like iterative policy evaluation, and the down arrows were things like greedy policy improvement; now we're going to vary those two components and try other choices.

So let's make a first attempt, starting with the simplest case: Monte Carlo control — that's the chapter of the lecture we're in now; later we'll move on to TD control. If we're doing Monte Carlo control, is it possible to just swap in Monte Carlo policy evaluation instead of dynamic programming as our way to evaluate the policy? Remember, Monte Carlo evaluation just executes some trajectories and takes the mean return as the estimate of your value function. So what about doing that as our process for estimating the value of the policy? What would happen if we just ran some trajectories, used them to estimate V, and then applied greedy policy improvement — would that work? Any thoughts — any problems with this idea? Does everyone understand the suggestion: take the generalized policy iteration framework and plug in a model-free step for policy evaluation — run some trajectories in the unknown environment, use them to estimate the value of the policy, act greedily with respect to that, run more trajectories, and so on? According to our knowledge of generalized policy iteration, this has to converge to the optimal policy. So what's wrong? There are actually two problems with this.

Audience: if, when you're running your trajectories, you follow your policy, then you're not going to explore. Great — that's one of the two problems; we'll come to it in a second.
The point, which is correct, is that there's an exploration issue: if you act greedily all the time, you don't guarantee that the trajectories you follow explore the entire state space, so there might be parts of the state space that are really interesting — and actually have better potential — that you just never see. There's one more problem, and it's with the first step, with evaluating V in this way. Any thoughts?

Audience: you need to run a whole load of trials to get a good measure of what the best policy would be. So the point was that this might be slow — you might need a lot of episodes to make it work. That's true, and we'll improve on it by using temporal-difference learning later. But let me move on to the issue I'm after: we're trying to be model-free, and how can you be model-free when you're using the state-value function V? If you only have the state-value function and you want to act greedily with respect to it, you still need a model of the MDP: you only know the value of each state, so if I'm in some state and want to know which action is best, I have to imagine what would happen for each possible action — roll the dynamics forward one step and look at the value of the successor state. We don't know those dynamics; we're trying to be model-free. So if you work with the state-value function V, you always need a model in order to do your policy improvement, because you need one step of lookahead to get to your next state. The alternative is to use Q. Action-value functions enable us to do control in a model-free setting, because we can do policy improvement in an entirely model-free way: if we have Q and want to do greedy policy improvement, all we need to do is maximize over our Q values. The Q values tell us, from each state, how good it is to take each different action; if we want to make a better policy, we just pick the action that maximizes our Q values. That's the best action, and it's immediate — you don't need a model to roll anything forward; it's already there in the structure we've created. By caching the values of our actions, we reduce the burden of what we require from the model: we don't need to look at the model any more, and we don't need to roll it forward. Is that clear?
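The contrast just described, in the usual MDP notation (reward function R and transition probabilities P):

```latex
% Greedy improvement over V requires a model of the MDP:
\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a
          + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, V(s') \right)

% Greedy improvement over Q is model-free:
\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)
```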
OK, so let's drop that in. What would that look like? The proposal now is an algorithm where we start with a Q value function — an action-value function — and some policy. Every iteration we do Monte Carlo policy evaluation: run a bunch of episodes, and at the end of them, for each state-action pair, look at the mean return to estimate Q(s,a) — how good it was to take that particular action from that state. Doing that for all states and actions gives us a new value function. We then act greedily with respect to Q — that's immediate, we just take the argmax over our Q values — which gives us a new policy; we run that policy for a bunch more episodes, take the mean for all our state-action pairs again, and iterate. And again this results in q* and π*. Or does it?

The issue has already been raised by someone in the audience: we still have one problem, which is that we're acting greedily as our policy improvement step, and if you act greedily you can get stuck — you may not see the states you need in order to get a correct estimate of the value function. We're estimating the value of all state-action pairs by trial and error, by running episodes, but if we don't see certain states and actions, we won't evaluate them correctly, and then we won't select them, because we haven't seen how good they are. That's the distinction now: we're not doing the full sweeps we had in dynamic programming; we're running things by actually interacting with the environment, so we need to make sure we continue to see everything.

Here's an example to make that a bit clearer — something from the academic world. Imagine there's a situation in life where you've got two doors and you're trying to pick the best one; in the cartoon, behind one door is tenure and behind the other is flipping burgers at McDonald's. You have to figure out by trial and error which door to go through; you get repeated trials, and the outcome behind each door is stochastic — you just get some immediate reward. This is called a bandit problem; we'll see more examples in the later lecture that focuses on exploration. You've got two choices, door one or door two. What happens if you open the left door and see a reward of zero? You think: zero wasn't so great, maybe I'll try the right door — and you get a reward of plus one. If you're acting greedily, there's only one logical thing to do, which is to open the right door again; say you get a reward of plus three. Your mean for that door, having seen +1 and +3, is now plus two, which is clearly better than zero, so acting greedily you pick it again. Maybe you get plus two, so your mean stays at two; you open it again, and your mean just fluctuates a little around two, say with rewards always between one and three — and you'll keep opening that door forever. The problem is that we don't really know the value of the left door: we've only tried it once. You need to carry on exploring everything to make sure you understand the true value of all your options. If you stop exploring certain actions, you can end up making incorrect decisions, getting stuck at some local, incorrect optimum — your beliefs can be incorrect, and you need to keep exploring to make sure you really understand the values. Do people see how you can get stuck in this situation and just keep mis-evaluating?
So how do we make sure that we carry on exploring? There will be a whole lecture on this, but we're going to start with the simplest possible way to guarantee that you continue to visit all states and all actions infinitely often. It's actually surprisingly hard to do better than this simple idea, although there are many much more sophisticated algorithms. The simplest idea is called epsilon-greedy exploration, and it's very simple: you flip a coin, and with probability ε you choose a completely random action, uniformly at random; with probability 1 − ε you choose the greedy action. One way the coin lands you act completely randomly, to make sure you see everything; the other way you choose the best thing you know so far, the argmax over your Q-values. So it's just like our greedy policy improvement, but softened a little so that with some small probability you try all the other actions. This might seem naive, but it has some nice properties: it guarantees that you continue to explore everything, and it guarantees that you actually improve your policy, as we'll see shortly. Written out with m actions, each non-greedy action gets probability ε/m — that's the exploration part — and the greedy action gets probability ε/m + 1 − ε, because you might also pick the greedy action by chance when you explore, which is why it carries a bit more probability mass. But the idea is very simple: you either take the best action, or you explore at random.

One reason epsilon-greedy is a nice idea is that it actually guarantees a step of policy improvement. Remember that in generalised policy iteration we want to alternate between improving our policy and evaluating it. What we can show is that epsilon-greedy really is a policy improvement: you start with one epsilon-greedy policy π, you evaluate it to get q_π, and the question is whether acting epsilon-greedily with respect to that evaluation is guaranteed to give a better policy. The answer is yes. Here's a simple proof: we do the same thing we saw in the dynamic programming lecture, proving the result over one step and then arguing that the whole thing telescopes by unrolling the Bellman equation. Over one step, we want to show that taking one step of the new epsilon-greedy policy and then following the original policy is at least as good as the original policy's value. That quantity is just an expectation over the new policy: for each action there's some probability of taking it, multiplied by the value of that action — this is just unrolling the expectation.
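Written out (this follows the standard argument on the course slides; m is the number of actions and π′ is my label for the new ε-greedy policy), the one-step step is:

$$
q_\pi(s, \pi'(s)) \;=\; \sum_{a} \pi'(a \mid s)\, q_\pi(s, a)
\;=\; \frac{\epsilon}{m} \sum_{a} q_\pi(s, a) \;+\; (1 - \epsilon)\, \max_{a} q_\pi(s, a)
$$

$$
\ge\; \frac{\epsilon}{m} \sum_{a} q_\pi(s, a) \;+\; (1 - \epsilon) \sum_{a} \frac{\pi(a \mid s) - \frac{\epsilon}{m}}{1 - \epsilon}\, q_\pi(s, a)
\;=\; \sum_{a} \pi(a \mid s)\, q_\pi(s, a) \;=\; v_\pi(s),
$$

where the inequality uses the fact that the max is at least any weighted average, and the weights (π(a|s) − ε/m)/(1 − ε) are non-negative and sum to one because π is itself ε-greedy.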
In words: we split that expectation into the probability of taking the greedy action times its value, plus the probability of taking each other action times its value — one part is what happens when you explore, the other is what happens when you act greedily. Now, the greedy term, where you take the max over all your Q-values, has to be at least as good as any weighted sum of your Q-values: the max is better than any weighted combination. So we pick one particular weighting that sums to one, and say the max must be at least that weighted sum. Then you just collect terms: the 1 − ε in the numerator cancels with the 1 − ε in the denominator, and you're left with an expectation over your Q-values, which is just the original value function. You can go through this afterwards, but the main thing to take away is that epsilon-greedy, a very simple idea, really does guarantee that you're making progress: you start with some epsilon-greedy policy, you evaluate it, you make a new epsilon-greedy policy, and the new one is definitely at least as good as what you started with. For the final step, to telescope this into a statement about the whole value function, refer back to the dynamic programming lecture that explained how the telescoping argument works — it's the same idea. That's the one slide of maths.

Question: this proof doesn't really tell us anything about the rate of exploration, does it? Right — the question is whether it tells us how much we should be exploring in a real problem, how ε affects things. No, this doesn't in itself tell us how rapidly we should explore. In general, as we'll see in a moment, you need to choose a schedule for ε and make sure it decays roughly to zero — but we'll come to that.

So let's see how to plug this in. We've got two ideas now. We started with the generalised policy iteration framework, and we've changed each of its two processes. For policy evaluation we now plug in Monte Carlo, using Q, the action values: we run some episodes with our current policy, and for every state-action pair we look at the mean return we've seen — that's the evaluation procedure (as someone pointed out, it might be a bit inefficient, and we'll see how to improve that shortly). And we've changed the policy improvement procedure to use epsilon-greedy policy improvement, this softened version of the greedy idea. So every time we come down in the diagram we're not moving all the way to the completely greedy policy; we keep a little bit of exploration, and the policy along this line is now always stochastic.
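For concreteness, a minimal sketch of that ε-greedy selection step over a tabular Q (the names `q`, `state` and `n_actions` are just my placeholders, not anything defined in the lecture):

```python
import random

def epsilon_greedy_action(q, state, n_actions, epsilon):
    """Pick an action epsilon-greedily from a tabular action-value function.

    q is assumed to be a dict mapping (state, action) -> estimated value.
    With probability epsilon we explore uniformly at random; otherwise we
    take the action with the highest current Q estimate.
    """
    if random.random() < epsilon:
        return random.randrange(n_actions)                      # explore
    return max(range(n_actions), key=lambda a: q[(state, a)])   # act greedily
```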
That stochasticity in the policy ensures that, at some rate at least, we explore everything in the environment. Now, as the question pointed out, the rate at which you actually see things might still be very slow: there might be states way off down some strange trajectory, and you'd have to explore again and again just to reach them. So this doesn't give you strong guarantees about how long it takes to explore all of those states; it just says that asymptotically this process really will find the optimal policy at the end. (The slide should say Q* there, by the way — it should all be Q rather than V.)

So let's try to make this a little more efficient. The first thing to notice — again, we saw it in the dynamic programming lecture — is that in these policy iteration frameworks it's not necessary to go all the way to the top of the line every time; you don't have to fully evaluate your policy. Sometimes a few steps of evaluation already give you enough information to guide you to a much better policy, without wasting many more iterations gathering data. What does that look like for Monte Carlo? Let's take it to the extreme and do this every single episode: the robot runs one episode, we collect the steps along that episode, and we update the Q-values just for those steps — for every state and action taken along that one return, we update the mean value of just those visited state-action pairs. One episode, one sequence of updates to our returns, and then we improve the policy straight away. Why wait for more episodes when you can already improve the policy? The idea is always to act greedily with respect to the freshest, most recent estimate of the value function: if after one episode you can already nudge the value function towards something slightly better, why keep using your old estimate to generate behaviour? You may as well use the updated values. So the rate at which we cycle around the diagram increases: we don't go as high up, we run one episode, immediately improve the policy, run one more episode, improve the policy again. Is that clear?

This question has already come up, and it's a natural one: how can we really guarantee that we find the best possible policy? What we really want is π*, the best possible behaviour in this environment. To get that we have to balance two things. We need to keep exploring, so that we don't exclude things that might be better; but we also want the policy to become asymptotically greedy, because the best possible policy will not include random exploratory behaviour — that's extremely unlikely to be optimal. So how do we balance those two factors? One idea is called GLIE: greedy in the limit with infinite exploration.
The idea of GLIE is to pick any schedule for exploration such that two conditions are met. The first is that you continue to explore everything: all states and all actions are visited infinitely often. That guarantees you never miss out on anything — every reachable part of the state space will be seen, and every action from every reachable state will be tried. Epsilon-greedy has that property: if you behave randomly some of the time, then eventually every reachable state, and every action from those states, will be tried. The second condition is that the policy eventually becomes greedy. It needs to become greedy because we ultimately need to satisfy the Bellman optimality equation, and that equation has a max in it, not some softened version — eventually we really need to be acting greedily with respect to Q, taking the argmax of our Q-values. One way to achieve both — by no means the best way, but certainly a simple one — is to use an epsilon-greedy policy and decay ε slowly towards zero. For example, a hyperbolic schedule, where on the k-th episode you set ε_k = 1/k, satisfies both properties: you see everything infinitely often, but asymptotically you get closer and closer to the greedy policy.

So what does that look like in practice? We can now build an algorithm called GLIE Monte Carlo control. We sample episodes: we throw the robot down, run one episode, and that generates a trajectory of states, actions and rewards until we reach some terminal state, sampled from our current policy π. Then, for each state and action visited in that episode, we update the action value. We do that by counting how many times we've seen that state-action pair and doing an incremental update to the mean: for that particular state and action, we adjust our previous estimate a little bit in the direction of the return we just got, and the amount we adjust by, to keep a correct mean, is 1/N(s, a). What's interesting is that this isn't a statistical mean over some i.i.d. quantity: the policy is changing over time, because we're iterating on it and improving it, so we're folding in returns from better and better policies. As we improve the policy we gather more and more statistics, and we collect all of them together into one overall mean of how good each state-action pair is.
The GLIE property ensures that, over time, the statistics we collect converge on the mean return of policies that look more and more like the greedy policy: the past gets dominated, eventually, by these newer, better policies, so we learn more and more about the policy we really care about. So we iterate over the whole process: sample the k-th episode, update our Q-values, then improve the policy. That was the policy evaluation step; the policy improvement step is simply to set ε to its new value, 1/k, and act epsilon-greedily with respect to the new Q-values. Those Q-values are the same everywhere except at the states and actions we just visited — we only update the Q-values along one trajectory — but that's already enough to change the behaviour we actually see. In practice, if you're implementing this, you don't need to store π explicitly; you just store Q, and when you come to select an action you look at the freshest, most recent Q, flip a coin, and either pick the maximising action or a random one. The policy is implicit, and all you need to remember is the 1/k schedule that determines the bias of the coin. This algorithm actually finds the optimal action-value function. It's really our first full solution: something you can throw at any MDP, let loose, and it will find the right answer. And it's considerably more efficient to run it this way — updating the Q-values every single episode and immediately improving the policy — than to generate a whole batch of thousands of episodes just to get one evaluation of Q; it works much better in practice. Are people clear about this algorithm? It's a good time to ask questions, because we're going to keep adding to these ideas. OK, I'll take silence to mean comprehension.

Question: is there any particular idea behind the initialisation of the values for Q, or do you just initialise randomly? So the question is whether it matters how you initialise Q. In terms of the theory you can start with anything and it will still converge. And for this particular algorithm it actually makes no difference at all in practice either, because we're taking an exact mean: the very first time you see a state-action pair, N is one, so you completely replace whatever initial Q-value you started with by the return you just observed. So for this particular algorithm it makes absolutely no difference what your initial Q-values are.
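Pulling the last couple of paragraphs together, here is a minimal sketch of GLIE Monte Carlo control; the environment interface (`env.reset()`, and `env.step(a)` returning `(next_state, reward, done)`) and the name `n_actions` are assumptions for illustration, not part of the lecture:

```python
import random
from collections import defaultdict

def glie_mc_control(env, n_actions, n_episodes, gamma=1.0):
    q = defaultdict(float)       # Q(s, a) table, default 0
    n = defaultdict(int)         # visit counts N(s, a)

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k        # GLIE schedule: epsilon_k = 1/k

        # Generate one episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Policy evaluation: incremental mean toward the return G_t for each
        # (every-visit) state-action pair in this episode.
        g = 0.0
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            n[(state, action)] += 1
            q[(state, action)] += (g - q[(state, action)]) / n[(state, action)]
        # Policy improvement is implicit: the next episode acts epsilon-greedily
        # with respect to the freshest Q, using the smaller epsilon_{k+1}.
    return q
```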
In the rest of this lecture, though, we'll see algorithms that don't have this exact mean — instead they use a non-stationary estimator with something like a constant step size α — and for those algorithms the initial Q-values matter much more, because you move incrementally away from that starting point, so the closer your initial values are to optimal, the better. So let me correct myself on that point.

One more question, then I'll move on. [Student: you said it's more efficient to update every episode, but what if you can run many episodes in parallel?] That's a great point. I'm focusing on serial computation for now. There are many possibilities for exploiting parallel computation, and you're right that parallelism introduces constraints on what you can and can't do. I think roughly the same principle still holds — you still want to use the freshest possible Q-values you can — but with parallel computation those values might necessarily be a little more stale than the very latest episode, simply because you don't have immediate access to everything being computed in parallel.

OK, a quick example. We had this blackjack example in the previous lecture, where we only looked at evaluation: I send you into a casino and tell you that you always ask for a new card unless you're on 19, 20 or 21 — I think that was the policy we had before — and the question was how much money you'd win or lose in that casino. That gave us a value function with a particular shape. Now we're actually running the algorithm from the previous slide to generate the optimal value function, which is probably the question you really want answered: if I walk into a casino, what policy should I play to get the maximum possible reward? (Real casinos use slightly different rules, so don't try this — and I'm not responsible if you lose all your money.) So we run our Monte Carlo control, iterating episode by episode, each time updating the value function a little. Internally we store Q, but to make the plots easier to look at we collapse Q over the actions to get a state value to display; you have some Q that generates some epsilon-greedy policy, and you alternate back and forth by running episodes, just as in the last algorithm. You end up with a policy that looks like this. If you don't have an ace in your hand, the optimal policy is this quite interesting-looking threshold, somewhere around 16 or 17 depending on what the dealer is showing and what cards you've got: with 16 or less it's generally better to ask for a new card, but there are situations where the dealer has a difficult card to play with and a high probability of going bust, in which case you should just stick anyway and wait for the dealer to go bust. So you vary your policy according to that.
If you've got an ace in your hand, you can afford to ask for more cards more often, because the ace can count as either a one or an eleven. And this is what the value function looks like — just a continuation of that last example, to show you that this process really does end up with usable information: you get these nice shapes out of the value functions that characterise exactly the optimal behaviour in this domain, and you really do end up with the optimal policy. There's no way to do better than this in the particular game we've described.

Question: why would you ever stick at 11? You can't go bust. You wouldn't — the plot always says to hit on 11 when the dealer is showing between four and six; the line is between 11 and 12. It's a grid, and each cell indicates what to do: for 11 you should always hit, but if you've got 12 and the dealer has a bad card, there's some chance the dealer goes bust, so you should sometimes just stick on that 12.

OK, we're now going to move on to temporal-difference learning methods. Just like the last lecture, we split things between Monte Carlo learning and TD learning, and then saw there's a spectrum in between. We're going to follow the same process now, but with control methods rather than evaluation methods: we want to understand how we can gain efficiency by bootstrapping, using the same tricks we saw last time. Why use TD? Because it can have lower variance; because it can run online, in continuing domains — even with no termination you can still run it; and because it can learn from incomplete sequences. And we'll see in this lecture that there's an additional benefit: when you use off-policy learning, it becomes particularly important to use TD. So the natural idea is to try the obvious thing: use the same generalised policy iteration strategy but — knowing that model-free policy iteration needs Q, so that no lookahead is required for our greedy or epsilon-greedy policy improvement — use TD learning to estimate Q. We slot TD learning in place of Monte Carlo policy evaluation and continue to use epsilon-greedy policy improvement. That immediately gives us an algorithm, probably the best-known algorithm in reinforcement learning. The only other change is that TD can operate at a finer timescale: you don't need to wait until the end of the episode to update your value function — you can update it after just one step. That's one of its advantages: it's online, you see one step of data, you bootstrap, you update your value function immediately. So, following the general principle of always using the freshest, most recent value function to pick your actions, we're going to increase the frequency of policy improvement to every single time step. The general idea is called SARSA.
Why is it called SARSA? This diagram should illustrate it. We start in some state-action pair — that's the black decision node — so we're asking a question about a particular state S and action A. We sample the environment to see what reward R we get and what state S′ we end up in, and then we sample our own policy at that next state to get A′. So we have S, A, R, S′, A′: the SARSA algorithm. SARSA denotes a particular update pattern based on those five quantities: we take our Q-value and move it a little bit in the direction of our TD target — the reward, plus the discounted Q-value of the next state-action pair, minus the Q-value of where we started. In other words: I was in some state, considering some action, and to estimate its value under my policy I take the reward I actually got plus the value of the next action I would take; I then use that estimate to update the value of the state-action pair I started in. That's the general idea, and it comes from the Bellman equation for Q — hopefully these kinds of ideas are becoming familiar.

Now we take these SARSA updates and plug them into our generalised policy iteration framework, moving up and down the diagram every single time step. Each time we take a step, we update the value function by applying one step of SARSA to the state-action pair we were just in: a single entry of the table gets updated. And the value function has already changed — something is different — so if we end up back in that state-action pair again, which we might in some loopy environment, we want to make sure we use the latest, most up-to-date information from our policy evaluation. So every single time step we also improve the policy, by acting epsilon-greedily with respect to the latest value function. What does that mean in practice? You just store Q in memory — a big table of Q-values for all states and actions — and every step, when you come to take an action, you flip a coin: one way you look at your Q-values and pick the best one, the other way you explore randomly. The policy is implicitly represented by your Q-values, and every single step we evaluate the latest action taken, updating just that one state-action pair, and improve the policy. So we have a very rapid cycle of evaluating with SARSA and improving the policy.
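Written out, the one-step SARSA update just described is

$$
Q(S, A) \;\leftarrow\; Q(S, A) + \alpha \big( R + \gamma\, Q(S', A') - Q(S, A) \big),
$$

where α is the step size and γ the discount factor.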
Let's get to know this a bit better by stepping through some pseudocode; it's really straightforward. You arbitrarily initialise a lookup table Q — in the next lecture we'll see how to generalise beyond these naive lookup-table representations, but for now every state and action has its own entry, and we can initialise them arbitrarily, say to zero. Then we just repeat: the outer loop is over episodes, but really the inner loop is every single step. Each step we take an action, observe the reward and the next state, choose the next action using our epsilon-greedy policy, and apply the SARSA update to that one step — move the Q-value a little bit towards the TD target, the reward plus the discounted value of the next state-action pair. Then repeat. The next time around we'll already have a different policy, because we're acting epsilon-greedily with respect to the updated Q-values: if I end up back in that same state again, for example in a loop, I'll already behave differently, using the latest and freshest information stored in Q.

Question: is A′ chosen according to the same policy? Yes — A′ is selected using your current policy, and then you remember it: your last A′ becomes your A at the next step. Another question: why does this even make sense as an update, given there's no reward coming from that last action yet? Let me try to give some intuition, and I think the right intuition comes from the Bellman equation: the target on the right-hand side, the thing we move towards, is just a sample of the Bellman equation. What we want is an expectation over everything that happens in the environment over one step — the reward plus whichever state you end up in next — and then, from that point on, the expectation under our own policy, which is precisely what Q is trying to estimate. That's why we have to select actions using the policy we're trying to evaluate: SARSA is fundamentally an on-policy algorithm — we select actions with a policy, and we evaluate that same policy.
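A minimal tabular sketch of that loop (again, the `env` interface and `n_actions` are assumed placeholders for illustration, not anything from the lecture):

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    q = defaultdict(float)                       # tabular Q(s, a), default 0

    def policy(state):                           # epsilon-greedy w.r.t. current Q
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: q[(state, a)])

    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = policy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = policy(next_state)     # A' chosen on-policy
            target = reward + (0.0 if done else gamma * q[(next_state, next_action)])
            q[(state, action)] += alpha * (target - q[(state, action)])
            state, action = next_state, next_action   # A' becomes the next A
    return q
```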
So, does this work? Just like GLIE Monte Carlo control, this version of SARSA converges to the optimal action-value function. The requirements are, first, that the policies are GLIE — again, you could use epsilon-greedy with a decaying schedule, so that you keep exploring everything but eventually end up greedy — and second, that you're careful about the step sizes. If you really want convergence you have to look at stochastic approximation theory, and the step sizes must satisfy two conditions. The first says the step sizes are large enough that you can move your Q-value as far as you need — you might have to move it from your initial estimate to some very large or very small value if you were wrong — which means the sum of step sizes diverges, Σ_t α_t = ∞. The second says the changes to your Q-values must eventually become smaller and smaller and vanish — otherwise noise keeps jumping your policy around — which means the sum of squared step sizes is finite, Σ_t α_t² < ∞. That's stochastic approximation theory. If you do all the machinery correctly this really does converge; in practice we often don't worry about either condition, and SARSA typically works anyway — that's an empirical result — but that's the theory.

Let's have an example, and this example does precisely that: it throws out the theory and just uses a fixed step size and a fixed exploration rate. This is the windy gridworld. We're trying to wander from the start point to the goal, and we can take king's moves — the diagonal moves as well as north, east, south and west. The only special thing about this gridworld is that every time we take a step, the wind pushes the agent upwards by some number of grid cells, indicated by the number at the bottom of each column: no wind over the first three columns, then you get pushed up one step, two steps in the middle columns, then one, then zero. The optimal behaviour has to take account of the fact that you want to make as much progress as you can while there's no wind; I think you end up taking something like the route on the next slide, where you overshoot above the goal and have to come back around to reach it. [Student: wouldn't you just go down and be blown up into the goal?] That's not optimal: if you try to go down, you get blown up one, then one again, then one again, then two, and you end up overshooting the goal. So you can't hit the goal that way; you're better off getting pushed up against the ceiling, coming back around from the ceiling, and dropping back onto the goal. [Student: but with diagonal moves, in the columns where you're blown up one you could go diagonally down-right and travel horizontally.] You're right — maybe this is a version that just uses the standard four moves; I'd have to check, and if the wind pushed you in a diagonal direction it would be two cells in that case. So here is the resulting policy. This plot shows what happens when we run the naive version of SARSA: it's the policy it finds, which I claim is the optimal policy — you can check that offline — and this plot here shows the learning curve for running SARSA, exactly the algorithm we just looked at.
What it shows along this axis is literally time, step by step, and on the other axis how many episodes have been completed within that many time steps. You can see that the first episode took about 2,000 time steps to complete: the agent starts with no knowledge of how to do this, so it's basically doing a random walk — and a random walk biased upwards by the wind, so not even as good as a random walk — and it takes a long time to stumble onto the goal. But when it does — and this is really typical of reinforcement learning and of these generalised policy iteration algorithms — the very next episode is much, much faster, and you get this exponential increase: once it learns some knowledge it can bootstrap from that knowledge and do better and better. Up here you can essentially look at the gradient of the curve, and that gradient tells you the path length it's taking — I think it's something like 14 — which is the ratio between these axes under the optimal policy. So most of the time it's behaving optimally, but because we're not decaying ε it will never be completely optimal; it still throws in the odd exploratory action, which is why the curve has this slightly bumpy look.

OK, so now we take the step we took in the last class: we consider the spectrum between Monte Carlo and TD learning, using eligibility traces and the λ variants of these algorithms, to see if we can again get the best of both worlds — the unbiased behaviour of Monte Carlo, or at least a knob that controls the bias-variance trade-off. The way we do that is to start by considering n-step returns again. The one-step return looks at one step of real reward and then our Q-value after one step. But we could also use the two-step return — one step of reward, a second step of reward, and then our Q-value after two steps — or do that for any number of steps, all the way out to infinity, which is essentially Monte Carlo: reward after one step, after two steps, and so on until the end of the episode, never bootstrapping from the value function. So we define the n-step Q-return as the generalisation of this to any n, and n-step SARSA does the obvious thing: for each state-action pair, instead of updating towards the one-step SARSA estimate, we use the n-step return as our target. We take our estimate of the value of taking action a in state s and update it a little bit in the direction of that n-step target — the same incremental, non-stationary update as before. The n-step target essentially says: look at one, two, three steps of real reward, and then bootstrap from the Q-value at the state-action pair you reach.
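In symbols (matching the n-step-return notation from the previous lecture), the n-step Q-return and the n-step SARSA update are

$$
q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{\,n-1} R_{t+n} + \gamma^{\,n}\, Q(S_{t+n}, A_{t+n}),
$$

$$
Q(S_t, A_t) \;\leftarrow\; Q(S_t, A_t) + \alpha \big( q_t^{(n)} - Q(S_t, A_t) \big).
$$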
That Q-value represents the value of the remainder of the trajectory, so the n-step target is still an estimate of the overall value of the trajectory, and we use it to update the value of the state I was in and the action I took at the starting point. We can do this for any n, and again, just like in the last lecture, we'd like algorithms that are robust to the choice of n, that average over many different n — algorithms that get the best of all possible n.

[Student: just as a reminder, the Q function represents the estimate of the whole trajectory from s taking action a, right? Not just one step?] Yes, that's right, and let me be clear about it: all value functions, including V and Q, are long-term estimates of the amount of reward you get. The definition of Q is the expected total reward you receive if you start in state S_t, take action A_t, and then follow your policy for all subsequent actions. Even when we talk about these n-step returns, that's just our target; it's still an estimate of the complete return — we estimate it by n steps of immediate reward plus something that summarises the remaining reward all the way out to infinity, or to the end of the episode, all discounted appropriately. Are people good with that?

Right, so let's make a λ version of this algorithm. It's called SARSA(λ), a very well-known algorithm. First consider just the update, before we talk about the full algorithm. The idea is to average over all of our n-step returns. Starting in a state-action pair, the environment pushes me to a state, I take an action, and I estimate my Q-value there: we weight that one-step return by (1 − λ). We also ask what happens if, from the same starting pair, the environment picks a state, I pick an action, the environment picks another state, I pick another action, and I add up the rewards along the way plus the Q-value at the end: we weight that two-step return by (1 − λ)λ. Each additional step multiplies in another factor of λ, and the (1 − λ) out front just normalises the whole sum to one. So we get a weighted sum of all our n-step returns in which the shorter ones count for more, each extra step being progressively discounted by another factor of λ. The main idea, then, is to define this average return — weighting each n-step return by λ^{n−1}, with the normalising factor in front — as a computationally convenient way to average over all n-step returns that takes account of every n. It has this λ parameter we can tweak, which makes us more or less far-sighted about how much weight goes on the long returns versus the short ones.
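Concretely, the λ-return for action values, and the forward-view SARSA(λ) update it feeds into (described next), are

$$
q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, q_t^{(n)},
$$

$$
Q(S_t, A_t) \;\leftarrow\; Q(S_t, A_t) + \alpha \big( q_t^{\lambda} - Q(S_t, A_t) \big),
$$

where the weights (1 − λ)λ^{n−1} sum to one, as described above.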
This gives us something we can plug straight into our update. I start in a state-action pair and want to know how good it is to take that action, so I build a weighted sum that takes account of my one-step estimate — reward plus Q from the next state — my two-step estimate — reward, reward again, then Q — and so on, all the way up to n = ∞. That weighted sum is my q^λ target, and I update my original Q-value a little bit in the direction of it. That's the idea of SARSA(λ); we'll see an illustration in two slides. Before that, one slide on the fact that this is the forward view. If you remember from the last lecture, there was a forward view and a backward view, and with eligibility traces we could construct the backward view. So what's the problem with the approach so far? It's great in one sense: we have something that spans the spectrum between Monte Carlo and TD algorithms — a variant of SARSA that turns into Monte Carlo learning if we set λ to one, and into the familiar one-step SARSA if we set λ to zero — so we have a knob that controls the bias-variance trade-off and averages over everything in between. The only problem is that we're looking forward in time: this isn't an online algorithm. It's not something where we can take one step, update our Q-value, and immediately improve our policy, because we have to wait until the end of the episode to compute q^λ. We would like to run things online, get the freshest possible updates immediately, and make an adjustment every single step — we don't want to wait until the end of the episode, and there might not even be an end; this might go on forever. So, just like in the last class, we're going to build an equivalent algorithm using eligibility traces, such that when we sum over those traces we end up with an update which is exactly equivalent to this one. The idea is very similar to the last class, except the eligibility trace is now over state-action pairs: we have a new table, the eligibility for each state and action. Think of it as the thing that tells you how much credit or blame to assign to every action you took from every state. You get to the end of your episode and receive a big carrot; which of the states and actions were responsible for that carrot? Your eligibility trace is your best estimate of who was responsible. And the heuristic, again, is that the states and actions you took most recently before receiving the carrot, and the ones you took most frequently along the trajectory, are the ones that should get the most credit or blame.
Credit for a positive reward at the end of the trajectory, blame for a negative one. The way we do that online is as follows. Every time we actually visit a state-action pair, we increment its trace: there's an indicator that says, if I'm really in state s and really take action a, bump that trace up by one. And every step, for all state-action pairs — even the ones we don't visit — we decay the traces a little. So you get this process where, if I keep coming back to a particular state-action pair, its eligibility gets bumped up each time and decays in between, and whenever I receive my carrot, whatever the current eligibility is, that's how much credit I give that pair. Then, once we have these eligibility traces, we update our Q-values for every state and every action in proportion to the TD error — the surprise, the event of getting the carrot, or of getting more or less reward than expected — multiplied by the credit or blame assigned by the eligibility trace. So the algorithm says: update the Q-values in the direction of the TD error, the difference between what I thought would happen and what actually happened, scaled by this credit-assignment term, the eligibility trace.

Question: is there a way to apply this to very large state spaces, where you effectively never visit the same state twice and you're using function approximation for your value functions? So the question is whether we can do this when you can't store the table. Next class: we'll generalise all of the algorithms we've seen so far to function approximation, and the answer is yes — you can and should do that, because table lookup is naive and you can't solve large problems with it. Until next class, though, it's a table.

So SARSA(λ) looks something like this. We initialise the eligibility traces to zero at the start of each episode — we can't blame anything yet, we haven't seen anything. Then, for each step of the episode, we take our action using our policy (again we're on-policy here, picking actions for example epsilon-greedily), and we compute our TD error: the difference between the reward plus the discounted Q-value of the state and action we ended up in, and our previous estimate — the one-step error between what I thought the value was going to be and what I now think it is. We increment the eligibility trace of the state-action pair we actually visited, and we decay the traces for all states and actions. Then we update every state-action pair, not just the one we visited at this step — anything might now deserve credit or blame for what happened — each in proportion to the TD error and its own eligibility trace. And we iterate, every single step.
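A compact sketch of that backward-view loop (same assumed `env` interface and placeholder names as the earlier sketches, with `lam` standing in for λ):

```python
import random
from collections import defaultdict

def sarsa_lambda(env, n_actions, n_episodes, alpha=0.1, gamma=1.0,
                 lam=0.9, epsilon=0.1):
    q = defaultdict(float)                        # tabular Q(s, a)

    def policy(state):                            # epsilon-greedy w.r.t. current Q
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: q[(state, a)])

    for _ in range(n_episodes):
        e = defaultdict(float)                    # eligibility traces, reset each episode
        state, done = env.reset(), False
        action = policy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = policy(next_state)      # on-policy choice of A'
            target = reward + (0.0 if done else gamma * q[(next_state, next_action)])
            delta = target - q[(state, action)]   # one-step TD error
            e[(state, action)] += 1.0             # bump the trace we actually visited
            for sa in list(e):                    # update every traced pair, then decay
                q[sa] += alpha * delta * e[sa]
                e[sa] *= gamma * lam
            state, action = next_state, next_action
    return q
```

Updating only the pairs with non-zero traces is equivalent to updating all state-action pairs, since the untraced ones would receive a zero increment anyway.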
So what does that look like? I think this example helps give a flavour of the difference between SARSA(λ) and standard SARSA. Imagine a gridworld where you're trying to get from a start state to this goal state — it could be the windy gridworld we just looked at — and to begin with you do some random walk and eventually end up there. Say you initialise all your state-action values to zero, and we indicate the value of each state-action pair by an arrow pointing in the direction of that action, with the size of the arrow indicating how big the Q-value is. What would the updates look like with one-step SARSA? You get to the end, having started off thinking the value function was zero everywhere, and — aha — you actually got a reward of one when you reached the goal: surprise. Now you need to make some updates to adjust the values along your trajectory towards what actually happened. But in one-step SARSA, at every step apart from the final one nothing changes: here I thought the value was zero, and after the step I still thought the value was zero, because all the Q-values were initialised to zero, so the error is zero. The only change happens at the very final step, where you thought going north was worth zero and then received your carrot. That one entry gets a nice positive increment: you increase the value of going north from that state and conclude it was a good action. If you then run another trajectory that comes back through that same action, the information propagates back one further step: from the state before, you thought going west was worth zero, but now you end up somewhere you estimate going north to be valuable, so the value of going west from there increases by some large amount. At the next episode it propagates back another step, and so on — you need a lot of episodes, because information only propagates back one step per episode in SARSA(0). In SARSA(λ) it looks quite different: the λ parameter determines how far the information propagates back through your trajectory. Consider the same trajectory: you build up eligibility all along it, so every state-action pair you visited has its trace increased; the earliest ones will have decayed quite a bit towards zero, the more recent ones less. And when you see the reward of one at the end, you increase all of those state-action pairs in proportion to their eligibility traces.
So that means, when you see this surprising positive outcome, all of them get updated in the direction of what actually happened: the information flows backwards, in a single episode, most of the way to the beginning of the trajectory. You get a much faster backward flow of information through time by choosing this λ parameter. The idea of λ, you could say, is that it defeats the tyranny of the time step: with very many, very fine time steps, SARSA(0) always needs more and more steps to flow the information back, whereas with λ you can pick a value so that information flows back at the right rate regardless of how finely you've discretised time. Any questions before we move on to the final section?

[Student: why do the earlier arrows shrink — is the trace decayed?] If we go back to the updates in the algorithm: the eligibility trace is incremented at the states and actions you visit, but it's also decayed every time step, by a factor of λ and also by your discount factor γ. That's why the λ parameter sets how far back in time you look: it's the decay rate that determines how steeply the eligibility falls off as you go back along the trajectory. If you set λ to one, those arrows would all be of equal magnitude — the updates all along the trajectory would be equally large — and that's what happens in Monte Carlo: you run your trajectory, you see the eventual reward, and every state-action pair along the way gets updated towards it, because each one says, I took this action and something good eventually happened, so I should be updated. By tuning λ you determine how far back along the trajectory you look, and that controls your bias-variance trade-off: the further back you look, the more variance you pick up, because of all the random things that happen along the trajectory, but the less bias you have — you're less affected by bootstrapping.

[Student: I thought the point of the backward view was that you don't need to wait until the end of the episode, but here nothing happens until the end.] Good question. We really are running an update every single step; it's just that the reward only occurs at the end of this episode, so you gain no information until you reach it. These arrows show the effective update you've accumulated by the time you get to the end. But imagine you pass through this reward and then carry on, and later get another reward, and another. With SARSA(λ), nothing would happen, nothing would happen, nothing would happen, until you hit the first reward — and then immediately all of the traced state-action pairs would be updated, even though you haven't finished the episode. You'd carry on, and carry on, until you hit the next informative event, then everything would be updated again because of that event, and you'd carry on.
So the fact that nothing happens early on isn't because we had to wait until the end of the episode; it's because we genuinely gained no information until that point — up to then we thought the value was zero everywhere and we kept getting zero. We are making updates at every single step; those updates just contain zero information until the first informative event. But you're right that this particular example doesn't make the online property of the algorithm very clear.

There was one more question, about whether we take the mean of the action values over many episodes. In SARSA we never take an explicit mean; we just use these incremental updates. The updates are precisely the ones in the algorithm: if we see some reward where we thought there wasn't going to be one, that generates an error, and we update every value a little bit in the direction of that TD error — the unexpected reward — multiplied by the trace, the credit we're assigning to each state and action. That is the process by which we estimate the expected value; there's no additional step of taking a mean. It's an incremental way of estimating the expectation, and moreover it's a non-stationary estimate. That's a point I should make more explicitly: in these updates we use a fixed step size α. Could you instead form a true mean, the way we did in Monte Carlo, by counting the visits to each state-action pair and using a step size of 1/N? That would give you an incremental mean — but the thing you're averaging is changing all the time, because you're bootstrapping. So it's not really the same as forming a true mean; you're forming a kind of non-stationary mean over all the different targets seen at different times. I think the right answer to the question is that we are estimating the expected future reward all the way to the end of the episode. One way to estimate that is Monte Carlo, which explicitly forms a mean; another mechanism is bootstrapping, where instead of computing a mean we update a little in the direction of the errors we see — we look at the error and use it to correct what we thought the previous value should have been. That's how we estimate the expectation. And again, the equivalence between the backward and forward views is not obvious, but it can be proved — there was a proof in the last lecture, and it carries over to the control case, roughly. A useful way to understand what's going on is that what we're really doing with these traces is implementing the forward-view algorithm: an incremental implementation of an update that moves our Q-values a little bit in a direction that mixes over our n-step estimates at all the different time scales, bootstrapping from all the different n-step returns.
So if we go back to this view here, another way to look at it is: what would the forward view look like? Forget eligibility traces. In the forward view you start in some state-action pair, and you really do wait until the end of the episode. At the end of the episode, from each of these states we consider the one-step return, which gives you no information, the two-step return, which gives you no information, the three-step return, which gives you no information, and the only returns that give you any information are the longer ones that reach all the way to the reward or beyond. The weight that we put on that final, largest-n return is precisely the amount by which we would update that state. So if you're very close to the end, all of your weight goes on your final return, because the episode stops there; whereas if you're way back at the beginning, a lot of your weight goes on the shorter returns that don't contain any information yet. That gives you the same effective result: you can see, just intuitively from these diagrams, that you get the same picture from either the forward-view computation or the backward-view computation. But the backward view is computationally much nicer: you do it online, you keep one step in memory, you do the update and move on. You don't need to look forward in time or wait until the end of your episodes, and it works in situations where you don't even have episodes.

[Question: can I ask about storing the eligibility trace in the computer? Is it a kind of matrix with two dimensions, one the number of states and the other the number of actions?] Yes. In this lecture we're building tables for everything: Q, which is a matrix of size number of states times number of actions, and the eligibility trace, and the count N(s, a) we used when we were doing Monte Carlo. All of these quantities are tables of size number of states times number of actions, and you can store each of them as a matrix if that's a helpful way to think of it. In the next class we're going to look at ways to approximate that potentially very large matrix with an arbitrary function approximator that has a smaller number of parameters, both for Q and for the eligibility traces.
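To make that answer concrete, here is a minimal sketch of how those tabular quantities might be laid out in memory; the sizes are placeholders, not anything from the lecture.

    import numpy as np

    n_states, n_actions = 100, 4  # placeholder sizes for a small tabular problem

    Q = np.zeros((n_states, n_actions))  # action-value table, one entry per (s, a)
    E = np.zeros((n_states, n_actions))  # eligibility traces, same shape as Q
    N = np.zeros((n_states, n_actions))  # visit counts N(s, a), e.g. for a GLIE schedule

    # Traces are typically reset at the start of each episode; Q and N persist.
    E.fill(0.0)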
Okay, let's change gear a bit. Now we're going to talk about off-policy learning. Everything so far has considered on-policy learning, where the policy I'm following is the policy I'm learning about, but there are many cases where we want to evaluate some other policy while following our own. So let's say we follow some behavior policy, which we'll call mu: this is the policy we actually pick actions from in the environment. What we want to do is evaluate some other policy pi, so as to compute, say, v_pi or q_pi, or ultimately to figure out the optimal way to behave in the environment, even though the behavior we observe is not drawn from the policy we're asking about. So why would we want to do that? Here are four possible reasons you might care about this a lot.

The first reason is that you might want to learn from observing other agents, such as humans. You might want to see how humans behave, and not just do supervised learning to copy what the human did, but look at the traces of behavior, the episodes of behavior the human executed, and from that experience alone figure out how to do better, how to solve the MDP. Just from watching some human steer a boat around, you could look at what they did and say: I know how you could have done better, you could have got more reward by following this other path. That's the goal: observe some other policy, learn from one behavior what the value of another behavior would be, and maybe figure out how to improve on it.

A second, more specific case, which we used for example in the Atari work I mentioned in the first lecture, is to re-use experience from old policies. Imagine we're doing generalized policy iteration, so we move through a whole sequence of policies: we start with some initial random policy pi_1, make it a little better to get pi_2, and eventually end up with better and better policies. What if we want to go back over the data generated by those old policies and use it more than once, instead of discarding it, which is very data-inefficient? We'd like to look again and say: I've seen this state before, when I was last here I took this other action and got this other reward. Re-use that data, do something more like a batch method, where you treat it all as a big data set and figure out the best thing to do from that data set. The problem is that in policy iteration we're trying to evaluate our current policy, we want to know how good it is, but the behavior was generated by all kinds of different policies. Is it possible to use that additional data to get a better estimate of the value of the current policy? The answer is yes, if you use off-policy learning: you can treat those old policies as other behaviors that generated data, just like the human, and figure out from all those other sources of experience how to do better.

Probably the best-known example of where off-policy learning is used is the third point. There is a big issue in reinforcement learning of making sure you explore the state space effectively, while at the same time we want to learn about the optimal behavior, which doesn't explore at all. So why not use off-policy learning: we can have an arbitrary exploratory behavior policy, which explores around with as much exploration as we want, and at the same time learn about the optimal policy. That requires us to be off-policy: we want to learn how to behave optimally, something which is ultimately going to be deterministic, whilst following something that may be completely random. That's another motivation for off-policy learning.
And perhaps, looking forward to grander visions, the most exciting reason to care about off-policy learning is that we might want to learn about multiple policies while following just one. There might be many different behaviors we want to figure out. I might want to know what would happen if I left this room: how much reward would I get, would you be pleased or disappointed? Maybe I should figure that out whilst carrying on talking to you. But I might also be wondering what would happen if I changed topic and went back three slides. There are all kinds of questions I might ask about the future that are conditioned on following different policies, and I want to learn about all of them from the one stream of data that we call life, answering all kinds of different conditional questions whilst following that single stream of experience. So these are all reasons to care about off-policy learning. It's a significant problem, and it's actually a thorny problem for reinforcement learning, as we'll see next class. We're going to look at two mechanisms for dealing with it, the first of which is importance sampling; I've spent a lot of time advertising it, so let me talk about it.

The main idea of importance sampling is to take an expectation, such as your expected future reward, and note that an expectation is just a sum over probabilities times, say, how much reward you got. What we can do is multiply and divide by some other distribution, and the resulting ratio lets us rewrite this as an expectation over that other distribution of the same quantity, where all we had to do was divide out the old probability. You multiply and divide by the ratio between your two distributions, and that corrects for the change of distribution.

We can apply importance sampling to Monte Carlo learning by importance sampling along the entire trajectory. When we importance sample in Monte Carlo, we have to multiply these importance-sampling ratios across every single step: at each step there was some action I took according to my behavior policy, and some probability that the same action would have been taken under the policy I'm actually trying to learn about, and you have to multiply all of those ratios together. In general it becomes vanishingly unlikely that the return you saw under your behavior policy tells you anything at all about what would have happened under a completely different policy. So you can use this idea, but it is extremely high variance, and in practice it is essentially useless: Monte Carlo learning is a really bad idea off-policy, because over many steps your target policy and your behavior policy just never match well enough to be useful.
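Written out, the importance-sampling identity and its application to the Monte Carlo return look like this, with mu the behavior policy and pi the target policy; these are the standard expressions for importance-sampled Monte Carlo:

    \mathbb{E}_{X \sim P}\big[f(X)\big] \;=\; \sum_x P(x)\, f(x) \;=\; \sum_x Q(x)\, \frac{P(x)}{Q(x)}\, f(x) \;=\; \mathbb{E}_{X \sim Q}\!\left[\frac{P(X)}{Q(X)}\, f(X)\right]

    G_t^{\pi/\mu} \;=\; \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}\, \frac{\pi(A_{t+1} \mid S_{t+1})}{\mu(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_T \mid S_T)}{\mu(A_T \mid S_T)}\, G_t

    V(S_t) \;\leftarrow\; V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right)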
So when you're working off-policy it becomes imperative to bootstrap: you have to use TD learning. The simplest idea is to use TD learning and still importance sample, but now you only need to importance sample over one step, because you are bootstrapping after one step. All you have to do is update your value function a little bit in the direction of your TD target, and the TD target, which describes what happened over one step, is corrected for the change of distribution over that single step. Over that one step, my behavior policy generated some action, and there is some probability that I would have taken that same action under the target policy I'm trying to learn about; we take the ratio between those two things and multiply the target by it. So we weigh how much we trust this target by how much our target policy matches what was actually taken in the environment: if they don't match at all we essentially reject the target, and if the target policy was much more likely to take that action we take large account of it. So you can still get high variance — importance sampling still increases the variance, and it can still blow up — but it is much, much better than Monte Carlo.

Now, the idea that works best for off-policy learning is known as Q-learning. This is specific to TD(0), to SARSA(0) for example, and the idea is to make use of our Q values, our action values, to do off-policy learning in a particularly efficient way that does not require importance sampling at all. The idea is very simple: we select our next action using our behavior policy, but we also consider some alternative successor action A' that we might have taken had we been following our target policy. So A_{t+1} is the random variable representing the action we really took in the world at time step t+1, and A' is the alternative, imagined action: if I had acted under my target policy, what might the world have looked like? Now all we do is update the Q value of the state we started in and the action we actually took, towards the value of the alternative action. When we bootstrap, we bootstrap from the value of the alternative action, because that is the thing that tells us how much value we would actually have got under the target policy. So we replace the SARSA update: the Q value of the state I was in and the action I actually took is updated a little bit in the direction of the reward I actually saw, plus the discounted value, at the next state, of my alternative action under my target policy. That is what would have happened if I had really been following the policy I care about, so this is a valid update: it is the Bellman equation again, and the Bellman expectation equation for Q does not require us to importance sample, because we are sampling that Bellman equation directly. That's the idea of Q-learning.
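A minimal sketch of that update in code, under the assumption of a tabular Q and a target_policy function that picks the action pi would choose in the next state; the names here are illustrative rather than the lecture's own:

    import numpy as np

    def off_policy_q_step(Q, s, a, r, s_next, target_policy, alpha, gamma):
        """Update Q(s, a) towards the value of the action the TARGET policy would
        take in s_next, even though the behavior policy actually chose something else."""
        a_alt = target_policy(Q, s_next)           # alternative successor action A'
        td_target = r + gamma * Q[s_next, a_alt]   # bootstrap from the target policy's choice
        Q[s, a] += alpha * (td_target - Q[s, a])
        return Q

Here target_policy could be any policy; the special case discussed next, where it is the greedy policy with respect to Q, gives the familiar Q-learning algorithm.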
[Question: in that last equation, which parts come from the behavior you stored and which are online?] Nothing is stored here. You're sampling A_{t+1} from your behavior policy, and you're sampling A' from your target policy. Each time you reach a state you consider the action you actually took, which generates the real behavior, the trajectory you're actually following, but at every step you also consider some alternative action you might have taken, and that alternative is the thing you bootstrap from: you bootstrap from that Q value.

Now there is a special case of this, which is the well-known Q-learning algorithm. If you hear about "the Q-learning algorithm", it is essentially what's on this slide. It is the special case where the target policy is a greedy policy: we are trying to learn about greedy behavior while following some exploratory behavior. The special thing about Q-learning is that both the behavior and the target policies can improve; we allow improvement steps for both of them. At every step we make the target policy greedy with respect to our value function, so we are asking about this deterministic thing that acts as aggressively as possible: we are learning, aggressively, about what we currently consider to be the best behavior given our Q values. The behavior policy, meanwhile, is epsilon-greedy with respect to the same Q values: it roughly follows sensible actions that get us to good parts of the state space, but it explores a little as well to make sure we reach new things. When we plug this into the previous slide, the target we had — the reward plus the discounted Q value of our alternative action — becomes the Q value of the argmax, the greedy action, and the Q value of the argmax of your Q values is just the max of your Q values. So what Q-learning does is update a little bit in the direction of the maximum Q value you could take from the next state. I call these SARSA-max updates, because you are using S, A, R, S' and then the max over your A' values: we are maxing over the successor actions and updating our Q values a little bit in the direction of the best possible Q value after one step. That should look familiar from the Bellman optimality equation — it is essentially a sampled Bellman optimality backup — and this algorithm again converges to the optimal action-value function. It's a standard, central idea in reinforcement learning.
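Putting the greedy target policy and the epsilon-greedy behavior policy together gives the familiar tabular Q-learning update. This is a minimal sketch, not the lecture's pseudocode; the epsilon-greedy helper and the random-number interface are assumptions:

    import numpy as np

    def epsilon_greedy(Q, s, epsilon, rng):
        """Behavior policy: greedy with respect to Q, exploring with probability epsilon."""
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
        """SARSA-max update: bootstrap from the best action in the next state."""
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        return Q

    # Typical use inside an episode loop (environment interface assumed):
    # a = epsilon_greedy(Q, s, epsilon, rng)
    # s_next, r = env.step(a)
    # Q = q_learning_step(Q, s, a, r, s_next, alpha, gamma)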
So, just to wrap up, I want to bring out the relationships between the full-width updates we had in dynamic programming and what goes on in temporal-difference learning. In one column we have the dynamic programming algorithms we looked at in a previous lecture; in the other column are the sample-based algorithms, the TD algorithms we've looked at in the last lecture and in this one; and the rows are indexed by which Bellman equation you use. If you use the Bellman expectation equation for the state-value function v_pi, you can either use dynamic programming to evaluate your current policy, or you can take just one sample of that backup diagram, by sampling from the environment and from your policy, and that gives you TD learning. The TD algorithms are just samples of the dynamic programming backups: wherever there is an expectation, you sample it, and that sample gives you your target, so the targets for TD learning are samples of the right-hand side of the Bellman equation. If you use the Bellman expectation equation for q_pi, you can use it to evaluate your Q values and plug that into generalized policy iteration to make your policies better and better: doing that with dynamic programming gives you policy iteration, and doing it with samples gives you the SARSA algorithm. Finally, if you use the Bellman optimality equation for q*, you can either plug it into value iteration, which is the dynamic programming method, or sample the expectations, which gives you the Q-learning algorithm. So TD methods can be thought of as samples of the Bellman expectation equations or of the Bellman optimality equations: they take a one-step sample of what dynamic programming would do. That's just to tie things together. Okay, thanks everyone. Next week we'll talk about function approximation: how to scale up and make these methods work in larger, more practical problems. Cheers, guys.
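For reference, the sampled targets from that final comparison, set against the Bellman equations they come from; these are the standard one-step targets discussed above:

    \text{Bellman expectation for } v_\pi: \quad R_{t+1} + \gamma\, V(S_{t+1}) \quad \text{(TD learning)}

    \text{Bellman expectation for } q_\pi: \quad R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) \quad \text{(SARSA)}

    \text{Bellman optimality for } q_*: \quad R_{t+1} + \gamma\, \max_{a'} Q(S_{t+1}, a') \quad \text{(Q-learning)}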