The Deterministic Policy Gradient Theorem: A Simple and Intuitive Approach to Actor-Critic Algorithms
We're never adding any noise in, and it has this particularly nice form that tells us all we need to do to follow the gradient, to adjust the parameters of this policy (this should have a u there as well) so as to get more objective: all we need to do is take the gradient of our own Q function. So we look at our critic. The critic gives us the gradient that says: this is the way to get better actions. This only works for continuous action spaces, by the way, but it basically says: here's the gradient of your value function with respect to your actions. If you were just to make this action here a little bit higher you would get more reward; if you were to make this other action a little bit lower you'd get less reward. So the critic already contains all the information about how to adjust the actions so as to get more or less reward.
So as to get more reward: just move left a little bit more and you'll get a lot more reward. That's this gradient of Q. Then all you need to do is work out how to adjust your parameters so as to get more of the actions which the critic is suggesting. And that's the deterministic policy gradient theorem. It's very simple and intuitive, and in practice it seems to work a lot better than the stochastic policy gradient in situations where we've got continuous action spaces; it scales up much better to high dimensions.
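Written out (a sketch in the lecture's notation, assuming a deterministic policy a = μ_u(s) with parameters u and a critic Q^w), the deterministic policy gradient is just the chain rule through the critic:

```latex
% Deterministic policy gradient (sketch, not the slide verbatim):
% \nabla_a Q^w says how to change the action to get more value;
% \nabla_u \mu_u(s) says how to change the parameters to get more of that action.
\nabla_u J(u)
  = \mathbb{E}_{s}\!\left[\,
      \nabla_u \mu_u(s)\;
      \nabla_a Q^w(s,a)\big|_{a=\mu_u(s)}
    \right]
```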
Okay, I'm going to skip the natural actor-critic, so there's one more approach which is in the notes, which is the natural actor-critic. The natural actor-critic is an alternative descent direction, or ascent direction, which has some nice properties, and it turns out it falls out very nicely in the actor-critic framework. It's there in the notes; feel free to refer to it.
But I wanted to put up this summary slide before we finish, and this is a summary of all the different manipulations that we've done to our different actor-critic algorithms. We started off with basically our policy gradient theorem, and what we did was plug in different approximations to this thing.
We said okay, the gradient of our policy: we can get it by taking the score function and multiplying it by the return. We can multiply it by the value function. We can multiply it by the advantage function. We can multiply it by the TD error; this was another way to estimate the advantage function.
We take a sample of the advantage using the TD error. We can use an eligibility trace, where instead of just using the current score we accumulate an eligibility over all of our scores over many time steps, and we use that eligibility trace instead of just the one current score. And we can use the deterministic actor-critic, where we basically move in a direction that gets us more Q. All of these are essentially different variants of the same idea; they're all different manipulations of trying to say: move the policy in the direction that gets you more value.
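As a rough reconstruction of that summary slide (a sketch assuming the lecture's notation: actor parameters u, critic parameters w, TD error δ, eligibility trace e), every variant is the score multiplied by some estimate of how good the action was:

```latex
% Reconstruction of the family of policy gradient variants (sketch).
\begin{aligned}
\nabla_u J(u) &= \mathbb{E}\big[\nabla_u \log \pi_u(s,a)\, v_t\big]        &&\text{REINFORCE (Monte Carlo return)}\\
\nabla_u J(u) &= \mathbb{E}\big[\nabla_u \log \pi_u(s,a)\, Q^w(s,a)\big]   &&\text{Q actor-critic}\\
\nabla_u J(u) &= \mathbb{E}\big[\nabla_u \log \pi_u(s,a)\, A^w(s,a)\big]   &&\text{advantage actor-critic}\\
\nabla_u J(u) &= \mathbb{E}\big[\nabla_u \log \pi_u(s,a)\, \delta\big]     &&\text{TD actor-critic}\\
\nabla_u J(u) &= \mathbb{E}\big[\nabla_u \log \pi_u(s,a)\, \delta\, e\big] &&\text{TD($\lambda$) actor-critic}\\
\nabla_u J(u) &= \mathbb{E}\big[\nabla_u \mu_u(s)\, \nabla_a Q^w(s,a)\big] &&\text{deterministic actor-critic}
\end{aligned}
```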
And then the way that we estimate values in the critic is one thing that varies, and the way that we make use of this to update our policy is the other thing that varies; so the critic can vary and the actor can vary, but essentially they're all doing roughly the same thing. For each of these cases we can make a stochastic gradient ascent algorithm where essentially all we do is drop the expectation by sampling the thing which is inside.
And in each case you end up with a very straightforward, simple algorithm where all you do is take a step (or an episode, if you're doing the top one, the Monte Carlo version) and look at what happened in that one step only. We now have the score, the gradient of the log policy, for that step: we took an action from some state, and we know for that state how to have adjusted our policy so as to have taken that action more often.
And we look at whether that step was good or bad according to our critic; if it's good, we push our policy parameters to do that thing more often. So all of these have very simple stochastic updates. And for the critic we basically use our favorite algorithm: Monte Carlo, TD(0), TD(λ), the whole family of different approaches that we've considered in previous lectures.
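To show how simple one of these stochastic updates is, here is a minimal sketch of a one-step actor-critic with a TD(0) critic and a linear softmax actor. The feature map, step sizes and the way states are encoded are illustrative assumptions, not code from the lecture:

```python
import numpy as np

# Minimal sketch of a one-step actor-critic update (TD(0) critic, softmax actor)
# with linear function approximation. The feature map, step sizes and the state
# encoding are assumptions for illustration, not code from the lecture.

n_actions, n_features = 3, 8
u = np.zeros((n_actions, n_features))   # actor parameters (one row per action)
v = np.zeros(n_features)                # critic parameters (linear state values)
alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.99

def features(state):
    """Hypothetical feature map phi(s); replace with problem-specific features."""
    rng = np.random.default_rng(abs(hash(state)) % (2 ** 32))
    return rng.standard_normal(n_features)

def policy(state):
    """Softmax policy pi_u(a | s) over linear action preferences u[a] . phi(s)."""
    prefs = u @ features(state)
    prefs -= prefs.max()                 # for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def step_update(s, a, r, s_next, done):
    """Compute the critic and actor update directions from one observed transition."""
    phi, phi_next = features(s), features(s_next)

    # Critic: TD(0) error and update direction for the value weights.
    target = r + (0.0 if done else gamma * (v @ phi_next))
    delta = target - v @ phi
    dv = alpha_critic * delta * phi

    # Actor: score function (gradient of log pi) times the TD error.
    probs = policy(s)
    score = -np.outer(probs, phi)        # -pi(b|s) * phi(s) for every action b
    score[a] += phi                      # +phi(s) for the action actually taken
    du = alpha_actor * delta * score
    return dv, du

# Usage, once per environment step:
#   dv, du = step_update(s, a, r, s_next, done)
#   v += dv; u += du
```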
Okay, thanks everyone, see you next week.
"WEBVTTKind: captionsLanguage: enokay so a couple of announcements just before we begin sorry about that it's a very temperamental projector um so first of all um the final lecture lecture 10 um I sent an announcement to the mailing list but just in case you didn't catch that that will take place the morning of the 1st of April so this isn't an April Fool so if you thought this mail came to the list and you're thinking oh I don't have to wake up that day um there really is going to be a lecture um I appreciate that that is outside the time where classes normally occur um as a result um this is an optional lecture um so it's going to be a lecture on games case study on classic board games with reinforcement learning um I think it's interesting it'll pull together a lot of elements of the course um but it's optional so if you've already booked your time away and you can't attend then don't worry it's not going to affect your um your exam um but if you do have the chance to come along it should be interesting and um I'll also mention that all of these uh videos will be made available online very shortly um so if you do miss anything you can catch up online um second announcement is the deadline for the assignment um so that will be um the following day April 2nd so hopefully um you guys have made a start on that um if you do have questions um please send them to the to the mailing list because if you're wondering about something probably means 10 other people have the same question and so it's quite useful just to Ping those backwards and forwards on the mailing list to just make sure everything's Crystal okay um so let's get going so today's lecture is on on um policy gradient methods um this is a very popular widely used Class of reinforcement learning methods again we're in the part of the class where we talk about things which are actually um used in practice to solve interesting challenging reinforcement learning problems um so just to give a an outline of how this is going to pan out uh we're going to start off with an introduction where we'll understand roughly what's going on um and then we'll talk about uh we'll focus in on policy gradient methods so what's going on in this class is we're really talking about methods that optimize the policy directly so instead of working in value functions as we've considered so far this is now about working with the policy how can I see some experience and from that experience figure out how to change my policy in a direction that makes that policy better um and what we'll do is we'll see um a variety of different methods for doing this which are all based on the simplest principle if you like for improving a policy which is policy gradient so we adjust the policy in the direction that makes it better we follow the gradient we'll consider three different ways to do that we'll start off with um a sort of Na naive but sometimes effective numerical approach by using finite differences and then we'll move on to see that actually um for many of our reinforcement learning methods there is an analytic formulation for the policy gradient so despite um what you might think you can actually just have an agent which follows its trajectory sees how much reward it gets along the way but also knows which direction to adjust its policy so as to move it directly in the direction that makes its its policy better and finally um the most practical set of methods which we'll consider are the actor critic policy gradient methods so these are methods which combine 
everything from this class with everything from the previous class so they have um they work with both value functions and with policies and try to get the best of both worlds okay so introduction so last lecture um we had lecture where we talked about value function approximation we considered two different types of value function approximation we considered approximating the state value function so this is the true State value function that tells us how much reward um can I really get in this mdp in expectation if I follow this policy Pi so if I'm in some State s what's the true expected reward I'll get from that point onwards um or we had the the Q function which tells us for some State s and some action a what's the true cumula accumulated reward along some trajectory that I can get from that point onwards taking that action under this policy so those are the ground truths and what we did in the last lecture was we approximated those things using a function approximator like a linear combination of features or a neural network we tried to estimate this thing we had some parameterized function which had parameters w we basically tried to fit this thing using ideas which are familiar from supervised learning we Tred to fit a function to this ground Tru so we SED started to understand the shape of how much reward can I get from different parts of the state space and different parts of the action space respectively um but in and in that in that lecture so last lecture the policy which we then used when we actually wanted to do control um we didn't represent the policy explicitly instead we just used the value function to pick actions so once we got our our function approximator for Q that was enough to pick actions because we could just act greedily or perhaps Epsilon greedily um to pick the the action which gets the most Q um so that was a very effective approach to control using function approximation um but it's only one approach to function approximation and in some sense uh more direct or um sometimes more natural approach is to directly parameterize the policy not the value function so now we're can to have a different U function approximator we're going to have some parameters U so in the last lecture we had W so the difference between last lecture and this lecture is like whether it's w or U um and we're going to use these parameters U and we're going to param the policy now so we're going to parameterize this distribution our policy Pi is now going to be something we directly manipulate so we control these parameters which affect the distribution by which we're picking actions so think of this if this was deterministic which we'll consider later on this could just be a function the policy could just be a function you know which action do we pick in in a given State and now we learn the parameters which tell us you know for each state um which function should I which um which action should I pick and so again you could think of this as some arbitrary function approximator like a neural network where you plug in your state um and it tells you which action to pick or distribution of actions to pick and so formally this policy is going to be now probability distribution over actions that's conditioned both on the state and our parameters so we're actually defining probability distribution and we're tweaking we're learning how these parameters should change the probabilities by which we pick different actions in different states and what we're going to do is try and understand how to 
modify this parameter Vector U so as to solve the reinforcement learning problem so as to maximize the amount of reward that we can get in this environment and again what we're going to do in this lecture is we're going to focus on model free reinforcement learning so we're really going to focus on how to throw down your agent throw your robot into some Maze and and this thing just wanders around and just directly from experience without anyone telling it the Dynamics of the environment it should be able to figure out how to adjust its parameters of its policy so as to get more reward from the environment and in case it's not clear the main motivation behind both of these methods is that we want to be able to scale we want to be able to scale to large interesting mdps large complicated environments um in which it's not possible to separately for each distinct State say oh I'm going to take this action here in this state and this action here in this state so we need a function approximator we need some parameters that sort of specify um how across this state space we're going to modify these parameters uh apparently I spent too long on that slide Ah that's the power so while we're waiting for my computer to come back to life any questions hopefully it's um the motivation should be clear um so so what we'll start to do now is to understand how we can actually take this parameterized um policy with this parameter Vector U and start to adjust those parameters so as to get more reward and the main mechanism that we're going to consider is gradient Ascent in other words how can we compute the gradient to follow so as to make this policy better and if we can follow that gradient we will strictly be moving uphill in a way that improves our policy if we know the gradient with respect to the total reward we're going to get in this mdp we're just going to follow that gradient in a direction that gets us more award what we're going to do over the next few slides is formalize what that means if my computer comes back to live um so that we can start to understand how to maximize reward um and we'll consider a few different ways that we can formalize this objective function um so before we do that I'm going to ask something to you guys while we're waiting which is can you think of any advantages or disadvantages of using policy gradient um compared to value based methods like why why work with policy based methods and you'll think why it might be a better idea to work with policy based methods than value based methods good uh you need to store few things if you're working right so you might need to store more fewer things um so why why why do you think you might need to store fewer things because you don't have to store the value your approxim right so so there are some situations where it's more efficient to represent the value the policy than the value function there might be situations where imagine you're playing one of these atar games like breakout or something and it might be that the value function is quite complicated it might be that actually precisely working out how much score you might get from some point onwards might be tricky um but to represent the fact that if the ball is coming down over here you should always move left might be a very simple um policy to represent whereas actually working out precisely that you're going to get 1,714 points if you move left might be much more complicated function to to to approximate so there are cases where a policy can be more compact um so now we've 
got the slides back I'm going to um come back to this yeah this question here so so some other reasons why it might be a good idea to use um policy based reinforcement learning so maybe the main one you'll see in the literature is is that it has better convergence properties that we saw that there are some cases where value based methods can oscillate or chatter or even diverge if you do things in the wrong way whereas if you just follow the gradient of your policy um these methods tend to be more stable you're smoothly updating your policy you don't get these dramatic swings in what your what action you're going to take because you just make a little incremental change to your policy and you move in the direction that makes your policy better and so they tend to have better convergence properties and indeed if you just directly follow the policy gradient you're guaranteed to converge at least to local Optimum um another big one is that these methods are effective in continuous action spaces or high dimensional action spaces with a value based method you have to work out how to compute the max and I would say this is probably the number one reason to use policy gradient methods is um they circumvent the requirement to have this Max so if we're doing something like Q learning or indeed if we're doing something like Sasa there's always a maximization I've always been arguing that well once we've got the Q um once we got Q all you need to do to pick actions is to maximize over a and you pick the you pick the action with the highest q and you're done you you you know you found the the best way to behave but what if that maximization is itself prohibitively expensive what if you've got you know a trillion actions or a continuous action space What do you do this space this is actually quite common if you've got a robot it operates in a continuous action space of of TS and so forth if you've got if you're trying to I don't know solve an internet based problem where trying to create the perfect um advert to show to your um to your users um actually creating that advert might have many many different components that you'd like to move around and you'd like to make a sequence of of adverts that you show that actually maximize your your income with that user this is a vast action space so what do you do um well you can solve this complicated maximization every step or you can use policy based methods where you just adjust the parameters of your policy directly and you never have to solve for this map it's like you're incrementally starting to understand what the max will be rather than estimating the max directly every step um and the third reason that you might want to consider these policy based methods is that they can learn stochastic policies in the next couple of slides we'll see why that can sometimes be a good idea um so disadvantages so the main disadvantage is that naive policy based reinforcement learning can be slower and higher variance and less efficient than value based methods I think the way to think of this is that value based methods when you use the max are ex aggressive it's like this Max in one step is trying to improve your policy in the direction that absolutely pushes you all the way to what you currently think is the best policy whereas policy gradient methods just take a little step in that direction they smoothly update in that direction which makes them more stable but also sometimes less efficient and sometimes actually Computing that step to take that gradient if you do 
it naively it can be higher variance and slow down the whole algorithm but we'll see later how to get the best of both worlds by including a value function and in practice you may want a value function as well which can add to the complexity of the algorithm a little bit and just the flip side of the point which was made earlier sometimes there are examples like we discussed in breakout where a policy can be more compact um but it's also conceivable that there are cases where value function gives you um a very nice compact representation and gives you a more direct way to learn your U solution to your mdp okay so so so let's consider how can it be the case why would we ever want to learn a stochastic policy you know ultimately aren't we just maximizing don't we just want the thing which maximizes don't we want to maximize our reward and if we're always maximizing reward intuitively you might think a deterministic policy should be good enough because deterministic policy is always some you you can always find it um maximizing something um can be a deterministic process but there are cases where it's actually desirable to have a stochastic policy and we'll see two of those now um so the first example is rock paper scissors or Ro Lambo or it has lots of different names depending where you're from and how you play this um but it's just the game where you show paper scissors or stone and scissors beats paper rock beats scissors paper beats Rock um and so you probably know this but if you were to just play this deterministically any deterministic strategy is very easily exploited if I always play rock and I start to play against you you'll very quickly figure out that you should just play paper and and I'll lose every single time and in fact um the only optimal policy in this domain in terms of a Nash equilibrium which is the game theoretic notion of optimality the only optimal policy is to play um actually uniformly at random anything else can be exploited because if you play one choice more often your opponent will learn to exploit that so this is an example where the optimal behavior is actually um stochastic a stochastic policy is necessary a second example is when we have state alus in so this comes up when the markof property doesn't hold so we might be in an mdp but it might be that we we lose the markco property we get partially observed environment so if you remember this distinction between fully observed and partially observed what's going to happen now is we're only going to see certain things about this environment or equivalently we're only going to use features um which give us um incomplete information about where we are in this environment so consider this this little problem here so as you might get yes getting to this treasure is good and getting to the skull and crossbones is bad um and this is just a little grid world you get to move around it go northeast south or west um and the question is um what's the uh best that you can achieve in this domain except we're going to restrict the way that we can create our value functional policy to use the following features so we're going to cons consider features which basically tell us they features of both the state and action so they're going to be features that say things like is there a war wall um to the north of me is there a wall to the south of me um in between these squares by the way if they're adjacent there's no wall between them so this is like is there an exterior wall um and the other part of the feature is is there an 
exterior wall to the north and I'm moving to the East and that would be a feature of my state end action pair okay um so we're going to consider features like this is there a wall to the north when I'm going east and we're going to consider all such features that is there a wall to the West when I'm going south and um all kinds of things like this um and we want to know you know what's what's the best I can do using either value based or or policy based reinforcement learning um so in each case we're going to approximate either the value function using these features and doesn't matter what the function approximator is this could be linear it could be a complicated neural network it could be your favorite thing the question is what's the best that could be done in principle um and we're going to compare it to policy based reinforcement learning where we parameterize a using some function of these features and the question is which will do better so does anyone have any intuition into this I'll show it over the next couple of slides but just to have a quick think if you can figure out what's going on yeah you're not going to be able to tell the difference between those two gray squares with the features right so so the point was that these two gray squares are Alias there's no way to tell the difference between these two gray squares using the features that you have like if you're here and going west or if you're here and going west those two things look exactly the same to you um because you've got a wall to the north you've got a wall to the South um and you're going west so this your feature Vector will be identical and you can't tell the difference between those two states so as a result um if you use a deterministic policy then you have to pick the same action in those two great States a deterministic policy which is parameterized in terms of those these two things um and and so imagine that you you learn a value function now the value function for these two states has to be the same and if you act greedily with respect to that value function you'll either go west all the time or you'll go east all the time okay um and so this greedy idea of generating a deterministic policy or even a near deterministic policy would basically lead to the following Behavior which is if you're on this side um of the of the map you would oscillate backwards and forwards forever if you're using greedy or for quite a long time if it was Epsilon greedy and you'd never reach the treasure or take a very long time to reach the treasure if you use a stochastic policy um then things are much better because we can we can choose we can optimize our policy so as to choose a policy that goes east or west with 50/50 probability here so now if we're in this gray State we'll we'll just randomly pick a dire Direction and if we find ourselves over here well sometimes we'll go left but sometimes we'll go back right and then we'll be pushed down to the treasure so in just a couple of steps we're likely to reach the treasure compared to a value based method um where if we were to act greedy with respect to our value function we would end up taking in certain cases a very very long time to reach the treasure possibly forever okay so the summary of these couple of slides is to say that whenever State a lising occurs um a stochastic polic can do better than a deterministic policy so we had a theorem early on in the course which was U that there's always a deterministic um optimal policy for any mdp but that's for a markof 
decision process when you have that was in the table lookup case where we have the perfect state representation um the moment you have state aliasing so you're partially observed um so if you're in a partially observed Mark of decision process or if you're function approximator the features that you use limit um your view of this world which is equivalent of being in a par observe mdp then it can be optimal to use a stochastic policy in which case policy search methods can do better than than these value based methods which just act greedily okay so let's talk about how to optimize a a policy so what does it mean to optimize a policy well we need to know what the objective should be so here's three different objective functions we might choose we want to know how good is this policy um the first choice is um we so we're going to consider some parameterized policy with these parameters U and we want to know essentially what's the best U um so the first case we're going to consider um is if we're in an episodic environment we're going to consider the starts value this basically says if I always start in some State S1 or if I have some distribution over start States S1 What's the total reward that I'll get from that start State onwards okay so this only works if there's this notion of a start state so if there's a start State you know you might say really the thing I really care about is I want to find the policy which when I dump my agent down in the start State I start my game with Atari and I play it all the way to the end it gets the most score um or we could use the average value and this is something we could do in a continuing environment so now an environment which just goes on forever they might not be a start state so now we might want to say well let's consider the probability that we're in any state this is the stationary distribution this is the probability that we actually end up in any state under our policy Pi um times the value from that state onwards so it's like averaging over the values of all states I might be in state one and and get um 10 units of reward from that point onwards or I might be in state two and get 20 units of reward from that point onwards and so we might average over those and say well the objective is is maybe 15 um or we might consider another formulation which is the average reward W per time step which is similar but we might say well what we really care about is just let's consider some continuing environment I'm going round and round and around and all I care about is to make sure that per time step I'm getting the most reward so we're not summing these rewards over time we're just saying we're going to take the average of my immediate rewards over the entire distribution of states that I visit so this is basically saying the objective is there's some probability that I'm in a state with some probability I'll take an action from that State under my policy and this is the immediate reward that I get at that step what we care about is getting the most reward per time step that's another way of saying to extracting the most juice from my environment I get the most reward per time step um this relates to the average reward formulation that was like an extra for um lecture two that you find in the notes okay so it turns out that fortunately for us that exactly the same methods apply to all three of these um they turn out to the policy gradient is essentially the same for all of them um the only which differs is the definition of this distribution term 
here like what's the probability that I'm in some State s um and and they're just rescalings of each other um so they turn out to basically follow the same gradient Direction and so we don't really need to worry which of these we're following because we end up with the same policy gradient um so now what we're going to consider is trying to optimize one of these objectives so we're going to pick one of those three objectives and we're going to say what we want to do is to find the U find the parameters that maximize this objective function so find the U such that when you drop your your ibow robot down your ibow will walk as fast as possible um and so how can we do that well there's lots of familiar approaches to optimization hopefully you've come across some of these um optimization methods are classically divided up into gradient based methods and um gradient free me methods for optimization so if you don't have access to the gradient um there's some algorithms you could use for optimization like hill climbing or Simplex method where you kind of got a triangle that you flip over and over and over as it folds its way down the hill sometimes called the amoeba or nomid um genetic algorithms these are all optimization methods that you could throw at a a policy search problem and try and find the best policy which maximizes one of these objective functions but typically the reason people consider gradient base methods is if you have access to the gradient you can almost always get greater efficiency out of your optimization method um and some of the familiar methods for doing this are things like gradient descent conjugate gradient quasi Newton methods these are there another large family of methods that make use of the gradient information um to gain greater efficiency because you know this gradient points you in the direction of of Greater um reward more where points to the direction of how to do better in this mdp rather than having to blindly try things by trial and error figure out which of those is the best so we will focus on gradient descent the simplest instance of this of course once you understand how to do gradient descent you should also understand that this can be extended to other methods um for example second order methods quasi Newton um all of these things are possible but I think the ideas are best Illustrated and often work most effectively with the simplest case um which with very little computation um you can get a great deal of um um effective reinforcement learning out of this we're also going to focus on methods that exploit the sequential structure by which I mean we're not just going to do blind optimization like a genetic algorithm um where you have to run you have to throw your robot down wait until the end of its lifetime to get one number that pops out of it that says you know I achieved X in my lifetime we would rather break open the trajectory see that partway through this trajectory it had achieved some um sequence of states and rewards and make use of that sequence of states and rewards to to um do better than waiting all the way until the end until this thing dies before you can actually optimize it so we'd like to learn within the lifetime of an agent okay so we'll start with a very simplest version which is finite difference policy gradient methods um so so let's begin by this our familiar idea of what policy gradient is so in the last lecture we considered policy um we considered gradient descent because we had something like a mean squared error and we 
were trying to minimize the error um in this lecture we're going to consider gradient Ascent where we have an objective function which is something like how much reward can I get out of this system and we want to make this thing higher we want to basically Ascend our objective function so we can imagine that there's some surface that says you know for um any policy parameter so imagine that these are two par two different policy parameters and there's some surface defined by those two different policy parameters and they might be parameters of how my eyebow walks around and we're going to adjust those policy parameters um in the direction that gets us a faster walk for example um and and this surface has some shape and the gradient basically points us uphill in the direction of steepest Ascent uphill and so all that says is mathematically what we're going to do is adjust our policy parameters a little bit in the direction of the gradient of this objective function and the gradient of this objective function is just the vector partial derivatives so it's the vector that says you know along this direction here um how much progress can I make by adjusting this parameter along um this other axis here how much progress can I make and we just move things in a direction that maximizes our progress um and makes the most possible difference in a infantes radius around where I am now so we're adjusting the parameters to get more juice out of our environment okay um so the simplest way this is the naive approach the numerical approach um if you had no way to understand what the gradient was you could just do the following you could basically um look at your objective function and estimate this thing numerically by saying well if I just perturb my parameters a little bit I could perturb them um in each Dimension I could say well what would happen if I changed my policy parameter along the first axis a little bit and I would look at the difference between um the objective after I perturbed it and the object before I perturbed it that would be an estimate of the gradient in that direction um and then we do that for each Dimension separately and that gives us a numerical estimate of the gradient um this is a little bit naive because you need an awful lot if you we're working High dimensional space you know imagine your policy has is a neural network with a million parameters now you need to do a million different evaluations of um your policy you need to perturb it in all those different directions to get an estimate of what the gradient is in that direction um there are techniques that um use random directions like spsa um which reduce the number of samples you need to take but they still are very very noisy and so it takes an awful lot of computation using these methods to actually even get one estimate of the correct gradient but nevertheless it's simple there's no um you know this is if you really can just perturb your policy and try again multiple times you will get something which points you in the correct direction eventually um and so it you can work for arbitary policies and one big Advantage is this works even if you've got some non-differentiable policy like someone exposes your gives you the program for your robot and tells you you know here's some parameters for your program but we don't know how to differentiate them you know you can still you can still adjust it by by following this process um so I've been talking about an IO so let's make that concrete um so it turns out that in robocup 
soccer um things have got a little bit better but I think it's roughly still true that with the IBO league in particular this is one of the leagues of Robo Cup soccer they have these little dog robots the ibos running around trying to score goals against each other um so it's quite a fun competition um but the thing which determines which team wins is basically how fast you can get your your eyebow to to run if it can walk faster you basically are able to just take the ball and move it faster down down the field and so um so the team with the fastest iOS historically has always won and so it's become an interesting machine learning challenge to make the IBO gate um more efficient and make it run faster um and so the IBO gate is basically controlled by 12 different U real numbers which are um sort of shown in this diagram here doesn't really matter what they are um but the goal is to adapt these parameters now and the first people to do this Peter stonel they did it by finite difference policy gradient um just like we just described in the last slide um and what they did was they basically just ran these robots backwards and forwards along Ong this field and measured how long it took them with each gate and gave them a few walks up and down and measured how long it took um did that for each of these 12 Dimensions separately um and then you say you have to run it up and down to get one dimension run it up and down again 12 times to get just one gradient update you adjust your parameters in that direction and and iterate this process until your grad students collapse of exhaus you um and and so they did this um and let me see if I can pull up some some videos um okay so this is at an early stage before it's really being trained and I think what the main thing to notice is that it's um it slips a lot that this thing is really dependent on the on the ground which it's on and it tends to slip an awful lot um and over time it starts to learn something a little bit better this one uh oh oh no there we go let me try that again I think that was just a artifact no but you can see it's learned this much better gate that um deals with the um the slipp of the of the surface much better and it can move much much faster so this was just by adjusting the parameters in the direction of the gradient in a very naive straightforward way um and you know some limited number of iterations were sufficient to make progress in this domain and actually figure out a a reasonable gate now if you move to higher Dimensions um these numerical methods tend to collapse and so what we'll consider for the rest of the class is how to analytically compute the gradient so you don't have to laboriously compute this gradient separately for each of your um dimensions okay so we'll start with um simplest approaches no value functions yet um that's the Monte Carlo approaches and then we'll come back and we'll bring value functions back into the picture Okay so we're going to use the following idea which is we want to compute the policy gradient analytically and what we're going to assume is that our policy is is different iable doesn't actually technically have to be differentiable everywhere it only has to be differentiable when it's actually picking actions so if you have some non- differentiability but you're not actually ever picking that action that's okay um we're also going to assume that we know the gradient of our policy that we're in control of our policy you know something like a again could be a gum policy or a softmax 
policy we'll see some examples or a neural network um and we know the grent of this thing because we've created it um and what we're going to use is a trick called likelihood ratios um and the likelihood ratio trick I'm just going to put it up in its in isolation because what we're going to see is pretty much for the rest of the course this magic gradient of the log policy term is going to appear um and I think it's useful to understand why this gradient of the log appears because otherwise you'll just be scratching your head saying you know why why are we suddenly looking at the log of the policy um not the policy itself and it basically comes up just from this following um likelihood ratio trick where what we're going to want to do is we want to take the gradient of our policy we want to understand the policy gradient and we're going to want to take expectations of that thing and so the way we do that is we basically note that we can multiply and divide by our policy without changing it and at this term over here on the right the gradient of the policy divided by the policy is equal to um the gradient of the log of the policy okay um so that's just um straightforward um calculus and this term over here is a very familiar quantity from statistics and machine learning it's sometimes called the scrore function but if you were doing something like maximum likelihood learning this is the term which would come up if you were just trying to maximize the likelihood of some action a which someone gave you you would follow this gradient this is the thing which maximizes the log likelihood um and so so the way to think of this term is this is the thing which tells you how to adjust your policy in a direction that gets more of something so if you were to tell it the right action and you were doing supervised learning you would plug in exactly this term for some particular supervised action here and that will tell you how to get more of that particular action so this tells you how to adjust your policy in a particular direction to achieve more of something that that's good um and so we're going to use that and what we see now is that by basically rewriting this gradient in this way we're able to take expectations like Computing the expectation of this thing is hard but Computing the expectation of this thing is easy because we have this policy here and this is actually the policy we're following and so we can take an expectation of this term because this is something we can actually sample from this is the policy that we're actually following ourselves um so we'll see that in a minute but just before we do that let's consider this score function and understand what it looks like for two common examples so perhaps the simplest example you should have in your mind which we use for discrete actions is the softmax policy so the softmax policy um is basically something where we want to have some um smoothly parameterized policy that tells us how frequently we should choose an action for each of our discrete set of actions so it's an alternative to something like Epsilon greedy or um and and the idea is to basically say um what we're going to do is we're going to take some features we're going to form some linear combination of features for example and we're going to consider that as like there's some kind of value that tells us how much we'd like to take an action so we've got features of our action in our state we've got some parameters for those features and then we waight those features by um by some 
weight and we add it all together and that tells us how much we desire that particular action a and then what we do is to actually turn that into a probability we just exponentiate it and we normalize so the idea is that the probability that we actually pick an action is proportional to the exponentiated value that we get when we take a linear combination of these features so this is called the linear softmax policy um so softmax in general is a policy that's proportional to some exponentiated value and you can choose those values to be anything you want and here we're just parameterizing those values in the simplest possible way by using a linear combination of some parameters with our features so this is a very straightforward way to um to parameterize the policy in a discrete domain so imagine we're doing Atari um and we want to know you know should I go left or should I go right well we'd have some features um of going left and we'd have some features corresponding to going right we would weight each of those and whichever one scores more highly when we make this weighted sum um would get higher probability when we actually come to pick these actions and now what we're going to try and do is find the gradient of this thing we want to know well how do we adjust this thing so as to get more of the good things so we need to know the score function and the score function is very intuitive for this softmax the score function the gradient of this lock policy is just the feature for the action that we actually took minus the average feature for all the actions we might have taken so so it's basically saying how much more of this feature do I have the usal and that's the score function and so what we'll see is when we start to do these kind of policy gradient algorithms what we'll actually end up doing is saying you know if a feature occurs more than usual that's what this thing is telling us if it occurs more than usual um and it gets a good reward then we want to adjust the policy to do more of that thing okay so this is the softmax policy we're also I just want to give one more example of of policy which is the the gaussian policy so in a continuous action space like the IBO example um the commonest policy to use is a gum policy um and what do we do with a gan policy well we basically just parameterize the mean of this Gan and then you have some Randomness around that mean some variance around that mean that says most you know most of the time I'm going to take this mean action that's given by say a linear combination of some features um but sometimes I might take some deviation from that mean and be some variance like Sigma squ um which could also be parameterized you could have a variable variance um but the simplest case is to say now the action is selected According to some normal distribution with mean mu of s where this might be a linear combination of our features like before we've got features of our state and some weights for those features and some variance parameter so basically we get to pick the mean and then we just um add some noise to that thing to make it to fastic and so if we were using a gan policy what would this score function look like well the score function is basically again this tells us you know how to get more of of a particular action um and and here again we have something very similar to the last slide where this tells us this is the action we actually took this was our um our action a u and this is the mean action this is the mean so we basically want to 
know you know let's take the action we actually took minus the mean that tells us how much more than usual we're we're moving we're we're doing a particular action um multiplied by the feature and then we just scale that by the variance so in both of these cases the score function takes this form of sort of how much more than usual am I doing something um and and that's quite intuitive when we start to look at these policy gradient updates okay so that's just to give some intuition for what these score functions look like um now let's talk about the actual policy gradient this analytic policy gradient and to do this I'm just going to derive it for the simplest possible case and you know homework if you like could be to extend this to the full mdp case uh we'll talk about you know what it looks like but I won't derive it for you um and the special case we're going to consider uh onestep mdps so what do I mean by onestep mdp I mean that you start in some start State s um you get to take one step and you get one reward that depends on where you were and what action you took but then the episode terminates immediately so there's no sequence in this case it's just a one step and then you're done and you get some reward and the goal is to pick your actions so that you maximize that reward and the right thing to do might depend on the state and and the action um so we'll see later this is also called in subsequent lectures a type of contextual Bandit um and so now if we were to try and solve this um we want to find the policy gradient we want to understand and how to adjust our policy parameters so that no matter what state we're in we pick an action that's that's good we want to move our adjust our way of picking actions so as to get more rewards and the way we're going to do that is to start off by picking our objective function so in this one step case all three of those objective functions we talked about become the same um essentially um yeah and and so our objective function is just to uh the expected reward our policy so we want to find the parameters which give us the most expected reward and so we want to find the gradient of this thing and send the gradient of this thing which will give us more expected reward so if we're thrown into any state and we pick an action according to our policy we want an expectation that to give us the most reward and so the way we're going to do that um to just expand this out um so this is the expectation over the state that we start in and this is the expectation over the actions that we pick under our own policy so this is the expected reward um and now we're just going to plug in the likelihood ratio trick so we're going to take the gradient of this whole thing we want to know the gradient that gets us more award so the gradient of this whole thing um basically the only term which depends on you is our policy um so the gradient of this whole thing um we can push the gradient right in to the gradient just of this term here um and expand that using our likelihood ratio trick so this whole thing the gradient becomes now the policy multiplied by the gradient of the log policy multiplied by the reward that you get and so this is again now an expectation that's the whole point of using the likelihood ratio trick we start with something which is an expectation you take the gradient and you recover something which is still an expectation so this thing is an expectation under our policy so this is the expectation under the start State an expectation under the 
action that we pick of the gradient of the log policy multiplied by the reward so it's just the expectation of the the score times the reward so after all that we basically come back to something very simple which tells us that if we want to increase our our objective function if we want to get more reward we just need to move in the direction um that's determined by the score times the reward okay and so again this tells us this tells us how to adjust our policies so to get more or less of something and this tells us whether it was good or bad so if you have a really big positive reward you want to move more in the direction that gets more of that thing if you have a really low negative reward a really large negative reward you want to move in the opposite direction to your gradient okay any questions St silence okay um so yes can I ask um this is model three right yeah and we have the capital R means no rewards and everything okay so so notation this capital r is the the random reward the actual reward that you experience this curly R is the um reward function for the mdp so we've expanded this in terms of something which actually is described in terms of the Dynamics of the true mdp but we brought it back again to an expectation where we only need to look at the the sample reward that the agent actually experiences so this is completely model free so model free what this means is if we want to estimate this expectation we just take an action from a given State we compute the gradient of this log policy which we know because it's our own policy and we multiply it by the reward that we actually experience and that gives us a gradient step and we follow that gradient step so it's model three good question okay so we want to do the same thing in in mdps multi-step mdps not just these one-step things and to do that we actually just need to do one trick which is to replace this instantaneous reward that we had so we were multiplying the score by the immediate reward and to to do a multi-step mdp all we need to do is actually replace this immediate reward with the value function the long-term reward and that actually turns out to be the true um gradient of the policy and the policy gradient theorem tells us precisely um what that means this is just to formalize that which just to say that if you start with some policy um then for any of those objective functions we considered earlier the start State the average award the average value um the policy gradient is basically given by this thing here um which is some expectation over the score function multiplied by the action value function Q so it basically tells you you know again how to adjust the policy so to get more or less of that particular action multiplied by how good that particular action was or that particular action is in expectation so you want to adjust the policy in the direction that does more of the good things and less of the bad things and if you were just doing sup supervised learning and maximum likelihood learning you'd have something which looked very very similar um but you just wouldn't have this value function term here you would just be adjusting things in the direction that made your policy achieve more of the things that your teacher tells you and so what we have in reinforcement learning is instead of that we adjust it for all actions we just have to try those actions and see well are they good or bad in our according to our value function okay so that's the policy gradient theorem um let's consider the simplest way 
to actually use that now to make an algorithm and we call this the Monte Carlo policy gradient which is roughly the same as a really old well-known algorithm called reinforc um and the idea is to update the parameters by stochastic gradient Ascent so we're going to get rid of that expectation now and just do something very practical and we're going to use the PCY gradient theorem and what we're going to do is just sample that expectation um and the way we're going to sample that expectation is just to say um what we wanted to the term which appeared in the gradient theorem was this Q so let's just sample Q let's just estimate Q by using the return um as an unbias sample of this Q so we're going to be in a state we're going to start by taking this action and we're going to see what return we got and we're going to use that as an estimate q and we're just going to plug that in um into our policy gradient um to give us an Direction an estimate of a direction to move and so every single episode now what we're going to do is we're going to say you know at each step we're going to adjust our parameters a little little bit in the direction um of this score multiplied by the return the return that we actually got from that point onwards and so what would that look like in an algorithm well you'd start off you pick some arbitary parameters you run your episode and for that episode um so first of all you run forward you generate all of your your rewards this is like a forward view algorithm so you have to go all the way to the end of the episode generate your rewards at the end of that you know what the return was from each of your steps now so now go over each of your steps again um and you say um for each of those steps we want to adjust our parameters a little bit in the direction of this gradient the stochastic gradient um which is the score how to adjust up our parameters in Direction um that gets us more or less of something multiplied by this sample return from that step so that it's as simple as that it's a very simple algorithm so this is a the the most straightforward approach to policy gradient you just sample returns and you adjust policy in the direction that gets you more of those rets and GT in this case is R GT GT is the sum of the rewards it's the it's the return so it's the sum of the the rewards from the beginning of the uh from time step T onwards until the end of the episode so it's the same definition of return that we've used before so GT is the return um accumulated rewards from that time step onwards um okay so here's a little example um this is using that same Monte Carlo policy gradient um with a particular um policy uh parameterization and the idea is that you start off with this uh this puck in this world and and this Puck is very sliding along there's very low friction in this domain and you just get to exert a small force on it every time step which can just slightly protur its direction so if you're not careful it can really overshoot its its Target and then the target is given by this p again which moves around every now and then um and it has to try and reach that Target um and so if it it gets a reward when it gets close to the Target and it's reset every 30 seconds and then you just go you just run trajectories and you use this Monte Carlo policy gradient um and you upda in the direction of those returns so you always you measure your return you go back over it and you adjust your policy a little bit in the direction that gets you more of those returns um and then 
Here's a little example using that same Monte Carlo policy gradient with a particular policy parameterization. You start with a puck sliding around a low-friction world, and at every time step you get to exert a small force that slightly perturbs its direction, so if you're not careful it can really overshoot its target. The target moves around every now and then, the puck gets a reward when it gets close to the target, and the world is reset every 30 seconds. You just run trajectories with Monte Carlo policy gradient: you measure your returns, go back over them, and adjust your policy a little bit in the direction that gets you more of those returns, and you can see how this works.

There are two things I want you to take away from this. The first is that you get a very nice, smooth learning curve; value-based reinforcement learning tends to be much more jagged, sometimes collapsing and sometimes chattering, although it can do very well. So this solves the problem very nicely. The second thing to take away is the scale: we're talking about something like 100 million iterations to solve this problem. It's slow. Monte Carlo policy gradient methods tend to be slow and very high variance: they look at the complete return of what happened, estimate the gradient from that return, and move in the direction of that estimate. The rest of this class is about using similar ideas but making them more efficient: keeping the nice, smooth learning you get from adjusting the parameters of your policy, while reducing the variance and making it work better. And that brings us to the final family of algorithms, actor-critic methods.

A quick recap. We have a high-variance estimator of the gradient: we took the policy gradient and plugged in the return as a sample of our Q values, but that sample is very noisy. If I'm playing Atari, on one episode I might score 1,000 and on the next I might score zero, just because of the randomness of what happens over the course of 10,000 steps; there are many different random events that might occur. The main idea of actor-critic methods is that instead of using the return to estimate the action-value function, we explicitly estimate the action-value function using a critic, a value function approximator. We bring back the ideas from the last lecture and combine value function approximation with our policy methods. So again we have the true action-value function, we estimate it with a function approximator, and we plug that estimate into our policy gradient as a substitute for Q^π.

What this means is that we now have two sets of parameters, which is why these are called actor-critic methods: we have a critic and an actor. Intuitively, the actor is the thing doing things in the world; it contains the policy, it picks the actions, it makes the decisions about what to do, which is why it's called the actor. The critic doesn't take any decisions; it just watches what the actor does and evaluates whether that was good or bad, saying those decisions were good, they got a score of a thousand, or they got a score of minus a thousand. We combine the two together, and the main idea is to use an approximate policy gradient instead of the true policy gradient: we adjust the actor, the policy, in the direction which, according to the critic, will get more reward. The critic says, I think if you go in this direction you can do better, and the actor moves in the direction of that gradient.
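Written out, the substitution described here looks roughly like this (a sketch in LaTeX, using w for the critic parameters and u for the actor parameters as the lecture does, with the expectation over states and actions visited under π_u):

    Q_w(s,a) \approx Q^{\pi_u}(s,a)

    \nabla_u J(u) \approx \mathbb{E}_{\pi_u}\!\left[\nabla_u \log \pi_u(a \mid s)\; Q_w(s,a)\right],
    \qquad \Delta u = \alpha\, \nabla_u \log \pi_u(a \mid s)\; Q_w(s,a)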
So here's how we do that. Our original policy gradient was an expectation of the score multiplied by the action-value function; that was the policy gradient theorem. We replace the true action-value function with an estimated, approximate value function, using a neural network or whatever to estimate it. Then, at each step, we move a little bit by stochastic gradient ascent in the direction of the score multiplied by a sample from our function approximator. The critic says, I think this thing is good or I think this thing is bad, and we move a little bit in the direction that gets more or less of the things the critic says are good or bad.

So how do we estimate the action-value function? What do we do for the critic? That part should be familiar: it's what we did last lecture, the problem of policy evaluation. If I'm following some policy π, we're trying to estimate Q^π. We're not trying to optimize it, and we're not trying to build a critic that estimates Q*; we're just asking how good the current behaviour is, how much reward I'll get under it, and then we use that to point us in the direction that improves the policy. So we want to know how good the current policy π_u is for our current parameters u, and we can use everything we've learned so far: Monte Carlo policy evaluation, temporal-difference learning, TD(λ), least-squares temporal-difference learning, pick your favourite policy evaluation algorithm, throw it into the pot, get an estimate of the action-value function, and use that to adjust your actor.

Here's a canonical example to make things concrete. Say we use a linear value function approximator for the critic: we estimate Q using features of the state and action multiplied by some critic weights v. In this lecture we'll use v for the critic parameters and u for the actor parameters, to keep them separate and distinct from what we did last lecture. So we estimate the critic with a linear combination of features, we update the critic in the usual way using linear TD(0), and we update the actor parameters u by gradient ascent: the score multiplied by what the critic says.

That gives us the Q actor-critic. Every step of the algorithm we perform an update; we don't need to wait until the end of the episode like we do with Monte Carlo, because we're using TD in the critic, so this is an online algorithm. At every step the agent experiences, we sample a transition, we see what the reward was and where we ended up, we pick an action according to our own policy, and we compute the TD error between the value before the step and the value after the step, just like in the previous lectures. We update the critic a little bit in the direction of the TD error multiplied by the features; that's linear TD(0) from last lecture, correcting the error between what we thought the value was and what it turned out to be after one step, making ourselves self-consistent. And we also adjust the actor, in the direction that gets us more of the things the critic says are good. So that's the actor-critic idea, in the canonical case, using linear TD(0).
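Here is a minimal sketch of that Q actor-critic (QAC) loop in Python, reusing the softmax_policy and score helpers from the REINFORCE sketch above. The linear critic, the particular step sizes, and the environment interface are assumptions made for illustration; the structure (a per-step TD(0) critic update plus a score-times-Q actor update) is what the lecture describes.

    import numpy as np

    def qac_episode(u, v, env, alpha=0.01, beta=0.1, gamma=0.99):
        """Q actor-critic: linear critic Q_v(s,a) = phi(s,a) . v, softmax actor pi_u;
        both are updated online at every step."""
        s = env.reset()
        feats = env.features(s)                       # assumed: phi(s, a) for each action a
        a = np.random.choice(len(feats), p=softmax_policy(u, feats))
        done = False
        while not done:
            s2, r, done = env.step(a)
            q_sa = feats[a] @ v                       # critic's estimate Q_v(s, a)
            if done:
                td_target = r
            else:
                feats2 = env.features(s2)
                a2 = np.random.choice(len(feats2), p=softmax_policy(u, feats2))
                td_target = r + gamma * (feats2[a2] @ v)
            delta = td_target - q_sa                  # TD error
            v = v + beta * delta * feats[a]           # critic: linear TD(0)
            u = u + alpha * score(u, feats, a) * q_sa # actor: score * Q_v(s, a)
            if not done:
                feats, a = feats2, a2
        return u, v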
Any questions so far? I'll pause here, because we're going to start adding more tricks to reduce variance and make these algorithms more efficient, so if you're feeling uncomfortable, now is the time to say; we're going to layer more pieces on from here on in.

Question: so you have a policy, you get an estimate, and from that estimate you try to get another policy, is that right? That's right. We can think of this as another form of generalized policy iteration: we start with a policy, we evaluate that policy using the critic, and then, instead of doing greedy policy improvement, we move a gradient step in some direction to get a better policy. It's just our familiar machinery of making the policy better and better: at every step we evaluate the policy and move a little bit up the gradient to find a better one.

Question: does it matter how you choose the first policy? You can pick an arbitrary policy; you initialize your policy parameters however you want. Note that now we just have parameters, we're not using epsilon any more: we have some parameters, those parameters determine our policy, and we follow the gradient so as to make those parameters better, so that the policy gets more and more reward out of the MDP. We start with some policy parameters that determine our behaviour, and we want to make that behaviour better and better. There's no greedy and no epsilon-greedy any more; the policy itself determines how we move around the environment, and we always pick actions according to the policy determined by these parameters u.

One more question: if your policy were epsilon-greedy with respect to your Q function, do you just fall back to the value-based case? Let me go back to a slide that I skipped at the beginning because the projector was down. We had this in the introduction, but it's worth reiterating. Let's classify our algorithms by whether we represent a value function, a policy, or both; that's what this Venn diagram shows. If we represent a value function, we call it value-based; if we represent a policy, we call it policy-based; and if we have both, we call it actor-critic. Epsilon-greedy is something we do when we live on the value-based side of the diagram: we don't explicitly represent the policy, we use a simple trick, which is that if we've got Q we just act greedily with respect to Q and sometimes act randomly.
What we're considering in this lecture is an alternative approach, which is to learn directly how to pick actions: we parameterize the policy. Instead of doing greedy or epsilon-greedy, we directly parameterize the policy, we decide how to pick those actions, and we want to figure out whether picking actions in a particular way gets us more or less reward. So we're in a different space altogether, where we're directly parameterizing things, and there's no greedy and no epsilon-greedy any more.

Question: but what if you parameterize your policy to be greedy over your value function? That's an implicit parameterization, which I'd call value-based here: if we're saying our policy is the thing that acts greedily with respect to my value function, that's purely value-based, and we're not explicitly representing the policy. But is there a way to recover a pure value-based method from policy gradient? There's a result which says that, for a particular class of policy gradient algorithms we'll see shortly, it falls out as a limiting case as you make the step size bigger. All of these policy gradient algorithms take some step in the direction of the gradient, and you can ask what happens if you take an infinite step in that direction. For a certain class of policy gradient algorithms, certainly for the natural gradient algorithms we'll see later, an infinitely large step takes you to a greedy step. So the step size is what controls how greedy you are. Think about the softmax policy, the one where we exponentiate some score. Say you had a policy that was taking left more often than right, and now you see that left actually did even better than right, so you saw a good score for left. You'll push yourself a little bit in the direction that makes left even more probable and right less probable; if you make your step size infinite, you'll push the probability of going right all the way down to zero and the probability of going left all the way up to one, which is exactly acting greedily. So there is a limiting case where you recover greediness, and the step size is what controls the smoothness. It's certainly true for some algorithms.
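A tiny numerical illustration of that last point (my own example, not from the lecture): scaling up the action preferences of a softmax policy, which is what ever-larger gradient steps tend to do, drives the action distribution towards the greedy one.

    import numpy as np

    def softmax(prefs):
        p = np.exp(prefs - np.max(prefs))
        return p / p.sum()

    prefs = np.array([1.0, 0.5])           # "left" is preferred slightly over "right"
    for scale in [1, 5, 50]:               # bigger effective steps exaggerate the gap
        print(scale, softmax(scale * prefs))
    # 1  -> [0.62 0.38]
    # 5  -> [0.92 0.08]
    # 50 -> [1.00 0.00]   (effectively the greedy policy)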
I'll take one more question and then move on. Question: when you were learning a value function, the Bellman operator was a contraction, so you knew you would get to the optimal value; do you have the same guarantee here, or could you end up somewhere suboptimal? That's a really great question. The question was: if we use policy gradient methods, do we still have the guarantee of finding the unique global optimum, or could we get trapped in some local optimum? Consider the table lookup case, where you represent the value function with one value per state. There, the Bellman equation gives you a contraction and you're guaranteed to reach the global optimum. With policy-based methods, if you just follow the gradient with, for example, a softmax policy with separate parameters for each state, it's known that you also find the global optimum in the table lookup case. For a more general function approximator, something like a neural net, it's clear that neither method guarantees finding the global optimum; you can always get trapped in some local optimum. For certain special cases in between, it's unclear, and I think it's actually an open research question.

Okay, so we're still in the same space. We have this family of actor-critic methods, where the actor picks the actions using some policy, the critic evaluates those actions and says whether they're good or bad, and the actor moves in the direction suggested by the critic. That's the basic idea behind all of these; what we want to do now is make them better, so we'll consider some tricks to do that.

The first trick, perhaps the easiest and best known, is to reduce variance using what's called a baseline. The idea is to subtract some baseline function from the policy gradient, and this can be done in a way that doesn't change the direction of ascent: we can reduce the variance of our estimator without changing its expectation. Another way to say that is that we subtract a term of a particular form from our policy gradient. Our policy gradient was the score multiplied by the value function; we subtract off the score multiplied by some baseline function, and we choose that baseline just to reduce the variance, to make the quantity roughly zero mean and roughly the right scale. You don't want a situation where one moment you see a reward of a million and the next moment a reward of 999,000, so that you're always pushing the policy parameters up and only the relative difference tells you how much; it's much better to be working with something like plus one or minus one. The baseline gives us a way to rescale things. Very briefly, if we expand this term out, it's an expectation over states of the gradient of the policy multiplied by the baseline. We can pull the baseline outside the sum over actions, because it doesn't depend on the action; we can pull the gradient outside the sum; and we know the policy sums to one, because it's a probability distribution. So we're left with the gradient of one, and the gradient of a constant is always zero: the whole term we're adding or subtracting has zero expectation.
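As a sketch of that argument in symbols (using d^{π_u}(s) for the state distribution and B(s) for the baseline; the notation is my reconstruction of the slide):

    \mathbb{E}_{\pi_u}\!\left[\nabla_u \log \pi_u(a \mid s)\, B(s)\right]
      = \sum_s d^{\pi_u}(s) \sum_a \nabla_u \pi_u(a \mid s)\, B(s)
      = \sum_s d^{\pi_u}(s)\, B(s)\, \nabla_u \sum_a \pi_u(a \mid s)
      = \sum_s d^{\pi_u}(s)\, B(s)\, \nabla_u 1
      = 0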
So it's completely legitimate to add or subtract any term of this form. What that means is this: the policy gradient theorem told us that the direction we really want to move in is the score function multiplied by Q, and this result tells us we can add or subtract anything we want from that Q, as long as it's a function of the state only and not of the action, so as to control the variance of the whole term. We want this expectation to be low variance, but we don't want to change the direction of ascent.

There's a particularly nice choice of baseline, which will have some nice consequences for us: the state value function. We start with our Q values, our action-value function, we subtract off the state value function, and what we're left with is called the advantage function. It tells us how much better than usual it is to take action a: we look at the Q value of action a, which says how good it is to, say, go left from a particular state, compared with how good it is to be in that state in general, and that difference tells us how much more reward than usual we'll get by taking this particular action. Intuitively, that's exactly the right quantity to use in our update, and we call it the advantage function.

So we can rewrite the policy gradient theorem in the following way. This is still the policy gradient theorem, still ground truth; we've just rewritten it by subtracting off a particular baseline. The policy gradient is the expectation of the score multiplied by the advantage function. Intuitively, the advantage tells us how much better than usual a particular action a is, and the score tells us how to adjust the policy so as to take that action more. If the advantage is ever positive, if we do better than usual with a particular action, this moves us in the direction that achieves that gain; if it's ever negative, it moves us in the opposite direction, away from that action. So it's always pushing the policy parameters towards situations where you do better than usual. That's the advantage function, and again we're just rewriting the same policy gradient theorem.

So how do we estimate the advantage function in our critic? There are a lot of different ways to do this, and I'm just going to suggest a couple. One way would be to learn both Q and V: the critic learns Q, and it also learns V using yet another set of parameters, and we take the difference between them as our estimate of the advantage. That works, and although it's more complex, in the sense that we have more parameters, it gives us a literal estimate of the advantage function that we can plug straight back into the policy gradient theorem.

There's an easier and probably better way, which is what the second slide is about. This is probably the most commonly used variant of the actor-critic, although it depends who you talk to, and many people use many different variants. It uses the idea that the TD error is a sample of the advantage function. Consider the TD error δ^π: if we knew the true value function V^π, the TD error would be the reward plus the discounted true value of the next state, minus the true value of the state I'm in; that's just the definition of the TD error. Now take the expectation of that whole thing, conditioned on s and a. The expectation doesn't affect the final term, because there's no random variable in it: s is what we're conditioning on. And the expected reward plus the discounted state value of the next state, given s and a, is precisely the definition of Q; this is just unrolling the Bellman equation by half a step. So in expectation we're left with Q minus V, which is the advantage function: the TD error is an unbiased sample of the advantage function. If we want to move in the direction of the advantage, all we need to do is measure the TD error and move a little bit in its direction.

In other words, suppose I'm in a state where I think I'm winning a game, I make a move, and I find myself in a situation where I'm losing. That generates a TD error: those two estimates were inconsistent, and I now have to think I was probably losing all along, so I get a strong negative TD error telling me to reduce my value estimate. We plug that TD error in as an estimate of the advantage, which says that whatever I did on that step, whatever move generated that TD error, was probably a bad idea: I thought I was winning, I made some move, and now I discover I'm losing, so that was probably a bad move. That's the intuition. The nice thing about this is that we only need to estimate V in our critic; we don't need to estimate Q, just the state value function, and no actions come into it. What we do is simply plug the TD error in, and this is yet another way to rewrite the policy gradient theorem: we haven't changed anything, and there's no approximation yet.
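A sketch of that argument in symbols, reconstructed from the definitions used in the lecture:

    \delta^{\pi} = r + \gamma V^{\pi}(s') - V^{\pi}(s)

    \mathbb{E}\!\left[\delta^{\pi} \mid s, a\right]
      = \mathbb{E}\!\left[r + \gamma V^{\pi}(s') \mid s, a\right] - V^{\pi}(s)
      = Q^{\pi}(s,a) - V^{\pi}(s)
      = A^{\pi}(s,a)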
But if we use a critic to estimate V, a V-hat instead of the true V^π, use the TD error based on that approximate value function, and plug that estimate of the TD error in, then we end up with something that is a good approximation to the policy gradient theorem. That's a practical algorithm: we estimate V, we generate TD errors using our V, we move in the direction of those TD errors, and we only need one set of critic parameters v to do it.

Okay, just a couple more ideas to throw into this family of things you can try, and then I'll summarize and bring them all back together; there's one slide at the end that you can come back to that lists the different things you can do with actor-critic. This one is really a reminder from the previous lectures: what about eligibility traces and different time scales? Remember that you don't always want to go all the way to the end of the episode and estimate the return, and neither do you necessarily want to take just one step and bootstrap from the value function after one step, because that's biased; you often want to trade off bias and variance between those two extremes using something like TD(λ). We're now going to do the same thing with actor-critic algorithms, and it turns out to be quite straightforward.

From last lecture, when updating your value function approximator you had a choice of targets: you could plug in the return, the one-step TD target, or the λ-return, and we also had eligibility traces, which gave a backward view equivalent to plugging in the λ-return in the forward view. Hopefully that's starting to become familiar; if it's not, go back to the notes, because it's useful stuff and a key idea. Now we plug exactly the same idea into our actors. First of all, you can do all of this with the critic (some of these slides should say v rather than w): as I mentioned before, we can evaluate our policy by any means we like, TD(0), Monte Carlo, TD(λ), least-squares policy evaluation, so we can introduce different time scales into the critic by choosing TD(λ) with an eligibility trace. But that doesn't affect the actor yet. The question is whether we can also get these different time scales into the actor, so that information flows back more efficiently and the actor isn't just based on the current error but on something that bootstraps over all different time steps. The answer is yes.

Very briefly, start again from the true policy gradient theorem, the score multiplied by the value function or, equivalently, the advantage function, and note that there are different things we can plug in; so far we've seen two examples. With Monte Carlo policy gradient we plugged in the return, multiplied by the score; if we also subtract off the baseline value function, we end up with the return minus the value function, multiplied by the score. When we used the TD error, we ended up with the one-step TD target, the reward plus the discounted value at the next step, minus our baseline. So these are different targets: the Monte Carlo version plugs in the return, and the TD error version, where we estimate the advantage using the TD error, plugs in the TD target and subtracts the value function as a baseline. These are different ways to approximate the original policy gradient theorem. Sometimes it's a good idea to use the first, a high-variance, low-bias estimator of the policy gradient, and sometimes it's a good idea to use the second, where we introduce bias by bootstrapping from our value function but dramatically reduce the variance.

Again, the goal is to get the best of both worlds: a gradient estimate that bootstraps from our critic at all kinds of different time steps. We don't want to rely only on the critic at this time step to estimate the gradient; the critic at this step says I should go this way, the critic at the next step says I should go that way, and what we really want is to combine these together and move in some combined direction. That's the idea of using eligibility traces, and it's exactly analogous to the value-based case: we compute our TD error and build up an eligibility trace, except that the trace is now an eligibility over our scores. We remember all the scores we've seen recently, so if I had a high score for a particular action in a particular state, we bump up the eligibility trace, and later, when we come to make an update, we move in the direction where the scores have been largest, most recently and most frequently. The best way to understand this is by analogy with eligibility traces in the value-based case, where you can literally look at the updates and see that they're identical: in value-based reinforcement learning we would take, say, the λ-return minus our current value and multiply it by the gradient of the value function, not by the score. All we need is a search-and-replace: wherever we had the gradient of the value function before, we replace it with the score, and that actually works and gives us a legitimate eligibility trace for the actor.
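Here is a minimal sketch of that backward-view actor-critic with eligibility traces, reusing the softmax_policy and score helpers from the earlier sketches. The state-value critic V_v(s) = phi(s) . v, the two feature functions, the step sizes, and the inclusion of the discount factor in the trace decay are my own assumptions (one common convention); the structure (a critic trace over value-function gradients and an actor trace over scores, both driven by the TD error) follows the search-and-replace analogy described above.

    import numpy as np

    def actor_critic_lambda_episode(u, v, env, alpha=0.01, beta=0.1,
                                    gamma=0.99, lam=0.9):
        """Backward-view TD(lambda) actor-critic with a linear state-value critic."""
        s = env.reset()
        e_v = np.zeros_like(v)                    # critic trace over grad V_v(s) = phi(s)
        e_u = np.zeros_like(u)                    # actor trace over scores grad log pi
        done = False
        while not done:
            phi_s = env.state_features(s)         # assumed: phi(s)
            feats = env.features(s)               # assumed: phi(s, a) for the actor
            a = np.random.choice(len(feats), p=softmax_policy(u, feats))
            s2, r, done = env.step(a)
            v_s = phi_s @ v
            v_s2 = 0.0 if done else env.state_features(s2) @ v
            delta = r + gamma * v_s2 - v_s        # TD error: sample of the advantage
            e_v = gamma * lam * e_v + phi_s       # accumulate critic eligibility
            e_u = gamma * lam * e_u + score(u, feats, a)  # accumulate actor eligibility
            v = v + beta * delta * e_v            # critic update
            u = u + alpha * delta * e_u           # actor update
            s = s2
        return u, v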
So, to summarize this idea: we're building up a toolkit of ideas that make actor-critic effective in practice, and we now have a new piece, eligibility traces, which let the actor make use of critics from many different steps all the way into the future. The nice thing about this approach is that, unlike Monte Carlo policy gradient, we can apply it online: we can use critics that build up information going far into the future, but do it online, in incomplete sequences, in continuing environments that never stop, or in situations where we're working off-policy.

I'll talk very briefly about the next point. There's a really important question in actor-critic algorithms, in some sense the first question you should ask about them. We used a critic, we replaced the true value with some approximation to it, plugged that in, and hoped that the gradient we follow is still correct. In other words, we derived the policy gradient theorem, which is the true gradient to follow if you really want to improve your policy, and then we substituted in a critic and said you should follow what the critic says instead of the true value function. How do we know that's valid? How do we know this really is pushing us in the correct gradient direction and we're not following some bogus gradient? The answer is that, amazingly, if you choose the value function approximator carefully, it's possible to pick one that doesn't introduce any bias at all: despite the fact that we don't have the true value function and are approximating it, we can still be guaranteed to follow the true gradient with our policy updates. That approach is called compatible function approximation. I won't go into the details, but the main idea is that the features we use for the critic are themselves the score function: we build features for the critic that are the score of our policy, and if we use a linear combination of those particular features then, according to the theory on this slide and the following one, which I'll leave you to look at afterwards, we don't affect the gradient direction; we still follow the true gradient if we use these compatible features.

Now, I know I'm throwing a lot of ideas at you and there's a big soup of possible things to follow, but I want to throw in one last idea. This is a recent idea, and I think it's a useful one to know about, and it's just this one slide. So far we've considered stochastic policies, policies like a Gaussian policy where we say we want the mean of the policy to be some value, and sometimes we take actions equal to the mean and sometimes we perturb it a little by some noise. But this can be very noisy: so far we've been estimating our policy gradients by sampling our own noise, taking an expectation over a noisy policy, and it turns out that taking expectations of the gradient of your own noise can be a really bad idea; it's really hard to estimate. In fact, if you start with a Gaussian policy and improve it over time, you'd hope it would start to narrow down as you figure out how to solve the problem, but as the Gaussian becomes narrower and narrower the variance of your estimates actually blows up to infinity. Estimating the policy gradient becomes harder and harder; the noise hurts you more and more as you get better and better at finding the right policy. That's an unfortunate property of the policy gradient algorithms we've seen so far.

It turns out there's an alternative, something we did last year, which is to work directly with the limiting case. Instead of adding noise into our policy and then trying to narrow that noise down to something roughly deterministic, what if we just start with a deterministic policy? We start with some deterministic policy, and we adjust its parameters so as to get more objective, following the same objective functions as before. It turns out that if you take the limiting case of the policy gradient theorem we've seen so far, you end up with a very simple update which has a slightly different form, but you should think of it as just another rewrite: it's exactly equivalent, just the limiting case where the policy is a deterministic function, where we've narrowed the noise down to zero, always picking the mean and never adding any noise in. And it has this particularly nice form: all we need to do to follow the gradient, to adjust the parameters of this policy (the slide should have a u there as well) so as to get more objective, is to take the gradient of our own Q function. We look at our critic, and the critic gives us the gradient that tells us the way to get better actions. This only works for continuous action spaces, by the way, but it says: here's the gradient of your value function with respect to your actions; if you made this action a little bit higher you would get more reward, and if you made that other action a little bit lower you'd get less reward. So the critic already contains all the information about how to adjust the actions so as to get more or less reward; that's all there in the gradient of Q. It tells you, just move left a little bit more and you'll get a lot more reward. Then all you need to do is work out how to adjust your parameters so as to produce more of the actions the critic is suggesting, and that's the gradient of the policy: it's just the chain rule, saying adjust the policy in the direction that gets you more Q. That's the deterministic policy gradient theorem. It's very simple and intuitive, and in practice it seems to work a lot better than stochastic policy gradient in situations with continuous action spaces; it scales up much better to high dimensions.
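In symbols, the deterministic policy gradient described here can be written roughly as follows (a sketch; μ_u denotes the deterministic policy a = μ_u(s), and the second line is the corresponding chain-rule update with the critic Q_w standing in for the true Q):

    \nabla_u J(u) = \mathbb{E}_{s}\!\left[\nabla_u \mu_u(s)\; \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_u(s)}\right]

    \Delta u = \alpha\, \nabla_u \mu_u(s)\; \nabla_a Q_w(s,a)\big|_{a=\mu_u(s)}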
There's one more approach in the notes that I'm going to skip: the natural actor-critic. It's an alternative descent (or ascent) direction with some nice properties, and it falls out very nicely in the actor-critic framework; it's there in the notes, so feel free to refer to it.

But I wanted to put up this summary slide before we finish. This is a summary of all the different manipulations we've done to our actor-critic algorithms. We started with the policy gradient theorem, and we plugged in different approximations to it. We can get the gradient of the policy objective by taking the score function and multiplying it by the return; or by the value function; or by the advantage function (the symbol on the slide there should be the advantage); or by the TD error, which is another way to estimate the advantage, by taking a sample of it. We can use an eligibility trace, where instead of just the current score we accumulate an eligibility over all of our scores over many time steps and use that trace instead. And we can use the deterministic actor-critic, where we move in a direction that gets us more Q. All of these are essentially variants of the same idea; they're different manipulations of saying: move the policy in the direction that gets you more value. What varies is how we estimate values in the critic, and how we make use of that to update the policy; the critic can vary and the actor can vary, but essentially they're all doing roughly the same thing.

For each of these cases we can make a stochastic gradient ascent algorithm, where all we do is drop the expectation by sampling the thing inside it. In each case you end up with a very straightforward, simple algorithm: you take a step (or an episode, if you're doing the top one), you look at what happened in that one step only, you have the gradient of the score for that step, so you know, for that state, how to have adjusted your policy so as to take that action more often, and you look at whether the step was good or bad according to your critic; if it was good, you push your policy parameters to do that thing more often. So all of these have very simple stochastic updates, and for the critic we use our favourite algorithm, Monte Carlo, TD, TD(λ), the whole family of approaches we've considered in previous lectures.
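One compact way to write that summary (my own reconstruction, using v_t for the sampled return, Q_w, A_w and δ for the critic's estimates, e_t for the eligibility trace, and μ_u for a deterministic policy):

    \nabla_u J(u) = \mathbb{E}_{\pi_u}\!\big[\nabla_u \log \pi_u(a\mid s)\; v_t\big]                (REINFORCE)
    \nabla_u J(u) \approx \mathbb{E}_{\pi_u}\!\big[\nabla_u \log \pi_u(a\mid s)\; Q_w(s,a)\big]      (Q actor-critic)
    \nabla_u J(u) \approx \mathbb{E}_{\pi_u}\!\big[\nabla_u \log \pi_u(a\mid s)\; A_w(s,a)\big]      (advantage actor-critic)
    \nabla_u J(u) \approx \mathbb{E}_{\pi_u}\!\big[\nabla_u \log \pi_u(a\mid s)\; \delta\big]        (TD actor-critic)
    \Delta u = \alpha\, \delta\, e_t,\quad e_t = \lambda e_{t-1} + \nabla_u \log \pi_u(a\mid s)      (TD(λ) actor-critic)
    \nabla_u J(u) \approx \mathbb{E}_{s}\!\big[\nabla_u \mu_u(s)\; \nabla_a Q_w(s,a)\big]            (deterministic actor-critic)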
Okay, that's it. Any last questions in the last 30 seconds? Okay, thanks everyone, see you next week.

Okay, so a couple of announcements just before we begin, and sorry about that, it's a very temperamental projector. First of all, the final lecture, lecture 10: I sent an announcement to the mailing list, but just in case you didn't catch it, that will take place on the morning of the 1st of April. This isn't an April fool, so if you saw that mail and thought you didn't have to wake up that day, there really is going to be a lecture. I appreciate that's outside the time when classes normally occur, so as a result it's an optional lecture. It's going to be a lecture on games, a case study on classic board games with reinforcement learning; I think it's interesting and it will pull together a lot of elements of the course, but it's optional, so if you've already booked your time away and can't attend, don't worry, it's not going to affect your exam. If you do have the chance to come along, it should be interesting. I'll also mention that all of these videos will be made available online very shortly, so if you do miss anything you can catch up online. The second announcement is the deadline for the assignment, which is the following day, April 2nd, so hopefully you've made a start on it. If you do have questions, please send them to the mailing list, because if you're wondering about something, it probably means ten other people have the same question, and it's useful to ping those backwards and forwards on the list just to make sure everything's crystal clear.

Okay, so let's get going. Today's lecture is on policy gradient methods, a very popular, widely used class of reinforcement learning methods. Again, we're in the part of the course where we talk about things that are actually used in practice to solve interesting, challenging reinforcement learning problems. To give an outline of how this will pan out: we'll start with an introduction to understand roughly what's going on, and then we'll focus in on policy gradient methods. What's going on in this class is that we're talking about methods that optimize the policy directly. Instead of working with value functions, as we've considered so far, this is about working with the policy: how can I see some experience and, from that experience, figure out how to change my policy in a direction that makes it better? We'll see a variety of methods for doing this, all based on the simplest principle, if you like, for improving a policy, which is policy gradient: we adjust the policy in the direction that makes it better, we follow the gradient. We'll consider three ways to do that. We'll start with a naive but sometimes effective numerical approach using finite differences. Then we'll see that for many of our reinforcement learning methods there is an analytic formulation of the policy gradient, so, despite what you might think, you can have an agent that follows its trajectory, sees how much reward it gets along the way, and also knows in which direction to adjust its policy so as to move directly in the direction that makes it better. And finally, the most practical set of methods we'll consider are the actor-critic policy gradient methods, which combine everything from this class with everything from the previous class: they work with both value functions and policies and try to get the best of both worlds.

So, the introduction. Last lecture we talked about value function approximation, and we considered two types. We considered approximating the state value function, the true state value function that tells us how much reward I can really expect to get in this MDP if I follow policy π: if I'm in some state s, what's the true expected reward from that point onwards? And we had the Q function, which tells us, for some state s and action a, the true cumulative reward along a trajectory from that point onwards, taking that action under the policy. Those are the ground truths, and what we did last lecture was approximate them using a function approximator, like a linear combination of features or a neural network, with some parameters w, fitting a function to the ground truth using ideas familiar from supervised learning, so that we started to understand the shape of how much reward we can get from different parts of the state space and the action space respectively. But in that lecture, when we actually wanted to do control, we didn't represent the policy explicitly; we just used the value function to pick actions. Once we had our function approximator for Q, that was enough, because we could act greedily, or perhaps epsilon-greedily, to pick the action with the highest Q.

That was a very effective approach to control using function approximation, but it's only one approach, and in some sense a more direct, sometimes more natural, approach is to parameterize the policy directly, not the value function. So now we have a different function approximator with some parameters u; last lecture we had w, and the difference between last lecture and this one is basically whether it's w or u. We use these parameters u to parameterize the policy: our policy π is now something we directly manipulate, and we control the parameters that affect the distribution by which we pick actions. If it were deterministic, which we'll consider later on, the policy could just be a function saying which action to pick in a given state, and we learn the parameters that tell us, for each state, which action to pick. Again, you could think of this as some arbitrary function approximator, like a neural network, where you plug in your state and it tells you which action to pick, or a distribution over actions to pick. Formally, the policy is now a probability distribution over actions conditioned on both the state and our parameters; we're defining a probability distribution and learning how these parameters should change the probabilities with which we pick different actions in different states. What we're going to do is try to understand how to modify this parameter vector u so as to solve the reinforcement learning problem, to maximize the amount of reward we can get in this environment. And again, in this lecture we'll focus on model-free reinforcement learning: throw your agent, your robot, into some maze, and just from experience, without anyone telling it the dynamics of the environment, it should be able to figure out how to adjust the parameters of its policy so as to get more reward from the environment.
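To make the parameterization concrete, here is a small sketch of a Gaussian policy for continuous actions, one of the stochastic parameterizations mentioned in this course; the linear mean, the fixed standard deviation, and the feature function are assumptions made for illustration.

    import numpy as np

    def gaussian_pi_sample(u, phi_s, sigma=1.0):
        """Continuous action: a ~ N(mu(s), sigma^2) with mean mu(s) = phi(s) . u."""
        mu = phi_s @ u
        return np.random.normal(mu, sigma)

    def gaussian_score(u, phi_s, a, sigma=1.0):
        """Score grad_u log pi_u(a|s) = (a - mu(s)) * phi(s) / sigma^2."""
        mu = phi_s @ u
        return (a - mu) * phi_s / sigma**2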
In case it's not clear, the main motivation behind both of these approaches is that we want to be able to scale: to scale to large, interesting MDPs, large, complicated environments in which it's not possible to say separately for each distinct state, I'm going to take this action here and that action there. We need a function approximator, some parameters that specify how we're going to pick actions across the whole state space. Apparently I spent too long on that slide... ah, that's the power. While we're waiting for my computer to come back to life, any questions? Hopefully the motivation is clear.

What we'll start to do now is understand how to take this parameterized policy, with its parameter vector u, and adjust those parameters so as to get more reward. The main mechanism we'll consider is gradient ascent: how can we compute the gradient to follow so as to make this policy better? If we can follow that gradient, we'll strictly be moving uphill in a way that improves our policy; if we know the gradient with respect to the total reward we're going to get in this MDP, we just follow it in the direction that gets us more reward. Over the next few slides we'll formalize what that means, if my computer comes back to life, and we'll consider a few different ways to formalize the objective function. Before we do that, a question for you while we're waiting: can you think of any advantages or disadvantages of using policy gradient compared to value-based methods? Why might it be a better idea to work with policy-based methods?

Answer from the audience: you might need to store fewer things, because you don't have to store the value function approximator. Right: there are some situations where it's more efficient to represent the policy than the value function. Imagine you're playing one of these Atari games, like Breakout. It might be that the value function is quite complicated, that precisely working out how much score you'll get from some point onwards is tricky, but representing the fact that if the ball is coming down over here you should always move left might be a very simple policy to represent, whereas working out that you're going to get exactly 1,714 points if you move left might be a much more complicated function to approximate. So there are cases where a policy can be more compact.

Now that we've got the slides back, let me come back to this question. Some other reasons it might be a good idea to use policy-based reinforcement learning: maybe the main one you'll see in the literature is better convergence properties. We saw that there are cases where value-based methods can oscillate, chatter, or even diverge if you do things the wrong way, whereas if you just follow the gradient of your policy, these methods tend to be more stable: you smoothly update your policy, and you don't get dramatic swings in which action you'll take, because you just make a little incremental change in the direction that makes your policy better. So they tend to have better convergence properties, and indeed if you directly follow the policy gradient you're guaranteed to converge, at least to a local optimum.

Another big one is that these methods are effective in continuous or high-dimensional action spaces. With a value-based method you have to work out how to compute the max, and I would say this is probably the number one reason to use policy gradient methods: they circumvent the requirement for that max. If we're doing something like Q-learning, or indeed something like SARSA, there's always a maximization. I've been arguing all along that once we've got Q, all you need to do to pick actions is maximize over a, pick the action with the highest Q, and you're done, you've found the best way to behave. But what if that maximization is itself prohibitively expensive? What if you've got a trillion actions, or a continuous action space? What do you do? This is actually quite common: a robot operates in a continuous action space of torques and so forth; or suppose you're trying to solve an internet-based problem of creating the perfect advert to show to your users, where the advert has many different components you'd like to move around, and you'd like to show a sequence of adverts that maximizes your income with that user. That's a vast action space. So what do you do? You can solve that complicated maximization at every step, or you can use policy-based methods, where you adjust the parameters of your policy directly and never have to solve for the max; it's as if you incrementally come to understand what the max will be, rather than estimating it directly at every step. And the third reason to consider policy-based methods is that they can learn stochastic policies; in the next couple of slides we'll see why that can sometimes be a good idea.

Disadvantages: the main one is that naive policy-based reinforcement learning can be slower, higher variance, and less efficient than value-based methods. The way to think of this is that value-based methods, when you use the max, are aggressive: the max tries, in one step, to push your policy all the way to what you currently think is the best policy, whereas policy gradient methods just take a little step in that direction and update smoothly, which makes them more stable but also sometimes less efficient. And computing that gradient step, if you do it naively, can be high variance and slow the whole algorithm down; we'll see later how to get the best of both worlds by including a value function, and in practice you may want a value function as well, which adds a little to the complexity of the algorithm. Finally, the flip side of the earlier point: there are examples like Breakout where a policy can be more compact, but it's also conceivable that there are cases where the value function gives you a very nice compact representation and a more direct way to learn the solution to your MDP.
want the thing which maximizes don't we want to maximize our reward and if we're always maximizing reward intuitively you might think a deterministic policy should be good enough because deterministic policy is always some you you can always find it um maximizing something um can be a deterministic process but there are cases where it's actually desirable to have a stochastic policy and we'll see two of those now um so the first example is rock paper scissors or Ro Lambo or it has lots of different names depending where you're from and how you play this um but it's just the game where you show paper scissors or stone and scissors beats paper rock beats scissors paper beats Rock um and so you probably know this but if you were to just play this deterministically any deterministic strategy is very easily exploited if I always play rock and I start to play against you you'll very quickly figure out that you should just play paper and and I'll lose every single time and in fact um the only optimal policy in this domain in terms of a Nash equilibrium which is the game theoretic notion of optimality the only optimal policy is to play um actually uniformly at random anything else can be exploited because if you play one choice more often your opponent will learn to exploit that so this is an example where the optimal behavior is actually um stochastic a stochastic policy is necessary a second example is when we have state alus in so this comes up when the markof property doesn't hold so we might be in an mdp but it might be that we we lose the markco property we get partially observed environment so if you remember this distinction between fully observed and partially observed what's going to happen now is we're only going to see certain things about this environment or equivalently we're only going to use features um which give us um incomplete information about where we are in this environment so consider this this little problem here so as you might get yes getting to this treasure is good and getting to the skull and crossbones is bad um and this is just a little grid world you get to move around it go northeast south or west um and the question is um what's the uh best that you can achieve in this domain except we're going to restrict the way that we can create our value functional policy to use the following features so we're going to cons consider features which basically tell us they features of both the state and action so they're going to be features that say things like is there a war wall um to the north of me is there a wall to the south of me um in between these squares by the way if they're adjacent there's no wall between them so this is like is there an exterior wall um and the other part of the feature is is there an exterior wall to the north and I'm moving to the East and that would be a feature of my state end action pair okay um so we're going to consider features like this is there a wall to the north when I'm going east and we're going to consider all such features that is there a wall to the West when I'm going south and um all kinds of things like this um and we want to know you know what's what's the best I can do using either value based or or policy based reinforcement learning um so in each case we're going to approximate either the value function using these features and doesn't matter what the function approximator is this could be linear it could be a complicated neural network it could be your favorite thing the question is what's the best that could be done in 
and we're going to compare that to policy-based reinforcement learning, where we parameterize the policy using some function of these same features, and the question is which will do better. Does anyone have any intuition about this? I'll show it over the next couple of slides, but have a quick think about what's going on. Yes: you're not going to be able to tell the difference between those two gray squares with these features. Right, so the point was that these two gray squares are aliased: there's no way to tell them apart using the features that you have. If you're in one of them going west, or the other going west, those two situations look exactly the same to you: there's a wall to the north, a wall to the south, and you're going west, so your feature vector is identical and you can't tell the difference between those two states.

As a result, if you use a deterministic policy parameterized in terms of these features, you have to pick the same action in both of those gray states. And imagine you learn a value function: the value for those two states has to be the same, and if you act greedily with respect to that value function you'll either go west all the time or go east all the time. So this greedy idea of generating a deterministic policy, or even a near-deterministic policy, leads to the following behavior: if you're on the wrong side of the map, you oscillate backwards and forwards forever if you're greedy, or for quite a long time if you're epsilon-greedy, and you never reach the treasure, or take a very long time to reach it. If you use a stochastic policy, things are much better, because we can optimize the policy so as to go east or west with 50/50 probability in those gray states. Now if we're in a gray state we just randomly pick a direction, and if we find ourselves over on the far side, sometimes we'll go left but sometimes we'll go back right, and then we'll be pushed down to the treasure. So in just a couple of steps we're likely to reach the treasure, compared to a value-based method acting greedily with respect to its value function, which in certain cases takes a very, very long time to reach the treasure, possibly forever.

So the summary of these couple of slides is that whenever state aliasing occurs, a stochastic policy can do better than a deterministic policy. We had a theorem early in the course saying that there's always a deterministic optimal policy for any MDP, but that was for a Markov decision process in the table lookup case, where we have a perfect state representation. The moment you have state aliasing, so you're partially observed, either because you're literally in a partially observed Markov decision process, or because the features your function approximator uses limit your view of the world, which is equivalent to being partially observed, then it can be optimal to use a stochastic policy, in which case policy search methods can do better than value-based methods that just act greedily.

Okay, so let's talk about how to optimize a policy. What does it mean to optimize a policy? Well, we need to know what the objective should be, so here are three different objective functions we might choose. We want to know how good
is this policy. So we're going to consider some parameterized policy with parameters u, and we want to know, essentially, what's the best u.

The first case we're going to consider, if we're in an episodic environment, is the start value. This basically says: if I always start in some state s1, or if I have some distribution over start states, what's the total reward I'll get from that start state onwards? This only works if there's a notion of a start state. If there is one, you might say that the thing I really care about is finding the policy which, when I drop my agent down in the start state, when I start my game of Atari and play it all the way to the end, gets the most score.

Or we could use the average value, which is something we can do in a continuing environment, an environment that just goes on forever, where there might not be a start state. Now we might say: let's consider the probability of being in any state, the stationary distribution, the probability that we actually end up in each state under our policy pi, times the value from that state onwards. So it's like averaging over the values of all the states I might be in. I might be in state one and get 10 units of reward from that point onwards, or in state two and get 20 units of reward from that point onwards, and we average over those and say the objective is maybe 15.

Or we might consider another formulation, the average reward per time step, which is similar, but says: consider some continuing environment where I'm going round and round, and all I care about is making sure that per time step I'm getting the most reward. So we're not summing rewards over time; we're taking the average of my immediate rewards over the entire distribution of states that I visit. The objective is: there's some probability that I'm in a state, some probability that I take an action from that state under my policy, and there's the immediate reward I get at that step; what we care about is getting the most reward per time step, which is another way of saying extracting the most juice from the environment per step. This relates to the average reward formulation that was an extra for lecture two, which you can find in the notes.

It turns out, fortunately for us, that exactly the same methods apply to all three of these. The policy gradient is essentially the same for all of them; the only thing which differs is the definition of the state distribution term, the probability that I'm in some state s, and these are just rescalings of each other. So they basically follow the same gradient direction, and we don't really need to worry which of them we're following, because we end up with the same policy gradient.
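Just as a reference, here is one way to write those three objectives down in the notation of this lecture (u for the policy parameters, d for the stationary state distribution under the policy); this is my transcription rather than a verbatim copy of the slide:

```latex
J_1(u)    = V^{\pi_u}(s_1)                                              % start value (episodic)
J_{avV}(u) = \sum_s d^{\pi_u}(s)\, V^{\pi_u}(s)                          % average value
J_{avR}(u) = \sum_s d^{\pi_u}(s) \sum_a \pi_u(s,a)\, \mathcal{R}_s^a     % average reward per time step
```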
So now what we're going to consider is optimizing one of these objectives. We're going to pick one of those three objectives and say: we want to find the parameters u that maximize this objective function; find the u such that when you drop your AIBO robot down, it walks as fast as possible. How can we do that? Well, there are lots of familiar approaches to optimization, and hopefully you've come across some of these. Optimization methods are classically divided into gradient-based methods and gradient-free methods. If you don't have access to the gradient, there are algorithms you could use like hill climbing, or the simplex method, where you've got a triangle that you flip over and over as it folds its way down the hill, sometimes called the amoeba method or Nelder-Mead, or genetic algorithms. These are all optimization methods you could throw at a policy search problem to try to find the policy that maximizes one of these objective functions. But the reason people typically consider gradient-based methods is that if you have access to the gradient, you can almost always get greater efficiency out of your optimization. Some familiar methods are gradient descent, conjugate gradient, and quasi-Newton methods: a large family of methods that make use of the gradient information to gain efficiency, because the gradient points you in the direction of greater reward, points you towards how to do better in this MDP, rather than having to blindly try things by trial and error and figure out which is best.

We will focus on plain gradient ascent, the simplest instance of this (the mirror image of the gradient descent we used last lecture). Of course, once you understand how to do that, you should also understand that this can be extended to other methods, for example second-order methods or quasi-Newton, but I think the ideas are best illustrated, and often work most effectively, in the simplest case, where with very little computation you can get a great deal of effective reinforcement learning. We're also going to focus on methods that exploit the sequential structure, by which I mean we're not just going to do blind optimization like a genetic algorithm, where you throw your robot down and wait until the end of its lifetime to get one number that pops out and says "I achieved X in my lifetime". We would rather break open the trajectory, see that partway through it the agent reached some sequence of states and rewards, and make use of that sequence to do better than waiting all the way until the thing dies before we can optimize it. We'd like to learn within the lifetime of the agent.

Okay, so we'll start with the very simplest version, which is finite difference policy gradient methods. Let's begin with the familiar idea of what a policy gradient is. In the last lecture we considered gradient descent, because we had something like a mean squared error that we were trying to minimize. In this lecture we're going to consider gradient ascent, where we have an objective function which is something like "how much reward can I get out of this system", and we want to make that higher; we want to ascend our objective function. So imagine there's some surface where, for any setting of the policy parameters (imagine these are two different policy parameters, maybe parameters of how my AIBO walks around), the surface gives the value of the objective, and we're going to adjust those policy parameters in the direction that gets us a faster walk, for example. This surface has some shape, and the gradient points us uphill in the
direction of steepest ascent. So all that says is that, mathematically, we're going to adjust our policy parameters a little bit in the direction of the gradient of this objective function, and the gradient is just the vector of partial derivatives: the vector that says, along this axis, how much progress can I make by adjusting this parameter, and along this other axis, how much progress can I make. We move in the direction that maximizes our progress, that makes the most possible difference in an infinitesimal neighborhood around where we are now. We're adjusting the parameters to get more juice out of our environment.

The simplest way to do this is the naive, numerical approach. If you had no way to understand what the gradient was, you could just look at your objective function and estimate the gradient numerically: perturb your parameters a little bit in each dimension separately, ask what would happen if I changed the policy parameter along the first axis a little bit, and take the difference between the objective after the perturbation and the objective before. That's an estimate of the gradient in that direction, and doing it for each dimension separately gives a numerical estimate of the whole gradient. This is a little bit naive, because you need an awful lot of evaluations if you're working in a high-dimensional space. Imagine your policy is a neural network with a million parameters: now you need a million different evaluations of your policy, perturbing it in all those directions, just to get one estimate of what the gradient is. There are techniques that use random directions, like SPSA, which reduce the number of samples you need to take, but they're still very noisy, and it takes an awful lot of computation with these methods to get even one estimate of the correct gradient. Nevertheless, it's simple: if you really can just perturb your policy and try again multiple times, you will eventually get something that points you in the correct direction. It works for arbitrary policies, and one big advantage is that it works even if you've got a non-differentiable policy, like when someone gives you the program for your robot and tells you here are some parameters for that program, but we don't know how to differentiate with respect to them. You can still adjust it by following this process.
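As a rough sketch of that numerical procedure, assuming some black-box `evaluate_policy(params)` that runs the policy and returns a (noisy) estimate of the objective, for example how far the robot walked; the function names, epsilon and step size here are all illustrative, not something from the lecture:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, params, eps=0.05):
    """Estimate dJ/du by perturbing each policy parameter separately."""
    grad = np.zeros_like(params)
    base = evaluate_policy(params)            # J(u) with the current parameters
    for k in range(len(params)):
        perturbed = params.copy()
        perturbed[k] += eps                   # nudge the k-th dimension only
        grad[k] = (evaluate_policy(perturbed) - base) / eps
    return grad

# naive gradient ascent on, say, 12 gait parameters (hypothetical usage):
# for _ in range(num_iterations):
#     params = params + 0.01 * finite_difference_gradient(measure_walk_speed, params)
```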
So I've been talking about an AIBO, so let's make that concrete. It turns out that in RoboCup soccer (things have got a little better since, but I think this is roughly still true), in the AIBO league in particular, one of the leagues of RoboCup soccer, they have these little dog robots, the AIBOs, running around trying to score goals against each other. It's quite a fun competition, but the thing which determines which team wins is basically how fast you can get your AIBO to run: if it can walk faster, you're able to take the ball and move it down the field faster, and so the team with the fastest AIBOs has historically always won. So it became an interesting machine learning challenge to make the AIBO gait more efficient and make it run faster. The AIBO gait is controlled by twelve real numbers, which are sort of shown in this diagram; it doesn't really matter exactly what they are, but the goal is to adapt those parameters. The first people to do this, Peter Stone and colleagues, did it by finite difference policy gradient, just like we described on the last slide. They basically ran these robots backwards and forwards along the field and measured how long it took them with each gait, giving it a few walks up and down, and did that for each of the twelve dimensions separately. So you have to run it up and down to get one dimension, run it up and down again, twelve times, just to get one gradient update; then you adjust your parameters in that direction and iterate this process until your grad students collapse of exhaustion. And so they did this. Let me see if I can pull up some videos. Okay, so this is at an early stage, before it's really been trained, and the main thing to notice is that it slips a lot; this thing is really dependent on the ground it's on and it tends to slip an awful lot. Over time it starts to learn something a little bit better. Oh, there we go, let me try that again, I think that was just an artifact. You can see it's learned this much better gait that deals with the slipperiness of the surface much better, and it can move much, much faster. So this was just by adjusting the parameters in the direction of the gradient in a very naive, straightforward way, and some limited number of iterations was sufficient to make progress in this domain and actually figure out a reasonable gait.

Now, if you move to higher dimensions, these numerical methods tend to collapse, and so what we'll consider for the rest of the class is how to compute the gradient analytically, so you don't have to laboriously estimate it separately for each of your dimensions. We'll start with the simplest approaches, with no value functions yet, which are the Monte Carlo approaches, and then we'll come back and bring value functions into the picture.

So we're going to use the following idea. We want to compute the policy gradient analytically, and we're going to assume that our policy is differentiable. It doesn't technically have to be differentiable everywhere, only where it's actually picking actions; if you have some non-differentiability but you never actually pick that action, that's okay. We're also going to assume that we know the gradient of our policy, that we're in control of our policy: it could be a Gaussian policy or a softmax policy, we'll see some examples, or a neural network, and we know the gradient of this thing because we created it. And what we're going to use is a trick called likelihood ratios. I'm going to put the likelihood ratio trick up in isolation, because for pretty much the rest of the course this magic gradient-of-the-log-policy term is going to appear, and I think it's useful to understand why the gradient of the log appears; otherwise you'll just be scratching your head asking why we're suddenly looking at the log of the policy and not the policy itself. It comes up from the following likelihood ratio trick, where what we want to do is take the gradient of
our policy, we want to understand the policy gradient, and we're going to want to take expectations of that thing. The way we do that is to note that we can multiply and divide by our policy without changing anything, and that this term on the right, the gradient of the policy divided by the policy, is equal to the gradient of the log of the policy. That's just straightforward calculus. And this term is a very familiar quantity from statistics and machine learning; it's sometimes called the score function. If you were doing something like maximum likelihood learning, this is the term which would come up: if you were just trying to maximize the likelihood of some action a that someone gave you, you would follow this gradient; it's the thing which maximizes the log likelihood. So the way to think of this term is that it tells you how to adjust your policy in the direction that gets more of something. If you were told the right action and you were doing supervised learning, you would plug exactly this term in for that particular supervised action, and it would tell you how to get more of that particular action. So this tells us how to adjust the policy in a particular direction to achieve more of something that's good. And what we see is that by rewriting the gradient in this way, we're able to take expectations: computing the expectation of the raw gradient is hard, but computing the expectation of this rewritten form is easy, because the policy appears inside it and it's the policy we're actually following, so it's something we can sample from. We'll see that in a minute.

But just before we do, let's consider this score function and what it looks like for two common examples. Perhaps the simplest example to have in your mind, which we use for discrete actions, is the softmax policy. The softmax policy is something where we want a smoothly parameterized policy that tells us how frequently we should choose each action in a discrete set; it's an alternative to something like epsilon-greedy. The idea is that we take some features, form a linear combination of those features, and consider that as a kind of value telling us how much we'd like to take an action: we've got features of our action and state, we've got some parameters, we weight those features and add it all together, and that tells us how much we desire that particular action a. Then, to turn that into a probability, we just exponentiate and normalize: the probability that we actually pick an action is proportional to the exponentiated value we get from the linear combination of features. This is called the linear softmax policy. Softmax in general is a policy that's proportional to some exponentiated value, and you can choose those values to be anything you want; here we're parameterizing them in the simplest possible way, as a linear combination of our parameters with our features. So this is a very straightforward way to parameterize the policy in a discrete domain.
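Written out (my transcription, with u for the policy parameters and phi(s,a) for the features), the likelihood ratio identity and the linear softmax policy are:

```latex
\nabla_u \pi_u(s,a) \;=\; \pi_u(s,a)\,\frac{\nabla_u \pi_u(s,a)}{\pi_u(s,a)}
\;=\; \pi_u(s,a)\,\nabla_u \log \pi_u(s,a)
\qquad \text{(score function: } \nabla_u \log \pi_u(s,a)\text{)}

\pi_u(s,a) \;\propto\; e^{\phi(s,a)^{\top} u} \qquad \text{(linear softmax policy)}
```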
So imagine we're doing Atari, and we want to know whether to go left or to go right. We'd have some features of going left and some features corresponding to going right, we'd weight each of those, and whichever scores more highly under this weighted sum gets higher probability when we actually come to pick actions. Now what we want is the gradient of this thing: how do we adjust it so as to get more of the good things? So we need the score function, and the score function is very intuitive for this softmax: the gradient of the log policy is just the feature of the action we actually took, minus the average feature over all the actions we might have taken. So it's basically saying: how much more of this feature do I have than usual? And what we'll see when we start to run these policy gradient algorithms is that if a feature occurs more than usual (that's what this thing is telling us) and it gets a good reward, then we want to adjust the policy to do more of that thing.

So that's the softmax policy. I just want to give one more example of a policy, which is the Gaussian policy. In a continuous action space, like the AIBO example, the commonest policy to use is a Gaussian policy. What do we do with a Gaussian policy? We basically parameterize the mean of the Gaussian, and then there's some randomness around that mean, some variance, which says: most of the time I'm going to take the mean action, given by, say, a linear combination of some features, but sometimes I'll deviate from that mean with some variance sigma squared. That variance could also be parameterized, you could have a variable variance, but the simplest case is a fixed one. So the action is selected according to a normal distribution with mean mu(s), where mu(s) might be a linear combination of our state features and some weights, and with some variance parameter. We get to pick the mean and then we just add some noise to make the policy stochastic. If we're using a Gaussian policy, what does the score function look like? Again it tells us how to get more of a particular action, and it's very similar to the last slide: it's the action we actually took, minus the mean action, multiplied by the feature, and then scaled by the variance. So in both of these cases the score function takes this form of "how much more than usual am I doing something", and that's quite intuitive when we start to look at these policy gradient updates.
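For reference, the two score functions just described can be written as follows (my transcription, same notation as before, with phi(s) the state features and sigma squared the variance of the Gaussian):

```latex
\text{Linear softmax: } \nabla_u \log \pi_u(s,a) \;=\; \phi(s,a) \;-\; \mathbb{E}_{\pi_u}\!\left[\phi(s,\cdot)\right]

\text{Gaussian, } a \sim \mathcal{N}\!\big(\mu(s), \sigma^2\big),\ \mu(s) = \phi(s)^{\top} u:\quad
\nabla_u \log \pi_u(s,a) \;=\; \frac{\big(a - \mu(s)\big)\,\phi(s)}{\sigma^2}
```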
Okay, so that's just to give some intuition for what these score functions look like. Now let's talk about the actual policy gradient, the analytic policy gradient, and to do this I'm going to derive it for the simplest possible case. Homework, if you like, could be to extend this to the full MDP case; we'll talk about what that looks like, but I won't derive it for you. The special case we're going to consider is one-step MDPs. What do I mean by a one-step MDP? You start in some state s, you get to take one step, you get one reward that depends on where you were and what action you took, and then the episode terminates immediately. So there's no sequence in this case; it's just one step and then you're done and you get some reward, and the goal is to pick your actions so as to maximize that reward, where the right thing to do might depend on the state and the action. We'll see in subsequent lectures that this is also called a type of contextual bandit.

So if we try to solve this, we want to find the policy gradient: we want to understand how to adjust our policy parameters so that no matter what state we're in, we pick an action that's good; we adjust our way of picking actions so as to get more reward. We start by picking our objective function, and in this one-step case all three of the objective functions we talked about become essentially the same. Our objective is just the expected reward under our policy, and we want to find the parameters that give us the most expected reward, so we want the gradient of this thing and we want to ascend that gradient. If we're thrown into any state and pick an action according to our policy, we want that, in expectation, to give us the most reward. To do that, just expand it out: there's an expectation over the state we start in, and an expectation over the actions we pick under our own policy, of the reward. That's the expected reward. Now we plug in the likelihood ratio trick. We take the gradient of this whole thing (we want the gradient that gets us more reward), and the only term which depends on u is our policy, so we can push the gradient inside, onto just that term, and expand it using the likelihood ratio trick. The gradient becomes the policy, multiplied by the gradient of the log policy, multiplied by the reward you get, and this is again an expectation. That's the whole point of using the likelihood ratio trick: we start with something which is an expectation, we take the gradient, and we recover something which is still an expectation under our policy. It's the expectation, under the start state and under the action we pick, of the gradient of the log policy multiplied by the reward; it's just the expectation of the score times the reward.

So after all that we come back to something very simple, which tells us that if we want to increase our objective function, if we want to get more reward, we just need to move in the direction determined by the score times the reward. Again, the score tells us how to adjust our policy so as to get more or less of something, and the reward tells us whether that something was good or bad. If you get a really big positive reward, you want to move more in the direction that gets more of that thing; if you get a really large negative reward, you want to move in the opposite direction to that gradient.
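Putting that one-step derivation into symbols (my transcription, with d(s) the start-state distribution and curly R the expected immediate reward):

```latex
J(u) \;=\; \mathbb{E}_{\pi_u}\!\left[ r \right]
\;=\; \sum_s d(s) \sum_a \pi_u(s,a)\, \mathcal{R}_{s,a}

\nabla_u J(u) \;=\; \sum_s d(s) \sum_a \pi_u(s,a)\, \nabla_u \log \pi_u(s,a)\, \mathcal{R}_{s,a}
\;=\; \mathbb{E}_{\pi_u}\!\left[ \nabla_u \log \pi_u(s,a)\; r \right]
```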
Okay, any questions? [silence] Yes: can I ask, this is model-free, right? And the capital R here is the reward? Okay, so, notation: this capital R is the random reward, the actual reward that you experience; this curly R is the reward function of the MDP. We've expanded this in terms of something that's described via the dynamics of the true MDP, but we've brought it back to an expectation where we only need to look at the sample reward that the agent actually experiences. So this is completely model-free. What that means is that if we want to estimate this expectation, we just take an action from a given state, we compute the gradient of the log policy, which we know because it's our own policy, and we multiply it by the reward we actually experienced. That gives us a gradient step, and we follow that gradient step. So it's model-free. Good question.

Okay, so we want to do the same thing in multi-step MDPs, not just these one-step problems, and to do that we actually just need one trick, which is to replace the instantaneous reward. We were multiplying the score by the immediate reward; for a multi-step MDP, all we need to do is replace that immediate reward with the value function, the long-term reward, and that actually turns out to be the true gradient of the policy. The policy gradient theorem tells us precisely what that means. It says that if you start with some policy, then for any of the objective functions we considered earlier (the start-state value, the average reward, the average value), the policy gradient is given by an expectation of the score function multiplied by the action-value function Q. So it basically tells you, again, how to adjust the policy so as to get more or less of a particular action, multiplied by how good that particular action is in expectation. You want to adjust the policy in the direction that does more of the good things and less of the bad things. If you were doing supervised learning, maximum likelihood learning, you'd have something that looked very, very similar, but you wouldn't have this value function term: you would just be adjusting things in the direction that makes your policy achieve more of what your teacher tells you. What we have in reinforcement learning, instead of that, is that we have to try actions ourselves and see whether they're good or bad according to our value function. Okay, so that's the policy gradient theorem.
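The theorem as just stated, in this lecture's notation (again my transcription of the formula), for any of the three objectives, with the appropriate state distribution inside the expectation:

```latex
\nabla_u J(u) \;=\; \mathbb{E}_{\pi_u}\!\left[ \nabla_u \log \pi_u(s,a)\; Q^{\pi_u}(s,a) \right]
```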
So let's consider the simplest way to actually use that to make an algorithm, and we call this Monte Carlo policy gradient, which is roughly the same as a really old, well-known algorithm called REINFORCE. The idea is to update the parameters by stochastic gradient ascent: we're going to get rid of that expectation now and do something very practical, which is to sample it. The term which appeared in the policy gradient theorem was this Q, so let's just sample Q: we estimate Q by using the return as an unbiased sample of it. We're going to be in a state, we're going to take an action, we're going to see what return we got, and we're going to use that as our estimate of Q and plug it straight into our policy gradient to give us an estimate of a direction to move. So every episode, at each step, we adjust our parameters a little bit in the direction of the score multiplied by the return, the return we actually got from that point onwards.

What does that look like as an algorithm? You start off with some arbitrary parameters, you run your episode (this is a forward-view algorithm, so you have to go all the way to the end of the episode and generate all of your rewards; at the end of that you know the return from each of your steps), and now you go over each of those steps again and adjust your parameters a little bit in the direction of the stochastic gradient, which is the score (how to adjust the parameters to get more or less of something) multiplied by the sampled return from that step. It's as simple as that; it's a very simple algorithm. This is the most straightforward approach to policy gradient: you just sample returns and adjust the policy in the direction that gets you more of those returns. And G_t here is the return: the sum of the rewards from time step t onwards until the end of the episode, the same definition of return that we've used before.

Okay, so here's a little example, using that same Monte Carlo policy gradient with a particular policy parameterization. The idea is that you start off with this puck in this world, and the puck slides along (there's very low friction in this domain), and you just get to exert a small force on it at every time step, which can slightly perturb its direction, so if you're not careful it can really overshoot its target. The target, which moves around every now and then, is what it has to try to reach: it gets a reward when it gets close to the target, and it's reset every 30 seconds. You just run trajectories and use this Monte Carlo policy gradient, updating in the direction of those returns: you measure your return, you go back over the episode, and you adjust your policy a little bit in the direction that gets you more of those returns, and then you see how this works. There are two things I want you to take away from this. The first is that you get this very nice, smooth learning curve; value-based reinforcement learning tends to be much more jagged, you sometimes get collapses and sometimes it chatters. And it does very well; it solves this problem very nicely. The second thing to take away is the scale: we're talking something like 100 million iterations to solve this problem. So it's slow. Monte Carlo policy gradient methods tend to be slow and very high variance: they're looking at the complete return of what happened, estimating the gradient from that return, and moving in the direction of that estimated gradient.
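Here's a minimal sketch of that Monte Carlo policy gradient (REINFORCE) loop, assuming a linear softmax policy over discrete actions and a toy environment interface (`env.reset()` returning the per-action feature matrix phi(s, .) and `env.step(a)` returning the next features, the reward and a done flag). The interface and step sizes are illustrative assumptions, not a specific library's API:

```python
import numpy as np

def softmax_probs(u, feats):
    """feats: array of shape (num_actions, num_features) holding phi(s, a) for each a."""
    logits = feats @ u
    e = np.exp(logits - logits.max())        # subtract the max for numerical stability
    return e / e.sum()

def reinforce_episode(env, u, alpha=0.01, gamma=1.0):
    """Run one episode, then update u in the direction score(s_t, a_t) * G_t for every step t."""
    feats = env.reset()
    score_fns, rewards = [], []
    done = False
    while not done:                           # forward pass: generate the whole episode first
        probs = softmax_probs(u, feats)
        a = np.random.choice(len(probs), p=probs)
        # score of the linear softmax: phi(s,a) minus the expected feature under the policy
        score_fns.append(feats[a] - probs @ feats)
        feats, r, done = env.step(a)
        rewards.append(r)
    G, returns = 0.0, []                      # backward pass: compute returns G_t
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for score, G_t in zip(score_fns, returns):
        u = u + alpha * score * G_t           # stochastic gradient ascent step
    return u
```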
So the rest of this class is going to be about using similar ideas but making them more efficient: how to take this very nice idea, where you get smooth learning by adjusting the parameters of your policy to solve a problem, and make it more efficient, reduce the variance, make it work better. And that brings us to the final family of algorithms, actor-critic methods.

A quick recap: what we've got so far are high-variance estimators of the gradient. We had the policy gradient and we plugged in the return, a sample of our Q values, to estimate it, but that thing is very high variance. It might be that I'm playing Atari again, and on one particular episode I get a score of 1,000 and on the next episode I get a score of zero, just from the randomness of what happens, because over the course of 10,000 steps there are many, many different random events which might occur. So this estimate is very noisy. The main idea of actor-critic methods is that instead of using the return to estimate the action-value function, we're going to estimate the action-value function explicitly, using a critic, a value function approximator. We're going to bring back the ideas from the last lecture and combine value function approximation with our policy methods. So again there's the true action-value function, we're going to estimate it with a function approximator, and then we're going to plug that in to our policy gradient as a substitute for Q pi.

What this means is that we now have two sets of parameters, and that's why these are called actor-critic methods: we've got a critic and an actor. Intuitively, the actor is the thing which is doing things in the world; it contains the policy, it's picking actions, it's actually making the decisions of what to do, and that's why it's called the actor. The critic doesn't take any decisions; it just watches what the actor does and evaluates whether that was good or bad, saying "those decisions got a score of a thousand" or "they got a score of minus a thousand". We're going to combine those two things together, and the main idea is to use an approximate policy gradient instead of the true policy gradient: we adjust the actor, we adjust the policy, in the direction which, according to the critic, will get more reward. The critic says "I think if you go in this direction you can actually do better", and then the actor moves in the direction of that gradient. The way we do that is to take our original policy gradient, the expectation of the score multiplied by the value function, which was our policy gradient theorem, but replace the true action-value function with this estimated, approximate value function, where we've got a neural network or whatever estimating it. Then, at each step, using stochastic gradient ascent, we move a little bit in the direction of the score multiplied by a sample from our own function approximator. The critic says "I think this thing is good" or "I think this thing is bad", and we move a little bit in the direction that gets more or less of the things the critic says are good or bad.
Okay, so how do we estimate the action-value function? How do we estimate Q, what do we do for the critic? Well, that part should be familiar; that's what we did last lecture. We started off with the problem of policy evaluation: we want to know, if I'm following some policy pi_u, how good is it; we're trying to estimate Q pi. We're not trying to optimize it, we're not trying to build a critic that estimates Q*; we're just trying to say, under this current behavior, how much reward will I get. And then we're going to use that to point us in a direction that will improve the policy. So we want to know how good the current policy pi_u is, for our current parameters u, and we can use everything we've learned so far: Monte Carlo policy evaluation, temporal-difference learning, TD(lambda), least-squares temporal-difference learning. Pick your favorite policy evaluation algorithm, throw it into the pot, get an estimate of your action-value function, and use that to adjust your actor.

Here's a canonical example, just to make things concrete. Let's say we're using a linear value function approximator for our critic: we estimate Q by using some features of our state and action multiplied by some critic weights v. In this class we're going to use v for the critic parameters and u for the actor parameters, just to keep those separate and distinct from what we did in the last lecture. So we estimate our critic with a linear combination of features, and we can update the critic in the usual way, using linear TD(0); and we can update the actor parameters u by gradient ascent, with the score multiplied by what the critic says. That would look something like this: this is the Q actor-critic. Every step of the algorithm now, we don't need to wait until the end of the episode; this is an online algorithm where we can perform an update at every single step, because we're using TD in our critic, so we don't need to wait until the end like we do with Monte Carlo. Every step the agent experiences, we sample a transition, we see what the reward was and where we ended up after that step, we pick an action according to our own policy, and we compute the TD error between the value before that step and the value after that step, just like we did in the previous lectures. Then we update the critic a little bit in the direction of the TD error multiplied by the features. That's linear TD(0), from last lecture; that's how to update the critic in the direction that minimizes the error between what we thought the value was and what it turned out to be after one step, which is like correcting our errors and making ourselves self-consistent. And we also adjust the actor, in the direction that gets us more of the things the critic says are good. So that's the actor-critic idea, in this canonical case, using linear TD(0).
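Here's a rough sketch of that loop: a linear softmax actor with parameters u and a linear critic Q(s,a) roughly equal to phi(s,a) . v updated by TD(0). The `env` interface (`reset`, `step`, `features`, `actions`) and the step sizes are assumptions for illustration, not something specified in the lecture:

```python
import numpy as np

def action_probs(u, env, s):
    """Linear softmax over phi(s,a) . u for each available action."""
    logits = np.array([env.features(s, b) @ u for b in env.actions])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def score(u, env, s, a):
    """Score of the linear softmax: phi(s,a) minus the expected feature under the policy."""
    probs = action_probs(u, env, s)
    expected_phi = sum(p * env.features(s, b) for p, b in zip(probs, env.actions))
    return env.features(s, a) - expected_phi

def qac_episode(env, u, v, alpha_u=0.01, alpha_v=0.05, gamma=0.99):
    """One episode of Q actor-critic: the critic moves towards the one-step TD target,
    the actor moves in the direction score(s,a) * Q(s,a), both updated at every step."""
    s = env.reset()
    a = np.random.choice(env.actions, p=action_probs(u, env, s))
    done = False
    while not done:
        s2, r, done = env.step(a)
        phi_sa = env.features(s, a)
        q_sa = phi_sa @ v
        if done:
            td_target = r
        else:
            a2 = np.random.choice(env.actions, p=action_probs(u, env, s2))
            td_target = r + gamma * (env.features(s2, a2) @ v)
        delta = td_target - q_sa                       # TD error
        v = v + alpha_v * delta * phi_sa               # critic: linear TD(0)
        u = u + alpha_u * score(u, env, s, a) * q_sa   # actor: score * Q
        if not done:
            s, a = s2, a2
    return u, v
```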
Okay, any questions so far? I'll just pause there before we start adding more tricks to reduce variance and make these algorithms more efficient. So if you're feeling uncomfortable, now is the time to say, because we're just going to layer more pieces on from here on in. Yes: so you have a policy, you get an estimate, and then from that estimate you try to get another policy, is that true? That's right, yeah. You can think of this as another form of generalized policy iteration, where we start off with a policy, we evaluate that policy using the critic, and then, instead of doing greedy policy improvement, we move a gradient step in some direction to get a better policy. So it's just like our familiar machinery, where we're making the policy better and better: at every step we're evaluating that policy, then moving a little bit up the gradient to find a better policy. My question is about choosing the first policy: does it differ? Choosing the first policy: you can pick an arbitrary policy; you initialize your policy parameters however you want. Note that now we've just got parameters, we're not using epsilon-greedy anymore; we've got some parameters, and those parameters determine our policy, and we're following the gradient so as to make those policy parameters better, so that our policy gets more and more reward out of the MDP. So we start off with some policy parameters that determine our behavior, and we want to make that behavior better and better. There's no greedy anymore, no epsilon-greedy; the policy itself determines how we move around in this environment, and we're always picking actions according to the policy determined by these parameters u. Yeah, one more question: but if your policy was epsilon-greedy with respect to your Q function, do you fall back to Q-learning? Let me go back to a slide that I kind of skipped at the beginning because the projector was down, this one. We had this in the introduction, but I think it's worth reiterating. Let's classify our algorithms by whether we represent a value function, whether we represent a policy, or whether we have both; that's what this Venn diagram shows. If we represent a value function, we call the method value-based; if we represent a policy, we call it policy-based; and if we have both of these things in our algorithm, we call it actor-critic. Epsilon-greedy is something we do when we live on the value-based side of the diagram: we don't actually explicitly represent the policy; instead we use a simple trick, which is to say that if we've got the value function, if we've got Q, we just act greedily with respect to Q and sometimes act randomly. What we're considering in this lecture is an alternative approach, which is to learn directly how to pick actions: we parameterize the policy, so instead of doing greedy or epsilon-greedy we're directly deciding how to pick those actions, and we want to figure out whether picking my actions in a particular way gets more or less reward. So we're actually in a different space here altogether, where we're directly parameterizing things; there's no greedy, no epsilon-greedy anymore. But what if you parameterized your policy to be greedy over your value function? If you parameterize it that way, that's an implicit parameterization, which I would say is what we're calling value-based here.
So if we're basically saying our policy is the thing which acts greedily with respect to my value function, that's purely value-based; we're not explicitly representing the policy. But does that algorithm still work, is there a way to recover a pure value-based method? There's a result which says that, for a particular class of policy gradient algorithms that we'll see shortly, you recover it in the limiting case where you make your step size bigger. For all of these policy gradient algorithms there's some step that you take in the direction of the gradient, and you could ask the question: what happens if you take an infinite step in the direction of that gradient? So, to get back to where we are: in all of these algorithms there's some gradient update where we update our parameters a little bit in the direction of our score, and for a certain class of policy gradient algorithms, if you make that step infinitely large, it actually takes you to a greedy step. That's certainly true for natural gradient algorithms, which we'll see later. So the step size is what controls how greedy you are. Think about the softmax policy, this thing where we were exponentiating some score: if you see that one of these actions appears to be a little bit better than the other, say you had some policy where you were taking left more often than right, and now you see something where left actually did even better than right, so you saw a good score for that action, what are you going to do? You're going to push yourself a little bit in the direction that makes left even more probable and right less probable. If you make your step size infinite, you'll push the probability of going right all the way down to zero and the probability of going left all the way up to one, which is equal to acting greedily. So there is a limiting case where you recover a kind of greediness, and the step size is what controls the smoothness; it's certainly true for some algorithms, anyway. I'll take one more question and then move on. Sorry: when you were learning a value function, you had the Bellman operator being a contraction, so you knew you were going to get to the optimal value; do you have the same thing here, because you could end up somewhere suboptimal? Okay, this is a really great question. The question was: if we're using policy gradient methods, do we still have the same guarantees that we'll find a unique global optimum, or could we get trapped in some local optimum? It turns out that, at least in the table lookup case, the following is true. In the table lookup case you can represent your value function by having one value for each state, and if you use the Bellman equation you get a contraction, so you're guaranteed to reach the global optimum. With policy-based methods, if you just follow the gradient, it's known that for the softmax you also find the global optimum in the table lookup case; so if you have a separate set of softmax parameters for each state, you reach the optimum there too. Now, for the case where you've got a more general function approximator, it's clear that if you've got something like a neural net, neither method will guarantee that you find a global optimum; you can always get trapped in some kind of local optimum.
For certain special cases in between, it's unclear, and I think it's actually an open research question. Right, okay. So we're still in this space where we've got the basic principle, this family of actor-critic methods: the actor is picking the actions (we've got some policy and we're using it to pick actions), the critic is evaluating those actions and saying whether they're good or bad, and then the actor moves in the direction suggested by the critic. That's the basic idea of all of these, but what we want to do is make them better, so we're now considering some tricks to do that. The first, and perhaps the easiest and best-known trick, is to reduce variance using what's called a baseline. The idea is to subtract some baseline function from the policy gradient, and this can actually be done in a way that doesn't change the direction of ascent. In other words, it changes the variance of our estimator (we can reduce it) without changing the expectation. Another way to say that is that we're going to subtract off a term from our policy gradient: the policy gradient was the score multiplied by the value function, and we're going to subtract off the score multiplied by some baseline function, and we choose that baseline function purely to reduce the variance, to make this thing roughly zero mean and about the right scale. You don't want to end up in situations where one moment you see a reward of a million and the next moment you see a reward of 999,000, and you're always adjusting your policy parameters upwards, when it's really just the relative difference that determines how much you should move them; it would be much better to be working with something like plus one or minus one. So this gives us a way to rescale things.

Very briefly, what we see is that if we just expand this out, the expectation is an expectation over states of the sum over actions of the gradient of our policy multiplied by the baseline. We can pull the baseline outside of the sum over actions, because it doesn't depend on the action, and we can pull the gradient outside of the sum; and then we know that our policy sums to one, because it's a probability distribution, and the gradient of one, a constant, is always zero. So this whole term that we're adding or subtracting is zero in expectation, which means it's completely legitimate to add or subtract any term of this form. What that means is that, going back to the policy gradient theorem, which told us that the direction we really want to move in is the score function multiplied by Q, this result tells us we can add or subtract anything we want from that Q, as long as it's only a function of the state and not a function of the action, so as to control the variance of the whole term. We want this expectation to be low variance, but we don't want to change the direction of ascent.
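Spelled out, that zero-expectation argument for a baseline B(s) looks like this (my transcription, with d the state distribution of whichever objective we chose):

```latex
\mathbb{E}_{\pi_u}\!\left[ \nabla_u \log \pi_u(s,a)\, B(s) \right]
= \sum_s d^{\pi_u}(s) \sum_a \nabla_u \pi_u(s,a)\, B(s)
= \sum_s d^{\pi_u}(s)\, B(s)\, \nabla_u \sum_a \pi_u(s,a)
= \sum_s d^{\pi_u}(s)\, B(s)\, \nabla_u 1 = 0
```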
So we can do that, and there's a particularly nice choice we can pick for this baseline, which is going to have some nice consequences for us: the particular baseline we're going to choose is the state value function. So we start off with our Q values, our action-value function, and we subtract off the state value function, and what we're left with is something called the advantage function. This is something which tells us how much better than usual it is to take action a: we look at the Q value of action a, which tells us how good it is to, say, go left from a particular state, compared with how good it is to be in that state in general, and that tells us how much more reward than usual we'll get if we take that particular action. Intuitively, that's just the right quantity to use in our update. We call this the advantage function. So what we're going to do now is rewrite our policy gradient theorem in the following way. This is still the policy gradient theorem, still just ground truth; we've just rewritten it by subtracting off a particular baseline function. The policy gradient can be rewritten as the expectation of the score multiplied by the advantage function. Again, let's see if we can understand this intuitively: the advantage tells us how much better than usual a particular action a is, and the score tells us how to adjust our policy so as to achieve more of that action. So whenever the advantage is positive, whenever we do better than usual with a particular action, this moves us in the direction that achieves that gain; whenever it's negative, it moves us in the opposite direction, away from that action. It's always pushing the policy parameters towards situations where you do better than usual. So that's the advantage function; again, we're just rewriting the same policy gradient theorem.

So how do we estimate this advantage function now, how do we do this in our critic? There are actually a lot of different ways to do it, and I'm just going to suggest a couple. This slide says that one way would be to learn both Q and V: our critic would learn Q, and it could also learn V using yet another set of parameters, and we would take the difference between those two things as our estimate of the advantage function. We could do that; it becomes more complex in the sense that we have more parameters, but it gives us a literal estimate of the advantage function, which we then plug back into the policy gradient theorem: the difference between our state-action value function Q and our state value function V.
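In symbols (my transcription), the advantage function and the rewritten policy gradient are:

```latex
A^{\pi_u}(s,a) \;=\; Q^{\pi_u}(s,a) \;-\; V^{\pi_u}(s)

\nabla_u J(u) \;=\; \mathbb{E}_{\pi_u}\!\left[ \nabla_u \log \pi_u(s,a)\; A^{\pi_u}(s,a) \right]
```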
There's an easier way, though, and probably a better way, and that's what this second slide is about. This is probably the most commonly used variant of the actor-critic, although it depends who you talk to, and many people use many different variants. It uses the following idea: the TD error is a sample of the advantage function. Consider the TD error, this delta pi here. If we knew the true value function V pi, then the TD error is equal to the reward plus the discounted true value at the next state, minus the true value of the state I'm in; that's just the definition of the TD error. Now take the expectation of this whole thing, conditioned on the state and action: the expected TD error is the expected reward plus the discounted value at the next state, minus the value of the state I was in. The expectation doesn't affect that last term, because there's no random variable in it; s is what we're conditioning on. And this term, the expected reward plus the state value at the next state, given s and a, is precisely the definition of Q; this is just unrolling our Bellman equation one half step. So this whole thing tells us that the expected TD error is Q minus V, which is the advantage function: the TD error is an unbiased sample of the advantage function. So if we want to move in the direction of the advantage function, all we need to do is measure the TD error and move a little bit in the direction of the TD error. In other words, if I'm in a state where I think I'm winning the game, I make a move, and I find myself in a situation where I'm losing the game, that generates a TD error; those estimates were inconsistent, and I now have to think that I was probably losing the game all along, so I end up with a strong negative TD error that tells me I need to reduce my value estimate. And now we plug in that TD error as an estimate of the advantage, to say that whatever I did on that step, whatever move generated that TD error, was probably a bad idea: I was in a situation where I thought I was winning, I took some action, and now I discover that I'm in a situation where I'm losing, so that was probably a bad move. That's the intuition: we plug in the TD error as an unbiased sample estimate of the advantage function. The nice thing about this is that we only need to estimate V in our critic; we don't need to estimate Q, we just need the state value function, and no actions come into it. So this would be another way to rewrite our policy gradient theorem; again it's just a rewriting, there's no approximation yet. But if we use a critic to estimate V, a V hat instead of the true V pi, and we compute the TD error based on our approximate value function and plug that estimate of the TD error in there, then we end up with something which is a good approximation to the policy gradient theorem, and that's a practical algorithm: we estimate V, we generate TD errors using our V, and we move in the direction of those TD errors, and we only need one set of critic parameters v to do that.
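Here is a minimal sketch of that TD actor-critic, with a single linear state-value critic V(s) roughly equal to phi(s) . v and a linear softmax actor. As before, the `env` interface (`reset`, `step`, `state_features`, `action_features`, `actions`) and the step sizes are illustrative assumptions:

```python
import numpy as np

def td_actor_critic_episode(env, u, v, alpha_u=0.01, alpha_v=0.05, gamma=0.99):
    """One episode of actor-critic using the TD error as a sample of the advantage."""
    s = env.reset()
    done = False
    while not done:
        # linear softmax policy over phi(s,a) . u
        logits = np.array([env.action_features(s, b) @ u for b in env.actions])
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = np.random.choice(env.actions, p=probs)

        s2, r, done = env.step(a)
        v_s = env.state_features(s) @ v
        v_s2 = 0.0 if done else env.state_features(s2) @ v
        delta = r + gamma * v_s2 - v_s          # TD error, an unbiased sample of the advantage

        # critic: move V(s) a little towards the one-step TD target
        v = v + alpha_v * delta * env.state_features(s)

        # actor: score function of the softmax, multiplied by the TD error
        expected_phi = sum(p * env.action_features(s, b)
                           for p, b in zip(probs, env.actions))
        score = env.action_features(s, a) - expected_phi
        u = u + alpha_u * delta * score

        s = s2
    return u, v
```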
Okay, so just a couple more ideas to throw in for this family of things you can try, and then I'll summarize and bring them all back together; I've got one slide at the end that hopefully lets you come back and say, these are the different things you can do with actor-critic. This one is really just a reminder of the previous lectures: what about eligibility traces and different time scales?

If you remember from the previous lectures, we've often reused the idea that you don't always want to go all the way to the end of the episode to estimate the return, and neither do you necessarily want to take just one step and bootstrap from the value function after that step, because that's biased. You often want to trade off bias and variance between those two extremes using something like TD(λ). Now we're going to do the same thing with actor-critic algorithms, and it turns out to be quite straightforward.

From last lecture, when updating your value function approximator you had a choice of targets: you could plug in the return, the one-step TD target, or the λ-return; you can make any of these the target for your updates. We also had the idea of eligibility traces, which gave a backward view that is equivalent to plugging in the λ-return in the forward view. Hopefully this is starting to become a little familiar; if it isn't, go back to the notes, because it's useful stuff and a key idea.

What we're going to do now is plug exactly the same idea into our actor. First of all, you can do all of this with the critic (sorry, I think some of these should say v rather than w). As I mentioned before, we can evaluate our policy by any means we like; we can plug in any algorithm we want for the policy evaluator: TD(0), Monte Carlo, TD(λ), least-squares policy evaluation, any of these are valid. So we can introduce different time scales into the critic by choosing TD(λ) with an eligibility trace. But that doesn't affect the actor yet. The question is: can we also get these different time scales into the actor? Can we get information to flow back more efficiently in the actor, so that it isn't based just on the current error but on something which bootstraps from all different time steps? The answer is yes.

Very briefly, all we're going to do is plug different things into the true policy gradient theorem: the score multiplied by the value function, or equivalently the advantage function. So far we've seen two examples. In the Monte Carlo policy gradient we plugged in the return: we used the return multiplied by the score as our estimate, and we subtracted off some baseline, so you end up with the return minus the baseline value function, multiplied by the score. When we used the TD error, we ended up with the one-step TD target, the reward plus the discounted value at the next step, minus our baseline. So these are different targets: with the Monte Carlo target we plug in the return; with the TD-error version, where we estimate the advantage function using the TD error, we plug in the TD target and subtract off the value function as a baseline.
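As a concrete illustration of that second estimator (the TD-error version, with the critic itself trained by TD(0)), here is a minimal sketch in Python, assuming linear features and a softmax policy; `env` and `phi` are hypothetical placeholders, not anything defined in the lecture:

```python
import numpy as np

# Minimal one-step (TD-error) actor-critic sketch with linear function
# approximation. `env` (with reset()/step(a) -> (s_next, r, done)) and the
# feature map `phi(s)` are illustrative stand-ins.

def softmax_probs(theta, phi_s):
    prefs = theta @ phi_s                      # one preference per action
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def score(theta, phi_s, a):
    """grad_theta log pi_theta(a | s) for the softmax-linear policy."""
    probs = softmax_probs(theta, phi_s)
    g = -np.outer(probs, phi_s)
    g[a] += phi_s
    return g

def one_step_actor_critic(env, phi, theta, v, alpha_theta, alpha_v, gamma):
    """theta: (n_actions, d) actor parameters; v: (d,) critic parameters."""
    s, done = env.reset(), False
    while not done:
        phi_s = phi(s)
        a = np.random.choice(theta.shape[0], p=softmax_probs(theta, phi_s))
        s_next, r, done = env.step(a)
        v_next = 0.0 if done else v @ phi(s_next)
        delta = r + gamma * v_next - v @ phi_s             # TD error ~ advantage sample
        v += alpha_v * delta * phi_s                       # critic: TD(0) update
        theta += alpha_theta * delta * score(theta, phi_s, a)  # actor update
        s = s_next
    return theta, v
```

Note that the actor only ever sees the scalar TD error, and the critic only ever estimates V, which is exactly the point made above.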
So these are different (sorry, not equivalent) ways to estimate our original policy gradient theorem; they're different approximations. Sometimes it's a good idea to use the first one, which is a high-variance, low-bias estimator of the policy gradient, and sometimes it's a good idea to use the second, where we introduce bias by bootstrapping from our value function but dramatically reduce the variance. And again, the goal is to get the best of both worlds: we want a gradient estimate which bootstraps from our critic at all kinds of different time steps. We don't want to rely only on the critic at this time step to estimate the gradient. The critic at this time step says I should go this way, the critic at the next time step says I should go that way, and what we really want is to combine these together and move in some combined direction.

That's the idea of using eligibility traces, and it's exactly analogous to how we do it in value-based reinforcement learning. We compute our TD error and we build up an eligibility trace; the eligibility trace is now an eligibility over our scores. We remember all of the scores we've seen recently, so if I had a high score for this particular action in this particular state, we bump up the eligibility trace, and later, when we come to make an update, we move in the direction where the scores have been largest, most recently and most frequently.

The best way to understand this is by analogy with eligibility traces in the value-based case, where you can literally look at the updates and see that they're identical apart from one substitution. If we were doing value-based reinforcement learning, we would take, say, our λ-return minus our current value estimate and multiply it directly by the gradient of the value function, not by the score. So all we need to do is a search-and-replace: wherever we had that gradient before, replace it with the score, and that gives us a legitimate eligibility trace for the actor.

So, to summarize this idea: we're building up our toolkit of ideas that help make actor-critic effective in practice, and we now have a new piece, eligibility traces, which lets the actor make use of critics from many different steps, all the way into the future. The nice thing about this approach is that, unlike Monte Carlo policy gradient, we can apply it online: we can use critics that accumulate information from far into the future, but we can do it online, and we can apply it to incomplete sequences, to continuing environments that never stop, or to situations where we're working off-policy.
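Here is a sketch of that backward view for the actor, reusing the `softmax_probs`/`score` helpers and the placeholder `env`/`phi` from the snippet above (so it's a sketch under the same assumptions, not the lecture's exact pseudocode). The only change is that both the actor and the critic now move along the TD error times a decayed trace rather than the current score or feature vector:

```python
import numpy as np

def actor_critic_traces(env, phi, theta, v, alpha_theta, alpha_v, gamma, lam):
    """Backward-view actor-critic with eligibility traces; lam = 0 recovers
    the one-step algorithm above."""
    e_theta = np.zeros_like(theta)     # actor trace: decayed sum of scores
    e_v = np.zeros_like(v)             # critic trace: decayed sum of features
    s, done = env.reset(), False
    while not done:
        phi_s = phi(s)
        a = np.random.choice(theta.shape[0], p=softmax_probs(theta, phi_s))
        s_next, r, done = env.step(a)
        v_next = 0.0 if done else v @ phi(s_next)
        delta = r + gamma * v_next - v @ phi_s         # same TD error as before
        e_v = gamma * lam * e_v + phi_s                # accumulate grad of V
        e_theta = gamma * lam * e_theta + score(theta, phi_s, a)  # accumulate scores
        v += alpha_v * delta * e_v                     # critic moves along delta * trace
        theta += alpha_theta * delta * e_theta         # actor moves along delta * trace
        s = s_next
    return theta, v
```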
Okay, so I think what I'm going to do is talk very briefly about this. There's a really important question in actor-critic algorithms; in some sense it should be the first question you ask about them. We've used this critic, and we just said: replace the true value with some approximation to it, plug that in, and hope that the gradient we follow is still correct. In other words, we started off by deriving the policy gradient theorem, we said this is the true gradient to follow if you really want to improve your policy, and then we substituted in a critic and said you should follow what the critic says instead of the true value function. But how do we know that's valid? How do we know this really pushes us in the correct gradient direction and we're not following some bogus gradient?

The answer is that, amazingly, if you choose your value function approximator carefully, it's possible to pick one that doesn't introduce any bias at all. In other words, despite the fact that we don't have the true value function and we're approximating it, we can still be guaranteed to follow the true gradient with our policy updates. That approach is called compatible function approximation. I won't go into the details, but the main idea is that the features we use for the critic are themselves the score function: we build features for our critic which are the score of our policy, and if we use that particular type of feature, with a linear combination of those features, then, according to the theory on this slide and the following one (which I'll leave you to look at afterwards), we can guarantee that we don't affect the gradient direction; we still follow the true gradient if we use these compatible features.

Okay, I know I'm throwing a lot of ideas at you and there's a big soup of possible things to follow, but I want to throw in one last idea. It's a recent one, and I think it's a useful idea to know about; it's just this one slide. So far we've considered stochastic policies, policies like a Gaussian policy, where we say: I want the mean of my policy to be this thing, and sometimes I'll take actions close to the mean, sometimes perturbed a little by noise. But this can be very, very noisy. So far we've been estimating our policy gradients by sampling our own noise: we've got a noisy policy and we want to take an expectation over that noise, and it turns out that taking expectations of the gradient of our own noise can be a really bad idea; it's really, really hard to estimate. In fact, if you start off with a Gaussian policy and improve it over time, you'd hope it would start to narrow down as you figure out how to solve the problem; but as that Gaussian becomes narrower and narrower, the variance of your gradient estimates actually starts to blow up to infinity. Estimating the policy gradient becomes harder and harder; the noise hurts you more and more as you get better and better at finding the right policy, and that's an unfortunate property of the policy gradient algorithms we've seen so far.
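To make the blow-up concrete (an illustrative calculation, not something worked through on the slides), take a one-dimensional Gaussian policy:

$$\pi_\theta(a \mid s) = \mathcal{N}\!\left(a;\ \mu_\theta(s),\ \sigma^2\right), \qquad \nabla_\theta \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma^2}\,\nabla_\theta \mu_\theta(s).$$

Writing the sampled action as $a = \mu_\theta(s) + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0,1)$, the score becomes $(\varepsilon/\sigma)\,\nabla_\theta \mu_\theta(s)$, so the score-times-return estimator scales like $1/\sigma$ and its variance like $1/\sigma^2$: as the policy narrows towards a deterministic one ($\sigma \to 0$), the variance of the stochastic policy gradient estimate grows without bound.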
It turns out that there's an alternative, and this is something we did last year, which is to work directly with the limiting case. Instead of adding noise into our policy and then trying to narrow that noise down towards something roughly deterministic, what if we just start off with a deterministic policy? So we're going to start with some deterministic policy and adjust its parameters so as to get more objective, following the same objective functions as before. And it turns out that if you take the limiting case of the policy gradient theorem we've seen so far, you end up with a very, very simple update. It has a slightly different form to what we've seen, but you should think of it as just another rewrite: it's exactly equivalent, just the limiting case where we consider a deterministic function. We've narrowed the noise down to zero, we always pick the mean, and we never add any noise in.

And it has this particularly nice form. All we need to do to follow the gradient, to adjust the parameters of this policy (that should have a u there as well) so as to get more objective, is to take the gradient of our own Q function. We look at our critic, and the critic gives us the gradient that says: this is the way to get better actions. (This only works for continuous action spaces, by the way.) It says: here's the gradient of your value function with respect to your actions; if you were to make this action a little bit higher you would get more reward, and if you were to make this other action a little bit lower you'd get less reward. So the critic already contains all the information about how to adjust the actions so as to get more or less reward. That's all there in the gradient of Q: just move left a little bit more and you'll get a lot more reward. Then all you need to do is work out how to adjust your parameters so as to produce more of the actions the critic is suggesting, and that's the gradient of the policy here. It's just the chain rule: adjust the policy in the direction that gets you more Q. And that's the deterministic policy gradient theorem. It's very simple and intuitive, and in practice it seems to work a lot better than stochastic policy gradient in situations where we've got continuous action spaces; it scales up much better to high dimensions.
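In symbols, this is the statement from the deterministic policy gradient paper (Silver et al., 2014), with $a = \mu_\theta(s)$ the deterministic policy and $\rho^\mu$ its state distribution:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[\,\nabla_\theta \mu_\theta(s)\;\nabla_a Q^{\mu}(s,a)\,\big|_{a = \mu_\theta(s)}\right],$$

i.e. the chain rule through the critic: how the actions move as you adjust the policy parameters, times how the value moves as you adjust the actions.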
Okay, I'm going to skip the natural actor-critic. There's one more approach in the notes, the natural actor-critic, which is an alternative descent (or ascent) direction with some nice properties, and it turns out to fall out very nicely in the actor-critic framework. It's there in the notes; feel free to refer to it.

But I wanted to put up this summary slide before we finish. This is a summary of all the different manipulations we've done to our different actor-critic algorithms. We started off with our policy gradient theorem, and what we did was plug in different approximations to it. We said the gradient of our policy can be obtained by taking the score function and multiplying it by the return; we can multiply it by the value function; we can multiply it by the advantage function (that should be a D there); we can multiply it by the TD error, which was another way to estimate the advantage function, where we take a sample of the advantage using the TD error; we can use an eligibility trace, where instead of just using the current score we accumulate an eligibility over all of our scores over many time steps and use that trace in its place; and we can use the deterministic actor-critic, where we move in a direction that gets us more Q. All of these are essentially different variants of the same idea; they're different manipulations of trying to say: move the policy in the direction that gets you more value. The way we estimate values in the critic is what varies here, and the way we make use of that, how we update our policy, is what varies between here, here and here. So the critic can vary and the actor can vary, but essentially they're all doing roughly the same thing.

For each of these cases we can make a stochastic gradient ascent algorithm, where essentially all we do is drop the expectation by sampling the thing inside it. In each case you end up with a very straightforward, simple algorithm: you take a step (or an episode, if you're doing the top one), you look at what happened in that one step only, and you have the gradient of the score for that step. We took an action from some state, and we know, for that state, how to have adjusted our policy so as to have taken that action more often. We look at whether that step was good or bad according to our critic, and if it was good we push our policy parameters to do that thing more often. So all of these have very simple stochastic updates. And for the critic we just use our favourite algorithm, Monte Carlo, TD, TD(λ), the whole family of approaches that we've considered in previous lectures.

Okay, that's it. Any last questions in the last 30 seconds? Okay, thanks everyone, see you next week.