Reinforcement Learning 2 - Exploration and Exploitation

Using Deep Neural Networks to Learn Preferences

A deep neural network can be used to learn action preferences: the input is the raw observation (for example, pixels), which passes through several layers, and the output is a vector with one entry per action. The semantics of this vector are the preferences themselves; passing it through a softmax turns the preferences into action probabilities that can be used to make decisions.

The key idea is that the gradient flows all the way back into the network, because the policy is parameterized through these preferences. We can therefore update all of the network's parameters with backpropagation, gradually shifting the preferences towards actions that yield higher rewards. This is not just a theoretical possibility; it is a practical approach that is used extensively.
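
To make this concrete, here is a minimal sketch of the idea, assuming a single linear layer standing in for a deep network, NumPy, and made-up values for the observation, reward, and step size; it is not the lecture's code, just the softmax-preference update in executable form.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 8, 4
W = np.zeros((n_actions, n_features))    # policy parameters: preferences = W @ x
alpha = 0.1                              # step size (hypothetical value)

def policy(x, W):
    prefs = W @ x                        # one preference (logit) per action
    prefs = prefs - prefs.max()          # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()           # softmax: preferences -> probabilities

# one illustrative interaction with a fake observation and reward
x = rng.normal(size=n_features)
probs = policy(x, W)
a = rng.choice(n_actions, p=probs)
r = 1.0                                  # pretend the environment returned +1

# score-function update: grad of log pi(a|x) w.r.t. W is (one_hot(a) - pi) x^T
grad_log_pi = -np.outer(probs, x)
grad_log_pi[a] += x
W = W + alpha * r * grad_log_pi          # move preferences towards higher reward
```

Replacing the linear map with a deeper network only changes how the gradient of log pi is computed (by backpropagation); the update rule itself stays the same.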

One way to deal with continuous action spaces is to define a probability distribution over actions, for example a Gaussian whose mean is output by the network. This is still a stochastic, parametric policy. The gradients look different from those of the softmax-preference version, but as long as we have a well-defined log pi for each action, the same trick applies.
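
A minimal sketch of the same trick for a single continuous action, assuming a Gaussian policy with a learnable mean and a fixed standard deviation, and a made-up reward function:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                     # policy parameters (sigma kept fixed here)
alpha = 0.05                             # step size (hypothetical value)

for _ in range(1000):
    a = rng.normal(mu, sigma)            # sample a continuous action
    r = -(a - 2.0) ** 2                  # made-up reward: best action is a = 2
    grad_log_pi = (a - mu) / sigma**2    # d/dmu of log N(a; mu, sigma^2)
    mu += alpha * r * grad_log_pi        # same score-function update as before

print(mu)                                # typically ends up near 2
```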

Another approach to continuous action spaces is to use a deterministic policy that outputs a single continuous action. I won't explain exactly how this works here, but there are algorithms that update such policies with gradients as well.

Discretizing Continuous Actions

One of the challenges when working with continuous actions is how to handle the action space. One option is to discretize it and then apply the softmax-preference technique described above, which can work well. This doesn't make sense in all cases, however: often we want to keep the actions continuous so that the policy generalizes across the action space.
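
As a small illustration, discretizing a one-dimensional action range so the softmax-preference machinery can be reused might look like this (the bounds and number of bins are arbitrary choices):

```python
import numpy as np

low, high, n_bins = -2.0, 2.0, 9                    # arbitrary bounds and resolution
candidates = np.linspace(low, high, n_bins)         # discretized action set

prefs = np.zeros(n_bins)                            # learned preferences, one per bin
probs = np.exp(prefs - prefs.max())
probs = probs / probs.sum()                         # softmax over the bins

idx = np.random.default_rng(0).choice(n_bins, p=probs)
action = candidates[idx]                            # continuous value to execute
```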

The importance of baselines

When working with these updates, it's often useful to compare the reward against a baseline. The baseline can be any term that is independent of the action taken (and of the policy parameters). Because the probabilities sum to one (or the density integrates to one for continuous actions), the baseline term can be pulled out of the expectation, and the gradient of a constant is zero, so subtracting the baseline does not change the expected gradient of the objective.
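
In symbols, a short version of this argument (the notation pi_theta for the parametric policy and b for the baseline is introduced here for illustration; replace the sum by an integral for continuous actions):

```latex
\mathbb{E}\!\left[ b \, \nabla_\theta \log \pi_\theta(A) \right]
  = b \sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)
  = b \sum_a \nabla_\theta \pi_\theta(a)
  = b \, \nabla_\theta \sum_a \pi_\theta(a)
  = b \, \nabla_\theta 1
  = 0
```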

The purpose of the baseline is not to change the expected update, but to reduce its variance. With a baseline, the preference of an action is pushed up when its reward is better than expected and pushed down when it is worse, where "expected" is defined by the baseline (for example, the average reward seen so far). The baseline does not have to be accurate; it just has to reduce the variance of the updates.
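
A minimal sketch of this in the bandit setting, in the spirit of the gradient-bandit algorithm of Chapter two, using the running average of all rewards as the baseline; the three-armed testbed below is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = np.array([0.2, 0.5, -0.1])       # unknown to the learner
H = np.zeros(3)                                # preferences, one per action
alpha = 0.1                                    # step size (hypothetical value)
baseline, t = 0.0, 0

for _ in range(2000):
    pi = np.exp(H - H.max())
    pi = pi / pi.sum()                         # softmax over preferences
    a = rng.choice(3, p=pi)
    r = rng.normal(true_values[a], 1.0)        # noisy reward

    t += 1
    baseline += (r - baseline) / t             # running average of all rewards

    one_hot = np.zeros(3)
    one_hot[a] = 1.0
    H += alpha * (r - baseline) * (one_hot - pi)   # grad log pi(a) = one_hot - pi

print(pi)   # probability mass typically concentrates on the best arm (index 1)
```

Using the average reward as the baseline is a common default; any quantity that does not depend on the action would leave the expected update unchanged.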

The effect of baselines

A figure in Chapter two (of Sutton and Barto) shows the effect of using a baseline on performance: for that specific setup, the variant with a baseline clearly outperforms the one without. The details of the setup are given in Chapter two and are not repeated here.

"WEBVTTKind: captionsLanguage: enlet's get started welcome to the second lecture on the reinforcement learning this one's about exploration and exploitation and I'll go into depth about what I mean when I say those words I said something about that last lecture in terms of logistics just to make you aware just because of scheduling reasons we'll have a couple of reinforcement learning lectures now in fairly close proximity so next Tuesday there will be another during first learning lecture and then Thursday again so we'll have a little bit of a sprint you could say on the reinforced learning side of this discourse and then there will be a couple of deep learning lectures in a row so I'm not sure whether that's more help for or less helpful for you to grasp the material maybe it actually works quite well but I just wanted to make you aware also the so somebody noted last time that the deep learning assignment wasn't out yet so this was indeed a mistake on our on our end so we also extended the deadline for that and put it out reinforcement learning assignment also first reinforcement learning this time will come out in the weekend alright the end of the weekend basically before Monday so just to be aware of that and it'll mostly be about the contents of this lecture so if you want to pay extra attention because of the assignments maybe this is the one you want to pay attention to okay the background material which is useful to read for this lecture is certain umberto chapter two although I will be covering some stuff that is not in that chapter most notably on the Bayesian methods as you'll see later on in this this lecture but a lot of material he gives us a lot more background I mean being a textbook there's a lot more space to go into depth in some of these things so I highly recommend reading that chapter fairly soon maybe after this lecture and you'll see where we potentially disagree just to recap what we talked about last lecture this is the generic reinforcement learning setup there is an agent who observes the environment of the agents which could be the world at large say or it could be some small problem that you're trying to solve maybe a small mark of decision problem but in any case this these environments accepts actions from the agent the agent acts in these environments and environments in some sense responds by sending out a new observation or as I said before you could also interpret this as the agent basically putting this observation in if you prefer but in any case you get this interaction loop and then the idea is to optimize this interaction loop so in addition there will be a reward signal and then reinforcement learning is the science of how to learn to make decisions when you want to optimize this reward signal or maybe more generally when you want to learn about the world when you want when you want to learn to say predict future observations even if you're not optimizing anything and we talked about what could an agent then learn an agent could learn a policy a value function and or a model in fact will bump into each of these again in this lecture and the general problem involves taking into account time and consequences so especially this last bit is important because your actions might not just change your immediate reward but they might also change in some sense the world which means that then later decisions are affected because something has changed now it could be as simple as you moving somewhere means that the next state you are there rather than here 
which is a simple change in the environment it could also be that you change something structurally in the environment so that if later you come back to a certain situation it is now different than it was before so these decisions in general can affect the reward the agent state the internal state of the agents which includes for instance its memory and the state of the environment however in this lecture we're going to simplify things because in the simpler setting we can already meaningfully talk about exploration and exploitation how to make these decisions so what we're going to do is we're going to take away the sequential structure which means that past actions will now no longer influence the future states of the environment actually you'll have the opportunity to interact with a problem again and again without changing the problem formally what we'll have is well have a distribution of rewards where this is version will be identical for a given action across cross time so you'll basically be able to query you you can output an action you'll get a reward back you'll be able to query the system again and again and again without changing that reward distribution that's different from the general case and it's a definitely a simplification but it turns out that the problem that results from this is still rich enough to talk about many things and in fact the simplification also makes it possible to talk about some things that are harder to do in the fool case for instance we could talk about how to optimally solve the decision-making problem so this is a simple example of that just consider there's a rat and this rat has access to two levers a black one and a white one I realize now I was a little bit ambiguous in in this lecture when I say a black lever I think I always mean the one with the black background rather than the lever itself being black I'll get back to that whenever I actually call them a certain color just to make sure that we're we're on the same page so a Monday this rat pulls the black lever and gets a shock okay so maybe that's a bad leaf or maybe you just don't want to do that on Tuesday pulls the other lever and let's just assume there's just these two so Tuesday pulls two white leaves the one with a white background and some some latch opens and a piece of cheese falls out so the rats happy so then on Wednesday again the rat is given this choice you can pool either of these levers so what are you going to do does anybody have a suggestion for what the red should do yes I would also pull the white one if I were the rat but let's assume that the rat does pull the white one but now gets a shock so the question is okay what should you do next now maybe in this case you have a strong intuition for what your ass should do maybe some of you even disagree on what it what the rat should do maybe we can have a very quick poll could you raise your hand if you would pull the belief with the black background okay there's a small minority of course some people might just not raise their hand on either of these choices okay we'll get back to this example in depth later just keep it in mind so the trade-off between exploration and exploitation is essentially the trade-off between maximizing performance this is what we call exploitation and increasing your knowledge about the problem which we call exploration so this is a fundamental problem in online decision-making because you're actively collecting your data as I talked about in the last lecture this is different from getting a 
data set which you don't you have no means of changing anymore now we do we basically sample this is very related to research fields called active learning if you know about that where you're also in charge of collecting the samples but that's basically all I'm going to say about that these are very just be aware that these are very related if you happen to know anything about active learning if you don't that's perfectly fine it also means that the best long-term strategy may involve short-term sacrifices sometimes you want to try out try something that you think is not such a good idea you might not be completely sure you think it's probably a bad idea but you're going to try it anyway just to see or the alternative version of this which is similar but maybe feels a little bit different sometimes you might not pick the thing you think is optimal just to try something else even if you don't necessarily think that something else is bad so the main difference here is whether you think that the thing is bad that you're trying or not but in reinforcement learning is basically just means whether your rewards are high or low doesn't really matter you're going to pick something which has the highest reward you think right now or something that has a little bit of a lower reward and sometimes it makes sense to pick something that has a little bit of a lower reward at least according to your current estimates simply because you're not sure and it might be that you're wrong and therefore you want to learn more about the world so that means we want to gather enough information to in the end make the best overall decisions and we'll formalize what that means later so the setting to start with that is often called the multi-armed bandit setting and this is in an analogy to one-armed bandit if you happen to know that phrase if you don't it means it's basically a different way to talk about slot machines in a casino you have a slot machine there's a lever that you pull and then you either get some money or you don't this is sometimes called a one-eyed bat one-armed bandit because in the end in the long run it steals your money and then the multi-armed bandit you can basically think of has a long long row of these slot machines whereas each decision are corresponds to one of these machines and they may have different payoffs and in addition just to maybe be aware that we don't take the analogy too far in this case some of these slot machines might actually be a good idea so you don't necessarily lose money by playing playing these it's just an analogy but it's one that's very sticky so multi-armed bandit is a very common phrase for these problems and for me we just have a set of known actions it's just a discrete set at first there are exchanges in the literature two continuous sets where this is the real value but we are just going to consider consider a discrete set so there's just a limited set of these things you can you can consider a limited set of actions and whenever you try one of these actions and you can only try one on each time step you'll get a certain reward the reward here is assumed to be random so there's a distribution and what we're going to assume about this distribution in this case is that it's fixed but unknown so the fixed property of this distribution is basically what I meant when I said before we're going to get rid of sequential structure the distribution of the rewards will not be affected by your past action it's only affected by your current action so whatever you did in 
past doesn't matter if you're going to pick this action you'll get a certain reward with a certain probability and now we're going to formalize the goal as maximizing the cumulative reward but there is something that's importantly different from the previous lecture because in the previous lecture we talked about optimizing cumulative reward into the future but right now we're actually talking about not just within an episode but across the agents lifetime we want to accumulate all of the rewards and we basically want to maximize that cumulative sum what that means is that we're basically not going to give too much allowance for learning essentially we want the learning to be in some sense optimal that we don't want to lose too much in the beginning while we're learning to do better at the end in fact we're going to just fold all of these things into one and we're going to say you want to basically do as best as you can across your lifetime including the part where you're learning in fact we might assume that you'll learn the whole time you never really stop learning so this is a little bit different from your typical train tests split where you allow a train time to do anything basically you don't care what your losses at train time but a test time that's what it matters this is a little bit different where basically you're tested the whole time during training this is the common way to analyze these bandits settings and therefore we're going to stick to that here you could of course also imagine a different version where you have some a lot of time in which you can train and then you're tested but we're not going to consider that in this lecture so for those of you who are familiar with the game theory terminology sometimes this is called a game against nature because you're basically playing against the environment in some sense except the environment is indifferent so it'll just return the same distribution won't try to fool you or take advantage of you okay so this is the setting and now we can talk about as we did before about what the values of a certain action so in this case this is simply something that is dependent on the action there's no state you're always basically in the same state that's that's another way to think about it and then the true value of an action is just defined as the expected reward for taking that action now of course we can estimate this simply by taking the average whenever you pick an action you just average that reward in and then this will give you an estimate which is in some sense a pretty good one for the expected reward so I wrote that down down there using indicator functions the summations are over all of time that you've seen so far but you will not have taken this same action and every time step so there's an indicator function there that basically says I'm only counting the steps on which I selected this specific action it's just a little bit of notation and then we're dividing by the total number of times you selected that action and we're only counting the rewards for that action so it's a simple average but the notation maybe is a little bit verbose we could also of course compute this online basically storing the previous average and then just moving whenever you see new rewards you just move your average a little bit towards the new reward and if you do this with the step size down there where n now is the number of times you selected that action this will give you exactly the same result as before it will be the exact average of the 
rewards you've seen for this action the values the estimates this capital Q here subscript by T is no longer the true value it's the is your estimate at time T just to be aware so the true value is this small Q which has no subscript of T and then the big Q with a subscript T is our estimate now this formulation I I wanted to point out also because we'll see it much more often and it's a common way to write down more general update functions and one simple generalization that you can really consider here is you should you could consider a different step size for instance you could consider a constant step size in which case you won't have a flat average of all the things you've seen so far but the average will have it will be a weighted average with more weights to the recent things and this can be useful for instance if the assumption that your rewards distribution is fixed is not true if it slowly changes over time it might be much better to track rather than to get the flat average also this turns out to be much more useful when you do function approximation because then you can't really average things you have to just make small steps towards things as in typical when you do something with deep neural networks we typically have a step size there but we don't typically average we just have a step size that moves you a little bit towards whatever the answer is right now so again this is just a little bit of notation that will bump into more often in and then we can apply that for instance to this example so let's formalize the problem here by saying cheese is Aurora plus 1 getting a shock as a reward of minus 1 and then optimizing the cumulative reward basically means you want to get cheese as often as you can you don't want to get shocked and in this case then the action values after the experience is there on the left would be that the value of the white leaver with which I mean the leaver with the white background sorry for you and big ambiguity would be zero here because we had a plus 1 once and we had a minus 1 once and for the leaf with a black background would be minus 1 because we've only ever had a 1 minus 1 in this case so we're going to make it a little bit worse so it made sense so far to select the leaf with a white background again because the only one that ever gave cheese but what happens if you select it over and over again and it turns out it shocks you like every time so now that the the rat has pulled it four times in a row and each time it got the minus one what should the rat do what should it do the next time step anybody have a suggestion switch so it makes sense it makes sense at some point to switch however if you just look at the estimated action values the estimate for the belief with the black background is still at minus one because the only data you've ever seen for that leaver is a minus one so if you just look at your estimates and you would use a greedy algorithm it would continue to select that leave with the white background so that obviously feels a little bit wrong at some points we have an intuition that you should switch again but how do you formalize that why should you switch and that's what we'll talk about basically the rest of the lecture and can we also devise an algorithm that optimally switches between these things so how can we reason about this trade-off between the exploration and the exploitation it seems natural to someone take into account ID estimates can be uncertain in the example before the leaf with a black background had an 
estimated value of minus one which is more or less the best estimate we could get from the data we've seen for it because we've only ever seen a reward of minus one but we also only have ever seen one sample that so we must be a little bit uncertain about that value in some sense can we take that into accounts and can we reason about that formally and maybe can we even trade off the exploration and the exploitation optimally so in order to reason about this I'm going to introduce a little bit of extra terminology also because this is very common in the literature when talking about bandits and to do that first we just define the optimal value in this case V Star is not a function of anything normally it would be a function of state but there's only one state so it's just a number which is the value the actual maximum through action value so if you would know the true action values you would just pick the maximum of those that's your optimal value for the problem it's the highest value you can you can get on average highest expected reward now we're going to introduce a new term which is called regret which is the opportunity loss for each step so Q of a T here that's the actual value Q so this is the true value of the action you selected a T so if this is the optimal action this thing will be 0 if you have selected an action that is not actually optimal this thing will always be positive it cannot be negative because of the definition of the optimal value so it's given intuitive example in hindsight you might regret taking a to present in cycling say because there were many delays but maybe you would have regretted taking the bus even more because maybe it was completely writ locked but you might only know this after the fact so you might sample and maybe each day you could interpret as an independent sample of this so maybe sometimes you try one sometimes you try the other it's a little bit noisy you don't exactly know which one but over time you learn actually cycling gets me there the fastest basically on average even it's not always exactly the same maybe the other ones are much more noisy as well though an obvious problem with this formulation or it's not really a problem it's just a property is that the agent cannot observe or even sample the real regret but so why are we introducing it then it turns out to be useful to analyze learning algorithms we could look at an algorithm that basically trades off the exploration and exploitation in a certain way and then we can reason about what is the regret that is algorithm incurs and then we can reason about how does this regret grow over time so the goal now becomes the trade of exploration and exploitation by minimizing the total regret now note that exactly the same goal as I specified before which was to maximize the cumulative reward across the agents lifetime but the benefit of this is that we are able to talk about this thing we know that the optimal value of it would be zero we also know that zero is not actually attainable because that would mean you know the optimal action and you never select a different one but we know that there's like a local solution here which is zero which you wouldn't know for the maximum Q cumulative rewards you don't know what the optimal solution is there and we also know we want to minimize this thing so we the bigger it grows the faster it grows the worse it is and turns out that's useful to reason about again note I'll get to that again note that the sum extends beyond sing beyond episodes so it 
covers basically the whole lifetime of the age and it's about the whole learning process of the agent so it factors in both the learning and the exploitation essentially so you want to do good while you're learning essentially yeah yes so the assumption here is not that you can use this in your algorithm directly the assumptions just that we use this in the analysis so the agent never knows the optimal value thank you yes but the agent doesn't we we know this when we analyze the algorithm but the agent has no idea what the what your point value it's just it's just a convenience thing I'll get to this but I'll I'll answer right now the consider these two cases so in one case all of your rewards are zero or positive which means that your total cumulative reward will basically grow unbounded but you don't know where the other case is where maybe the maximum reward is zero and the other ones are negative which means that at some point maybe your total reward will stop growing and will become stationary because you've done all the bad things you've ever tried and now you know to select this actually that actually has a zero reward so these things these functions these sums of total rewards they might grow indefinitely or at some point they might saturate and stop growing as much but for the regrets turns out we can actually talk about algorithms in the sense of the regret growing linearly with time or and this turns out to be the better case they grow sup linearly over time which means that they will they will still grow because you'll still continue to select actions that you that might not be optimal to explore but they will grow much much slower than if you would just pick the wrong action every time and this distinction between linear and sub linear is harder to make in the cumulative your worst case than it is for a regressed case so it's just for the analysis to be able to talk about the properties of these things but they're kind of equivalents right we're just we've just introduced a little bit of terminology to make it more convenient to talk about later yes yes so on a related note algorithms I'll talk about later in this lecture would for instance keep into account a complete uncertainty about what you believe the expected reward could be for the agent and you can reason about how this evolves over time I'll get back to that so if your question remains after I talk about it please ask it again because it's a good question but hopefully we'll answer again always feel free to stop me whenever you want if the anything is unclear it's probably unclear from more people than just yourself so I was probably unclear if something isn't clear so please help me clarify okay so the regrets can imprints will grow unbounded but it's more interesting how fast it grows as I just said and to give an example the greedy policy has linear regret this means that an expectation the regret grows as a function that is linear in the number of steps that you've taken to make it more concrete consider the example again suppose that the actual probability of getting a cheese when you pick the leaf with the white background is 0.1 so there's only a 10% chance that you get the cheese the rat just happens to have a lucky episode essentially on the second time step over there and also let's assume that the probability of getting the cheese when you pull the leave with the background was actually point nine so the rat was particularly unlucky there on the first time step this could happen all right now the optimal 
value will be the expected reward when you pull the belief with a black background which in this case is 0.8 its point eight because it's plus one with probability 90 percents minus 1 with probability of 10% so in total the expected reward will be 0.8 and because I completely in the probabilities for the leaf with the backpack a white background that one actually has a true value of minus 0.8 so in this case the rack was very unlucky with its first two episodes and the the value estimates are essentially quite wrong now the value estimate for tea leave with a white background will continue to become better and better as it observes it more and more but as I noted before it will neck the rat will actually never go back to pulling the leaf with the black background because we've estimated that now as minus 1 and because we've seen the cheese at least once for the leave with the white background the estimated reward were there will neck can never actually reach minus 1 so if you're greedy you'll just continue to select this leave with the black white black brown background even though in this case it's actually two suboptimal thing to do so the regret that this rat incurs turns out to be 0.1 0.1 point 6 times T which is the difference between the optimal thing and the thing you're actually selecting which agreed gross linear as a function of the time now this specific value is conditioned on these first two episodes being exactly what they are so the rat is conditioned on the rat being a little bit unlucky in these first two episodes in general you can reason about this in expectation so there is a nonzero probability that this happens so there's a nomes your probability that your regret revolves linearly with the time in some cases the rat will be lucky and the greedy action will actually lock into the actual optimal action so in some cases the regret won't grow but because it sometimes can grow with a nonzero probability there are D X the expected total regrets still grows linearly is that clear okay now we can talk about the regrets yet a little bit of different way which turns out to be useful by looking at what is the like what is the action regret and the action regrets is defined as just what is your regrets so the expected regrets when you take that action and will introduce a Delta notation for this so this Delta a is basically it's the action regret or the gap between the true value of that action and the optimal value this Delta is 0 for the Optima action and it will be positive for any action that is not optimal now we could using this notation we can rewrite the total regret over time so note that the first sum there on the left is overall time steps so we basically consider all of the actions that you've ever selected and we look at the true value of that action and we compare it to the optimal value and we just add to the regret whenever you select the action we add that the corresponding gap and we can rewrite this as a sum of actions by taking into account how often you selected an action this is quite clearly true you just take each action you get a number of times it has been selected up this time T and you just count the gaps that way so this second sum is over typically much smaller set it's just all three actions rather than over time but these are completely equivalent and then we can just write it down just using the notation defined up there this is just basically plugging plugging in the definition of what a gap is we can basically say our total regret can now be 
written as a sum over actions where we basically just say the number of times you select an action times what the regret will be for that action that's probably quite intuitive but it's good to be sure you're you're following but this allows us to reason about it intuitively as well so what would a good algorithm do it would ensure small counts for any action that has a high gap a high regret if an action has a particularly low gap below Delta so it's pretty close to the optimal action or it's even optimal then you don't particularly care that this number of times you selected it is small it can be fairly large especially for the optimal one you essentially want that to be large because then your total regret would be smaller and this turns out to be useful in the analysis as well because then you can reason about okay we want one of these two to be small at least both small is always good but if say the gap is very big then we definitely want n to be small and then the question is can we can we insure that somehow so again the actual regrets are not actually known so this is more for the analysis of the algorithm but first we consider a much simpler simpler case something we also considered in the first lecture we could do something just add a little bit of randomness we could add a little bit random as to the policy essentially by picking a random action uniformly random across all of the actions on each time step with a certain probability now this is actually maybe the most common exploration strategy in current day reinforcement learning also in big applications for instance this is this was used for the results of the Atari games for the dqn algorithm and it turns out to be quite effective even if it's maybe not the most sophisticated exploration method the way it works is very simple we call this epsilon greedy and it basically means you select the greedy action with one minus Epsilon and with probability Epsilon you select a random action the random action can also include the optimal action so essentially the sorry too greedy action so essentially the probability select ingredi actually slightly higher than one minus Epsilon because you also have to factor in that you can select it when you select randomly now question is is this enough and how do we pick the Epsilon so we're already mentioned that the greedy policy can lock into a suboptimal action forever and we showed this in the example with the rats now it turns out the Epsilon greedy algorithm it continues to explore if you have a fixed Epsilon which essentially means it will always eventually find out everything there is to know about all reactions all of your action values will become accurate in the long run which is good so it will notice at some point it will basically learn which action value is truly higher than the other ones however if your absolute true is really constants this also means you have a linear expected regret simply because there's this probability and it continues to be the same probability in this setting that you'll select suboptimal actions so again the total regret hair grows linearly similar to how I did in a greedy case doesn't mean these algorithms are in some sense the same or equally good or bad I would definitely say this is better than greedy in a typical case because you do learn event you'll basically select the optimal action with probability 1 minus epsilon which is not guaranteed for the greedy case but the regret still grows linearly so you still lose something and it seems at some 
point unnecessary if you really know the true values of all the actions because you've selected all of them so many times do you really need to continue to explore you really need to try that action that you're pretty certain is not good the answer is no you don't have to and you can do better so sometime a couple of decades ago people were investigating this problem and they were reasoning about what is the best possible thing you could hope to get and turns out you can say something about that and turns out it's it's basically related to those action regrets we talked about before it's related to the similarity in a sense between the optimal action and all the other actions and then it turns out the harder problems are the ones where these actions have similar distributions but different means the reason being if these distributions for the rewards are similar for different actions it's hard to distinguish them and therefore it takes longer to basically find out which is the true optimal action you didn't describe this formally with something that's called KL divergence if you don't know what it is that's fine I think it was in the entry quiz but it's basically just a similarity between the distributions and then lion robins proves that the total regrets asymptotically so when T goes into the limit will be larger than this quantity there at the right hand side now this quantity has a summation over the actions with all the action gaps in there and it divides by the similarity by the distributions I would say all of that it's not that important right now the more important bit is the log T this basically means that the total expected regrets that you will incur for any algorithm will grow on the order of logarithm of time now this means that the regret will grow unbounded which is expected because you never read you're never really sure so you maybe have to continue to explore but it grows a whole lot slower than linear logarithm of T is a whole lot smaller for large T than T itself so then the question is can we find an algorithm that actually change this bound can we find something that that gets close to this because this is a lower bound this doesn't say which algorithm to run it basically says for any algorithm you will at least get this regret so now before we get to accrete algorithm that might attain that let's talk about the intuition behind what you could potentially do and I've talked already a little bit about uncertainty now let's assume we have some measure of uncertainty these are three different actions and we're fairly certain about the highly peaked one in red we're less certain about the one in the middle there in blue and we're really uncertain about the green one there at the bottom I didn't specify how to get these uncertainties I'll talk about that later but just for now assume that we have these which actually should we then pick and the argument here is that if you're more uncertain about the value of an action maybe it's more important to explore that action but maybe at first let's zoom in a little bit maybe at first we'll just be a little bit greedy and we'll try the action with the highest mean because we do think it looked quite promising and I mean there's other actions that might be up to all but this one has a good shot of being optimal as well so we select this and let's say that we get a reward that is a little bit on the low side of what you might expect and then we might update a distribution not saying how just this is just an intuition right I'll 
just say maybe we'll update our distribution a little bit towards that maybe we'll shrink it a little bit we're a little bit more certain of the more and more samples we get that the actual mean is there but this means we're a new situation now where the red has shifted a little bit and maybe maybe not looks more promising to select the green one for once because there is quite a bit of probability mass there on the right hand side which means that the green action might actually be optimal like the true value might be 4 or even higher for the green action for the red action that's very very unlikely right now so select the Rhian action will observe something which in this case actually is indeed higher than the mean of the green action so maybe we were under estimating the value of the green action a little bit and then we can shift the distribution now this is just to give you a little bit of an intuition of what you could do but what we were actually doing there is being a little bit optimistic about which action might still be possible especially in the second part or a selecting green action and one way to do that is to basically define an exploration bonus which we call upper confidence in this case for each action value and we'll do that so that the true value with a high probability is smaller than your current estimate plus that bonus so you'll take your current estimate you'll acknowledge that we are uncertain whether this estimate is correct and we'll add a bonus so that we're pretty certain that at least the values below what you didn't get so if your true estimate is say zero you might add a bonus we'll say ten and then basically you claim here is okay I don't know what the actual true value is but I'm pretty certain it's less than ten and then we could have an algorithm I'll define this bonus in a moment but we could have an algorithm that just greedily selects among the actions but using not the estimate skew itself but the estimates plus the bonus now the intuition here is that this bonus should be bigger for actions that were more uncertain about and especially in the simple bandit case we can do that by just keeping track of the number of times an action was selected and we can basically intuit that if we haven't selected enough action often so if this n is small then probably our bonus should be big we haven't selected it often so we want to add a big bonus so that we're pretty sure that the true value still below there but if we've selected an actually many many many many many times we don't want to add a big bonus because this will be overly uncertain we're not actually that uncertain about value anymore so small bonus is enough what this means then is that we'll select an action even if it's estimate right now is low just because we haven't selected it often and this is especially true if you think about if there's a if there's an estimate at some point in an estimate of a different action which is higher you might select it higher it valued actually quite quite often greedily but at some point the bonus will shrink for that action enough so that the bones of the other action will over overtake it and you'll select the other action just to try it out the exploration now for normal flat averages as we've discussed before the uncertainty about where the true value is typically decreases as the square root of the number of times you selected an action just this is just using the central limit theorem and small potential caveat this assumes that the variance of the reward 
is bounded this is typically the case but there are distributions like the Koshi where the variance is actually unbounded and in this case you basically have no hope but that's very rare so using this intuition can we derive an optimal algorithm so the algorithm idea is as follows recall that we want to minimize the total regret which I've written down here as number of times you selected an action times how bad was it the selected action that's the Delta yes yes yes yes there's a very good question so at T equals zero what should this upper bound be what is typically done is to say your upper bound is basically infinite for each action at T equals zero which in practice means you're going to select all the reactions once and only then this kicks in because then you'll have an estimate for each of them it's very good question it's also related to the assignment because you're going to have to implement something like this so rather than implementing infinities it's better just to select each of them one time let's say okay so the idea of the algorithm will be we want to minimize the product of how often you select an action times the the regret of that action so if the gap is big if this action is actually really quite bad compared to the optimal action then we want the number of times we select it to be small on the other hand if the number of times you selected an action is big we want this gap to be small we want the algorithm to have these properties now not all n can be small because they must sum to the total time you must select an action on every time step so the sum of all the number of times you selected an action must grow over time so all we can hope to do is to more often select the actions with a low gap than the actions with high gaps okay so in order to reason about this and this is what's used for the analysis of these algorithms there's something called huff dings inequality some of you might know this but if you don't this is something of a more general family of results which are called concentration bounds and what it means is we can say something about how this estimate behaves without characterizing the whole distribution and in this case what we're going to assume is that we have some random variables think of these as your rewards and we're assuming they're bounded you can also there's different qualities you could use when you don't assume they're bound about say the variance is bounded but in this case we're just going to assume the rewards are bounded we know the rewards are rolls between say 0 & 1 don't have to be between 0 & 1 you could have different bounds and typically we have but just for the simplicity of the theorem we'll say between 0 & 1 and let's say we averaged these as we were doing before so our current estimates for the expected rewards just the average reward you've seen so far then turns out we can say something something which is that the expected value of these random variables these are iid so we can just pick any one of these the probability that that thing is bigger than the the estimate you have right now plus some bonus U is bounded by this quantity on the right hand side now what does this mean it basically means that if you've selected a number of points and your average these is it an add bonus that you can bound the probability that that thing that you got there this mean Plus you that this thing is is still an underestimate and the bigger you pick you the smaller this probability will be if you pick a really big value add it to your 
mean right in this case it's it's almost trivially true during the the rewards are bounded so if you would add 1 you would be a hundred percent sure in some sense that you'd be overestimating but you can bound this thing with this function on the left which is now a function of the number of times you've selected you've sampled this random variable N and this gap and note that it decreases for both so the more often you select something if you consider the same bonus if you select it more and more often it becomes less and less likely that you're going to be sufficiently far off what this essentially means this will be fairly close within you of the actual true value after a while if you consider a bigger gap so if you consider a bigger discrepancy between your current estimate and the true value this will also become smaller so the higher you pick your bonus the less likely it is that you're going to offer estimate now the idea is to apply this inequality which is true for basically in general under this under the assumptions of the theorem to the to the bandit case with bounded rewards so for instance like I said if you pick the rewards will be between 0 & 1 then you can bound how far off is my estimate so the estimate there is the big cutie and we add some bonus which I haven't defined yet but let's just consider some bonus then we can bound the probability that this whole thing is still too low now how do you use this and to define an algorithm let's say we want to pick certain probability we want to pick a probability that the true value exceeds the upper confidence bounds that's what we call this estimate plus your bonus an upper confidence bound so now we can basically solve we can invert that we had this quantity on the right hand side of the previous slide and have things in equality and we just say let's pick a P and then we can solve for an upper confidence bounds and it turns out to be this quantity the square root of the minus log P divided by 2 times the number of times you selected that action now and this is just an intuitive idea I'm not actually deriving the I'm just giving you an intuition let's say we want to pick that P in such a way that it decreases over time so what does that does that mean we basically wants the probability that we're going to that we're going to be too low with our with our bound we want that to decrease over time for instance as 1 over T if you do that and you plugged it into this bound over here you get something down there which is the square root of log T divided by 2 times the number of times you selected that action so turns out this 2 is not that important the main things are the log log T and the number of times you selected action and that both of these are in the square root now what this is what does this do picking the exploration in such a way so because the the the probability that were wrong essentially that our estimate plus bound is still an underestimate of the true value it decreases over time but it doesn't actually go to 0 and this means that will continue to explore indefinitely but eventually we will will almost lock into the optimal action will select it more and more often as time goes by because our estimates get more and more accurate and we're more and more certain that the the estimates are very close to the true value so this leads to concrete algorithm where we've now defined this bonus and like I said the 2 there wasn't to importance and I basically pulled it out and I made it into a parameter C so this is the bones 
that we're adding to our estimates we have an estimate Q and we're going to add something that is the square root of the log of T your time step divided by the number of times you selected this specific action so the quantity R over the bar doesn't depend on action this will just grow but slowly logarithmic ly over time so what does this mean in practice it means that for this action let's consider an action that you haven't selected in a long long while it means that this bonus will keep on growing because of the log T and if you never select it for for a very long time this bonus will grow and grow and grow until it goes higher than all of the other estimate plus bonuses for all the other actions at that point you'll select it when you select it the number of times you selected this action will go up which means the bonus drops now at the same time you've got a new sample so your Pugh your estimate for this Val for this action will also change it might go up might go down maybe you were under estimating the value of this action and maybe this value goes up so maybe it actually becomes more likely that your select is action again in the future or maybe your estimate was pretty accurate and you get a reward which is very much in line with the estimate you already have in which case it doesn't change much and then the bound the bound which has gone up slowly with the logarithm and then gone down when you selected it basically will ensure that it again takes quite a long time before you select it again so what the log T essentially does it is it bubbles up all of the action probabilities each of these actions basically will get selected indefinitely again because this bound keeps on growing but whenever you select it you kind of whack it back down again and then only if your actual estimate is high will you continue to select this action this is an important property so thank you when will you select an action either when you're uncertain n is small relative to the other actions or the estimated value is high and this is a general property we basically want to select the actions that we know are good with pretty high probability or that we're very uncertain about and this algorithm has exactly that property now turns out you can analyze this and you can consider what this algorithm actually does and Peter our did this 2002 and he proved a certain specific bound on the only regret now previous bout we discussed was a lower bound on a regret this is an upper bound on the regret for this specific algorithm and the upper bound interestingly is also of order log T which means that as far as the dependence on T is concerned this is optimal you cannot do better than log T that's what the other bound proved and this algorithm actually attains that now in practice this C quantity can be considered a hyper parameter and it basically ties into what is the shape your distribution how do these distributions of the reward actually look and sometimes it's better to better just to tune this a little bit and try to see if for your specific problem you can find a better value than the default value of say square root of 2 that's Peter our suggested ya one way to think about it is that this C corresponds to do you consider like a 95% confidence or a 99% confidence it changes over time because of the way the bound is setup the actual confidence but it's related to that yes so typical values for this are 1/2 1/2 around that it's a little bit hard to intuit exactly what it means but if you just play with it 
you'll find that for certain problems you'll get certain certain values just perform better and it's good to start this around this expiration around say 1 yeah yes yeah oh so this is just this is a bound right sorry in in this setting it's a bound so what we're basically saying there's a certain probability that you'll be far off and for each action there is such a probability this means that the probability distribution that we're considering right what what does P mean here it's either is your bound and over estimate is your estimate plus bound sorry is that an over estimate of the true value or is it an underestimate so there's only these two possibilities and essentially what we're saying is well we want there to be a probability that the true value I mean there's always a probability nonzero probability that your true value doesn't lie within your bound right you have a certain confidence interval in some sense and basically these reasons about does your true value fall within that confidence interval or doesn't it and we want that probability to have a certain property we want that to be sufficiently we want this bound to be sufficiently wide that we're pretty certain that that that the bonus that we're giving is meaningful in the sense that we're beyond the true value but we also want it to be we don't want to be overly uncertain more uncertain than we need to be so we want this probability to be not too high either or too low so there's all these two choices whether you're below or where the actual value is below or above the estimate plus your bonus so in that sense it's properly normalized yes it doesn't actually so what the more general statement of halflings inequality basically says there's bounds a and B and then the a and B just show up in the equation below I just didn't put them here for simplicity this is like the simplest statement of the of the theorem but it easily generalizes to other bounds and there are similar concentration inequalities for other cases for a for instance your your random variables aren't bounded but they have finite variance and you know that the variances is bounded and then you can do something similar you could have got get a similar statement but this is just the simplest simplest one in a sense hmm so we typically don't change see over time so what we did here is we pick the probability which we do change over time but that may the same she turns into a constant C because the rest already changed over time the change over time is captured in this log T and in the N but the C parameter we typically just pick one and we stick with it so you don't have to decay see this is different from when you do say epsilon greedy so what I mentioned is that if you have a constant epsilon so if you have a constant random expiration throughout learning that this is maybe not a good idea because you continue to select things that eventually you know are not like good you could also consider decaying this probability over time and people often do but then you have to pick a decay schedule and you have to be careful how you pick that turns out if you do you can actually also get logarithmic regrets if you're careful on how you pick your epsilon but I won't cover it in this lecture but this algorithm kind of automatically does that because of the log T and the division by the number of times you select an action and then the C is just just regulates how quickly you learn but essentially for any CEO eventually get there and but the constants in your in how quickly 
you do might change so you might get there slower and I'll have an example of that a little bit later I think so this algorithm is not that hard to implement you only need the number of steps you've had so far you need to keep cut a track of counts for all of the actions I need to keep track of the average reward for all of the actions and then interestingly the algorithm picks deterministically it always there who is one action let's assume that the arc max doesn't break ties randomly but maybe if it just picks the first one then it would determine asleep pick an action in each step which is in some sense deterministically the optimal choice given all the information you have so far for the exploration exploration trade-off there was a question back yeah yes so the lower bound which has showed before that was a very general statement about any algorithm that you can't do better than a logarithm regrets I won't go into detail how they arrived but the upper bound here essentially what you can do is you can consider this algorithm and then you can just plug that into half things in equality and what you can then do is you can use that to reason about what's how n changes over time the number of times you select an action and turns out the number of times you select an action if you use this algorithm depends on how suboptimal that action is so it depends on this action regret and it turns out it depends in such a way that the product of these things is always at most logarithmic in T for all the actions and this is the case either because the the gap is very small for instance the optimal action has a gap of zero so you don't mind selecting that one indefinitely often but if the gap is very big turns out the number of times you select that action is relatively small it'll only be role algorithmic in the number of times you selected any action at all and this turns out to be true done for all the actions with this specific choice of a bound if you want to actually see the proof which is quite it's not actually that complex if you have half things in equality you can go to that paper by a Peter our on the UCB and he just proves exactly this thing there at the bottom using the hospital in equality it's a bit technical there's a lot of steps involved which is why I didn't want to put it in the slides but given half things in equality it's not that hard to step through and there's been many follow-ups to this where with different assumptions also going back to that question about why 0 to 1 or you can make many other assumptions about the distributions and for specific assumptions you can prove like slightly stronger things it never goes better than log logarithmic in time and unless you make very strong assumptions about your problem but that's that's like this was this was a landmark result because it was the first algorithm that actually achieves that bound so in the in the previous lecture I've talked about an agent codes represents internally you could represent values and then use these values to come up with actions and then maybe this is a good way to to find good policies I also talked about maybe you can put a represent these policies directly I'll talk about a little bit later and we also talked about maybe you can learn a model now learning a model in general in this course we won't touch upon that much also because it's quite it can be quite tricky to learn a true model especially of the dynamics of the world the way the world changed might be very hard and it might also be very hard 
So, in the previous lecture I talked about what an agent could represent internally: you could represent values and then use those values to come up with actions, which is maybe a good way to find good policies; you could also represent the policy directly, which I'll talk about a little bit later; and you could learn a model. Learning a model in general is something we won't touch on much in this course, partly because it can be quite tricky to learn a true model, especially of the dynamics of the world — the way the world changes might be very hard to learn — and it might also be very hard to know what to focus on. But in this case, because there is no state and there are no observations, there are only these rewards, learning a model is perhaps much simpler.

So what could we do? We could have a reward model that predicts the reward for each action. That looks very similar to the value-based algorithm in this case; in fact the two might be indistinguishable in some sense, because you can interpret the action-value estimates as a reward model — they're just a prediction of the expected reward when you select an action. But that doesn't mean we've exhausted everything you can do with a model-based approach. In particular, we could model more than just the expectation: we could model, for instance, a distribution over the rewards. And this gives us Bayesian bandits.

The idea is that you model a parametric distribution, where the probability here is a Bayesian probability: you can read it, if you like, as a belief about where the true expected value lies. To be clear — and this comes back on later slides — we're not modelling the distribution of the rewards themselves; we're modelling a belief distribution over where the true value q is, and we update that belief using the probability of the observed reward under the model. So theta represents the parameters of your parametric model. For a Gaussian model, for instance, this could be your mean and your variance, the two parameters that together define a Gaussian; for a multivariate Gaussian these might be vectors and matrices, but in the simple case it's just a mean and a variance, two numbers, and this characterises a Gaussian that could be a model for where you think the true value lies.

You can then use Bayes' rule to update these distributions, starting from a prior and updating the model parameters each time you see a reward. We'll keep everything separate for each action — it's all conditioned on the action — so we basically have a separate model per action, and we update it each time we see a data point for that action.

So what does this give us? One thing it gives us is the ability to inject rich prior knowledge. Say you have a Gaussian: theta is your mean and variance, and you might pick the mean to be in a reasonable place and the variance to be wide enough that you're fairly certain the true value lies within the distribution, but not too wide. The initial mean might even be different for different actions if you have some prior knowledge about them, and this gives you one way to inject that.
Then we can also use these distributions to do the exploration, and I'll discuss basically two different ways: one is again upper confidence bounds, but now using the explicit representation of these distributions, and the other is probability matching, with Thompson sampling falling under that header.

Let's consider an example to make things concrete: a bandit with a Bernoulli reward distribution. This simply means the rewards are either 1 or 0, with an unknown probability, so the expected reward is exactly the probability that the reward is 1, and we want to model a belief over that probability. For instance, we could pick the prior for each action to be uniform on [0, 1], which means we have no specific preference: in advance we believe all of these probabilities are equally likely to be true — we're not saying that the probability of the reward being 1 being, say, 0.6 is any more or less likely than it being, say, 0.2.

If you have Bernoulli random variables, it's natural to model these probabilities with Beta distributions, and the assumption of a uniform prior is equivalent to a Beta distribution with parameters 1 and 1. If you don't know what a Beta distribution is, it's just a probability distribution — I'll show it on the next slide — with two parameters that essentially count how often you've seen a zero and how often you've seen a one; confusingly, both counts are offset by one, so the uniform case, which means we have no knowledge, has both parameters equal to one. The way to update the posterior is then very simple: whenever the reward is zero you increment one of the parameters, and whenever the reward is one you increment the other — you're basically counting, for this specific action, how often you saw a zero and how often you saw a one. That's nice, because in general Bayesian updates can be quite involved — you might have to compute integrals — but in this case, because the distribution fits the Bernoulli rewards so neatly, it's particularly easy to work with: we update it simply by updating the parameters of the distribution in this way.
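As a small sketch (my own code, not from the slides): for one action with a Beta(1, 1) prior, the posterior update really is just counting, as described above; here it's run on the reward sequence 1, 1, 0, 0 used in the walkthrough that follows.

```python
# Beta posterior over the success probability of one action,
# starting from the uniform prior Beta(alpha=1, beta=1).
alpha, beta = 1.0, 1.0

for reward in [1, 1, 0, 0]:          # the sequence used in the walkthrough below
    if reward == 1:
        alpha += 1                   # count of observed ones
    else:
        beta += 1                    # count of observed zeros
    mean = alpha / (alpha + beta)    # posterior mean of the true value
    print(f"after reward {reward}: Beta({alpha:.0f}, {beta:.0f}), mean {mean:.2f}")
```

After the full sequence the posterior is Beta(3, 3), symmetric around one half, which matches the final picture described below.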
Now, how does that look? Suppose we start all the way on the left: we have no information, we make no judgment calls, we say all of these true values are equally likely — that's what the leftmost picture means. The y-axis is the probability density we assign to each possible true value, and the x-axis spans the possible values from zero to one; in this case we know the true value lies between 0 and 1, because the rewards are either 0 or 1, so the expected reward must be between 0 and 1 and there's no probability mass outside that interval, but within the interval we make no judgments.

Now suppose we see this sequence of rewards: a 1, another 1, then a 0, and then another 0, all for a single action — we're considering how to update the distribution for one action. Here's what happens to the probability distribution that captures our belief about where the true value is. First we get a 1, so the distribution shifts towards 1. Unlike the greedy case, it doesn't go all the way there — we're not committing completely, we're not saying the value must now be 1 — we're just saying the probability that it's close to 1 has grown; the density at exactly zero is now zero, and the probability that it's close to zero is also fairly small. If we see another reward of 1, the distribution becomes even more skewed, because now we have some evidence that a high value is more likely than a low one; but it's still fairly spread out — there's still non-negligible probability mass on the values below one half, much smaller than on the values above one half, but still quite some mass. If at that point we see a reward of 0, the distribution shifts: we now know the true value cannot be 1, so the density falls to zero at that end — we already knew it couldn't be 0 — and the mean is somewhere in the middle, probably still a little closer to 1 than to 0 because we've seen more ones than zeros. If we then see another 0, the distribution becomes symmetric. You can see there's still quite a bit of uncertainty: we don't yet commit strongly to where the true value is, we just know it's not 0, it's not 1, it's quite unlikely to be very close to either, and somewhere in the middle we really just don't know. That's what's captured in this probability distribution — the Beta distribution, if you update the parameters exactly as shown on the previous slide.

So what could we then do, once we have these distributions for the different actions? One thing we could do is very similar to what we did before with UCB, except that instead of picking the bound that defines your confidence interval from a concentration inequality, you look at the distribution itself and define a confidence interval on that. A simple way is to define an upper bound related to the standard deviation of your distribution. That works especially well if your distributions are Gaussian, because then the standard deviation is enough in a sense — you don't have to capture the shape, which might be lopsided in general; for other distributions there are other ways to pick these values. Then you do the same thing as before: take the mean, add a bonus related to your uncertainty about where the true value is, and pick greedily using that. That's one algorithm you could use.

But actually, when we model these distributions, the more common thing to do is what's called probability matching. In probability matching we do something that is maybe a little unintuitive — or very intuitive, depending on how you look at it: we select an action according to the probability that it is optimal. That is, we define a policy and set it equal to the probability that the action's true value is the maximal true value, given the history, where I've abstracted away that the history has been used to update these Bayesian distributions.
Note that this is again a Bayesian probability: in truth, each action either is the optimal action or it isn't, but this probability is under your Bayesian beliefs about where the true values lie. Because you have all of these distributions explicitly, you can reason about this — you can, in principle, compute the probability that each action is optimal under the distributions you've updated so far. In general it can be difficult to compute this analytically. Note, by the way, that uncertain actions will have a higher probability of being the maximum, because there may be a lot of probability mass in the tail, so you're more likely to select them — which is exactly the property we're after.

Because computing this directly can be unwieldy, there's another thing we can do, which is called Thompson sampling — if I remember correctly, off the top of my head, it's from 1933, so it's pretty old. We sample from each of these distributions: we have a belief distribution over the true value of each action, we draw one sample from each of them independently, and then we select greedily among the samples. This means that if you're very uncertain about an action, there's a decent probability that you sample a candidate true value that is really high, and also a decent probability that you sample one that is really low — so you won't select that action all the time, but you will select it every so often. Again, either your uncertainty must be large for an action to be selected, or its current mean must be high. It turns out that if you do this, in expectation you're doing probability matching, and since you can only select one action per time step anyway, you're not losing anything by doing it this way.

It also turns out that for the Bernoulli bandit case we just discussed, Thompson sampling achieves the optimal lower bound on regret, so in that sense it's an optimal thing to do. That's a little surprising, because we're selecting actions according to our belief that their true value is the maximal one, without explicitly taking into account how the regret changes over time or how much we learn from picking an action. In practice it also works quite well. It does require you to get these posterior distributions from somewhere, which might be a lot of overhead in complex cases and might be hard in general, but if you can do it — and for Bernoulli bandits it's quite easy — then you can use Thompson sampling, and it's essentially as good as UCB, in a sense.
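A minimal sketch of Thompson sampling for the Bernoulli case just described (my own code and naming; `pull` is again a hypothetical callback that returns a 0/1 reward): keep a Beta posterior per action, sample one candidate value from each, and act greedily on the samples.

```python
import numpy as np

def thompson_bernoulli(pull, num_actions, num_steps, seed=0):
    """`pull(a)` returns a reward of 0 or 1 for action a."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(num_actions)   # 1 + number of ones seen per action
    beta = np.ones(num_actions)    # 1 + number of zeros seen per action
    for _ in range(num_steps):
        samples = rng.beta(alpha, beta)     # one candidate true value per action
        action = int(np.argmax(samples))    # greedy with respect to the samples
        reward = pull(action)
        alpha[action] += reward
        beta[action] += 1 - reward
    return alpha, beta
```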
So what does Thompson sampling not do? One thing it does not explicitly reason about is the value of information. We could go further and reason about the learning process itself, because we know what that process is: we know we're going to get a reward, we're going to update certain internal estimates of the agent — whether a model or a value — and we can reason about what we learn from each interaction with the world. So potentially we could reason all the way through this, taking into account the knowledge we have so far.

This is sometimes called the value of information. We could plan through this whole tree of potential future lifetimes the agent might go through, plan through it somehow optimally given the assumptions we can make about the problem, and maybe quantify how useful it is to select a certain action: by planning through, we can find out how much we would learn from that action, depending on its outcome, and how that would change our later actions.

If you want to quantify the value of information, it makes sense that you gain more information from trying something you're uncertain about. If you're already very certain about an action's value, you gain basically no information by selecting it again — you already knew — so that's sometimes not the best thing to do; according to the value of information, it might make more sense to try uncertain things more. And if we know the value of information, we can trade off exploration and exploitation again, provided we can quantify these values somehow in terms of future expected reward. Essentially, the value of information is how much you can gain later by gathering this information now: I try something I'm uncertain about, that gives me some knowledge, and I can reason about how much I can exploit that knowledge later.

What this means is that, whereas so far we've viewed bandits as a one-step decision problem — every episode lasts exactly one step and there is no interaction between episodes — we can also view the whole thing as a sequential decision problem, by reasoning across the futures that might happen, taking our own learning process into account. Rather than planning through what the environment might do, we're planning through what might happen inside the head of the agent, in a sense.

How do you do this? You define an information state summarising all the information accumulated so far, and then for each action you model the transition to a new information state — adding the new information — with a probability that you can define. If you have a model of your rewards, you can reason through this without actually executing an action: for each action you can say, I believe it's this likely that I get a 1 and that likely that I get a 0, and then reason through the resulting internal states of the agent, and through what the agent would do next in each of those cases. In each case you might select a different action, which might again give you a 1 or a 0, and so on — so you get an expanding tree. In addition to the random outcomes of the actions, you can reason about how likely you are to select each action if you use a probabilistic method: with UCB the action selection is deterministic but the outcomes are still random, so you still get an expanding tree of possibilities, but if your policy is itself stochastic, as it is for Thompson sampling, then there is an additional branching at every step related to how many actions there are.
So this becomes a huge tree at some point, and it's very hard to reason through, but if you can make certain assumptions you can reason through the whole thing. Essentially what we have here is a Markov decision process, which may or may not be known: for each internal model, and for each internal way of selecting actions, you would get a different Markov decision process in a sense, and then you can try to solve that. So now, even in bandits, previous actions affect the future — not because they change anything in the world, not because they change the state of the environment, but because they change your knowledge.

For instance, for a Bernoulli bandit, if I use mu to denote the mean of the Bernoulli distribution — which in this case is also just the value of the action — then the probability of getting a 0 is 1 − mu; you can think of this as winning or losing a game with that probability. The information state might then simply be, for each action, how many times did I get a 0 and how many times did I get a 1. The prior knowledge we're putting in is that the rewards are 0 or 1, and then we can reason all the way through this.

We can formulate this, as I said, as an infinite MDP that goes infinitely far into the future, where we reason: if I do this, then this might happen, and that changes my mind for the next step, and so on. But you could also solve it with model-free reinforcement learning — you don't have to plan through everything, you could just sample and reason through it that way, and this has been done in the past — or you could use model-based reinforcement learning to reason through how these information states change over time. This becomes unwieldy, because the tree can be quite big, but next lecture we'll talk about things like dynamic programming, where you don't actually build the whole tree but proceed step by step, which can be more efficient.

The latter, model-based approach is known as Bayes-adaptive reinforcement learning. Of course you still need to define a prior distribution — you need to put in the prior knowledge you have about the problem, what you assume at the beginning about where the true values lie — and your solution will be conditioned on that. If you pick a reasonable prior this can be quite good, in some sense optimal given your prior, but it can be very unwieldy and it's unclear how it scales, especially later when we consider problems that are not bandit problems, where there is also an environment state and there are observations. So we won't go much deeper into this; the main thing to take away is that you can reason through your learning process if you want to — whether it's a good idea is more of an open question.
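To make the information-state idea concrete, here is a small sketch in my own formulation for the Bernoulli case: the information state is just the pair of counts per action, and under a Beta(1, 1) prior the believed probability of each outcome — and hence of each successor information state — is available in closed form as the posterior predictive.

```python
# Information state for a Bernoulli bandit: (#ones, #zeros) per action.
# With a Beta(1, 1) prior, the believed probability of seeing a 1 next is
# (ones + 1) / (ones + zeros + 2), the posterior predictive probability.

def successors(state, action):
    """Yield (next_information_state, believed_probability, reward)."""
    ones, zeros = state[action]
    p_one = (ones + 1) / (ones + zeros + 2)
    for reward, prob in [(1, p_one), (0, 1 - p_one)]:
        next_state = list(state)
        next_state[action] = (ones + reward, zeros + (1 - reward))
        yield tuple(next_state), prob, reward

# Example: two actions, nothing observed yet.
initial = ((0, 0), (0, 0))
for nxt, prob, reward in successors(initial, action=0):
    print(reward, prob, nxt)
```

Expanding these successors recursively is exactly the tree described above; Bayes-adaptive methods try to plan through it rather than enumerate it exhaustively.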
Okay, so now I want to change gears again. We've talked about values and we've talked about models; now I want to talk about policies. This will be important for several reasons: one is that it will show up in the assignment, which may or may not be important to you; another is that this is an approach that does scale — whereas reasoning through your whole learning lifetime might not scale as easily — and it has been applied to problems at scale.

So what's the idea? We want to learn a policy directly. We're not necessarily going to learn a value function, and we're not going to learn a model; we're just going to parameterize a policy and try to learn that, and the question is how. For instance, we can define a probability distribution over the actions like this — it's called a softmax, or sometimes a Boltzmann distribution. The denominator is just a normaliser, and it means that your probability of selecting an action is proportional to the exponentiated preference H(a). I don't particularly like the letter H for this, because we've also used it for the history, but it's the letter used in Sutton and Barto, so I wanted to be consistent with that. H just means preference: these numbers don't have to correspond or converge to the value of selecting an action, they're just numbers that say how much you prefer one action over the others. So we've parameterized the policy by assigning a number to each action and defining the policy like this. This is a stochastic policy: the probability of selecting each action is between 0 and 1, and if the preference for one action is much higher than for all the others, then in the limit the probability of selecting that action goes to 1 and the probabilities of the others go to 0. We're going to view these preferences as learnable parameters and reason about how we can optimise them.

Yes — the question was whether these preferences have the same relative relationship to each other as the values would, and the answer is no. One way to think about it: suppose a certain action should end up being the one you select. There are multiple ways to achieve that, but all you need is for the preference of that action to be much, much higher than for all the other ones — the relative ordering of the other preferences doesn't even have to match the ordering of their values. That ordering would be slightly wrong in value terms, but if one action completely dominates because its preference is much higher, it doesn't show up in the policy, and the policy doesn't care. Another thing to note is that you can add any constant to all the preferences and it disappears from the probabilities — you can shift them up and down; it's fairly easy to show, though I won't show it here — which again shows that these numbers don't have the semantics of a value function.

Sorry, what was the question? So: what influences the preference of an action, if it's not the value? I'm going to show you an algorithm that updates these preferences directly, and what I'm saying here is that it will not be equivalent to learning the values: the preferences won't necessarily go to the values, and we won't even learn the values. What we will do is update the preferences so that the resulting policy attains higher value.
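Written as code (a minimal sketch with my own naming), the softmax policy over preferences looks like this; note that adding the same constant to every preference leaves the probabilities unchanged, which is the shift-invariance point made above.

```python
import numpy as np

def softmax_policy(preferences):
    """Turn action preferences H(a) into action probabilities pi(a)."""
    shifted = preferences - np.max(preferences)  # shift for numerical stability;
                                                 # also illustrates shift-invariance
    exp_h = np.exp(shifted)
    return exp_h / exp_h.sum()

h = np.array([2.0, 0.0, -1.0])
print(softmax_policy(h))        # approximately [0.84, 0.11, 0.04]
print(softmax_policy(h + 5.0))  # identical probabilities
```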
So in some sense the values do influence these preferences, in the sense that we're going to sample rewards, those rewards will be higher or lower, and the preferences for certain actions will go up or down depending on them. The rewards definitely still influence your preferences, and the true values still do too, in the sense that eventually we want the preference for the optimal action — the action with the highest value — to also become the highest preference. But there is no direct tie: you can't write these preferences down as a simple function of the values, and they also depend on the algorithm you run to update them. Here I've only parameterized the policy; I haven't yet given you an algorithm. This is just one way to write down a parametric policy — one concrete instantiation — and the question is how to learn such a thing, both in general and for this one specifically. I'll talk about both, but first the general idea.

The idea is to update the policy parameters — which might, for instance, be those preferences — so that the expected value increases. For instance, we can consider gradient methods, because we have a lot of knowledge about how to do gradient descent on things, so this seems like a valid and good algorithm to try. And what do we want to do? We want to do gradient ascent on the expected value — ascent rather than descent, because we're not reducing a loss, we're increasing a value. In the bandit case, the algorithm we want is this: we have policy parameters, which I've now called theta — just think of it as a vector, which might for instance be all of your action preferences in the case we just discussed, but I'm talking about a slightly more general case here — and we add to these parameters a step size times the gradient of the expected reward given those parameters. I didn't write it on the slide, but these parameters of course influence your policy — the previous slide gave a concrete example of that — and that is what the conditioning means: the expectation of the reward is conditioned on the policy parameters because they define your policy, and therefore the expectation changes if you change the parameters.

So these are the policy parameters we want to update, and the question now becomes: how can we compute this gradient — is it even possible? We have the gradient of an expectation, but we don't know the expectation in general. Here's an important trick, which I'll show again in a later lecture as well, but I'll show it here for the bandit case. It's sometimes called the log-likelihood trick, and in reinforcement learning it's also known as the REINFORCE trick, because Ronald Williams had a paper that used it to do policy gradients. I'll step through it, because it's kind of cool for one, and it's important to understand what's happening. I take the thing we want — it's at the top left, the gradient of the expected reward — and I'm just going to write it out. The first thing I do is be explicit about the policy.
Why is this an expectation? For two reasons: we have a stochastic policy, which gives us an action, and given that action the reward itself might be stochastic. So first we pull out the action: we write down the policy explicitly, the probability of selecting each action, and what remains is the expected reward conditioned on that action. Note that this no longer depends on your policy parameters, because we've pulled all of that dependence out into the probability of selecting the action. Also note that the expected reward conditioned on an action is just the true action value, so we can write it as q — that's just plugging in the definition. Another thing we've done is move the gradient, the nabla, inside the sum: that's allowed because q doesn't depend on your policy — it's already conditioned on the action, and since we're in the bandit case there's no sequential structure, so given the action the expected reward no longer depends on the policy.

On the next line we do something a little weird: we multiply by one, in the form of the probability of the action divided by the probability of the action. This is valid, of course, because it is one, and the reason to do it is that we can then rewrite: we pull the probability of selecting the action all the way to the front again, so that the sum becomes an expectation again. So I've pulled the division next to the gradient of the policy and moved the probability of the action to the front, which by definition lets us write the whole thing as the expectation at the bottom. One other thing I did down there — which you can do; I could also have kept q — is to replace the true value by the reward: this expectation is over everything (and sorry, I should have conditioned it on theta as well, since it depends on the policy parameters), and the expectation of the reward is the same as the true value by definition, so I wrote the reward again rather than the true value. We also drop the explicit policy probability at the front, because it is now what defines the expectation — it's an expectation over your policy — and the part that remains inside is the gradient of the policy divided by the probability of selecting that action.

Finally, there's a general identity: the derivative of something, divided by that something, is the derivative of the logarithm of that something — if you have log x and take the derivative with respect to x, you get 1/x — so that gives the equality at the end. People typically write it this way, with the gradient of the log probability; Rich Sutton in his book prefers the other way, keeping the explicit division by the policy, but they're the same.

So why did we go through all of this? Does anybody have an idea? Yes, exactly: now we have an expectation on the outside. Very good. So now we can sample this thing.
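Written out compactly — using q(a) for the true action value and R_t for the sampled reward, as in the slides — the steps just described are:

```latex
\nabla_\theta \, \mathbb{E}[R_t \mid \theta]
  = \nabla_\theta \sum_a \pi_\theta(a)\, q(a)
  = \sum_a q(a)\, \nabla_\theta \pi_\theta(a)
  = \sum_a \pi_\theta(a)\, q(a)\, \frac{\nabla_\theta \pi_\theta(a)}{\pi_\theta(a)}
  = \mathbb{E}\!\left[ R_t \, \frac{\nabla_\theta \pi_\theta(A_t)}{\pi_\theta(A_t)} \right]
  = \mathbb{E}\!\left[ R_t \, \nabla_\theta \log \pi_\theta(A_t) \right]
```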
We couldn't sample before, because the expectation wasn't all the way on the outside: we had the gradient of an expectation, and sampling the thing inside that doesn't even make sense — we could sample the reward, but the gradient of the reward with respect to the policy parameters isn't a thing, the reward is just a scalar. Now that the expectation is on the outside, we can sample the thing inside it, and that gives an unbiased estimate of the quantity we started with. So this is pretty cool: we can get an unbiased sample of the gradient of the value.

Summarising the result from the previous slide: the gradient of the expected reward equals the expectation of the reward times the gradient of the log probability of the selected action. We can sample this and turn it into an algorithm. The stochastic gradient ascent algorithm simply becomes: update the policy parameters by a step size times the sampled reward times the gradient, with respect to the policy parameters, of the log probability of the action you actually took. It's important that it's the action you actually took, because the reward you sampled is for that action. This is stochastic gradient ascent on the true value of the policy — I talked about the exact gradient first; this is the stochastic version — and we know that in practice stochastic gradient methods work just fine, as long as the steps aren't too big. In the end you'll hopefully find a point where the gradient is zero; in simple cases, if your objective is concave (we're doing ascent, so concave rather than convex), you'll find an optimum, and in general you might instead get stuck in a local optimum — that's the nature of gradient ascent and can't really be prevented — but you get all the benefits of gradient methods. In particular, and I'll come back to this later, what's nice is that your policy can now be a very complex thing — it can be a deep network, because we know how to compute gradients through deep networks — and that's something you couldn't get with the more involved algorithms we covered before. Note also that we don't need a value function estimate: we just use the sampled rewards, so we don't necessarily have an explicit notion of value, only a parametric policy.

Now we can instantiate this to make it a little more concrete. Let's go back to the policy parameterized with preferences and see what the algorithm looks like. It turns out — and I encourage you to try to derive this yourself; I think Sutton and Barto have the derivation in the book — that for the softmax, the partial derivative of the log probability of the selected action A_t with respect to the preference of a potentially different action a is the indicator that a equals A_t, minus the probability of selecting a. Note that a and A_t are not necessarily the same action. I find that a bit opaque written that way, so I find it much clearer written out as at the bottom of the slide. Essentially what it means is that you selected a certain action, and you then update all of the preferences.
Interestingly, because the policy is parameterized with these preferences, there are two ways to make a given action more likely: you can increase the preference for that action, or you can decrease the preferences for all the other actions. It turns out the algorithm does a little bit of both: it pushes the other actions down and pushes this action up. This again shows that these preferences aren't values: you've learned nothing about the values of the other actions, but you can still learn to prefer them a little more or a little less.

The way it works is that you update your preferences depending on the reward. Consider a simple case where the rewards are either +1 or −1, and say you selected a certain action and the reward was +1. The preference for that action will increase, and it will increase in proportion to one minus the probability of the selected action: if you were already selecting that action with probability close to one, there will be little change — it was a good idea, you got a +1, but you were already selecting it all the time; if the probability of selecting that action was very low, you make a bigger adjustment, because something good happened that you didn't expect. The other preferences, when you get a +1, go down: you selected something and it was a good idea, so you want to select the other actions a little less. Conversely, if the reward was −1, the preference for the action you selected goes down and the preferences for the actions you didn't select go up. In total this means you will select actions that turned out to be a good idea more often and actions that turned out to be a bad idea less often.

This gives you a nice, intuitive algorithm, but remember that it was simply derived: we parameterized the policy in a certain way and did gradient ascent on the value, and this intuitive rule — shift your preferences this way or that depending on the outcome — just emerged from the derivation. There are slightly less intuitive cases too. For instance, if all of your rewards are strictly positive, the update will, by its nature, always push up the preference of the action you selected. However, it pushes up the preferences of the actions with high value more than the preference of the actions with low value, and so in total it again reaches the conclusion that in the end you only select the action with the highest value, even though all the preferences get pushed up. That's a bit less intuitive, but it's the same result in the end — it just shifts your preferences. Okay, is this clear? Yes?
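Putting the pieces together, here is a minimal sketch (my own code; `pull` is again a hypothetical reward callback) of the resulting gradient bandit algorithm: softmax over preferences, sample an action, and update every preference with the indicator-minus-probability rule just described.

```python
import numpy as np

def gradient_bandit(pull, num_actions, num_steps, step_size=0.1, seed=0):
    """`pull(a)` returns a sampled reward for action a."""
    rng = np.random.default_rng(seed)
    h = np.zeros(num_actions)                 # action preferences H(a)
    for _ in range(num_steps):
        pi = np.exp(h - h.max())
        pi /= pi.sum()                        # softmax policy
        action = rng.choice(num_actions, p=pi)
        reward = pull(action)
        one_hot = np.zeros(num_actions)
        one_hot[action] = 1.0
        # REINFORCE update for the softmax: H <- H + alpha * R * (1{a=A} - pi(a))
        h += step_size * reward * (one_hot - pi)
    return h
```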
So, if I paraphrase the question correctly: how would you implement this if you want to solve a really hard problem — say there are pixels coming at you and it's hard to say what actually needs to be captured about the problem? What we want to use is a deep neural network and just learn the structure. One way to set it up is that the input is the pixels, you go through several layers, and at the output you have a vector whose semantics is the preference for each of the actions; then you use exactly this algorithm to learn those preferences. The gradient flows all the way into the network: here we were parameterizing with the preferences themselves, but the general statement is with the full weight vector of your neural network — all of the weights — so you push the gradient into the network with backprop, update all the parameters appropriately, and it slowly changes the preferences to move towards higher rewards. That's exactly what happens, and it's not just a theoretical possibility — this is used a lot in practice.

Yes — very good question: how would you deal with continuous action spaces? The action-preference version maybe doesn't apply as easily — you can still define preferences, but you somehow have to define a probability distribution over a continuous space, which is perfectly possible — and the approach applies more generally. To give two concrete examples that I'll come back to later in the course: one, you could define a Gaussian probability distribution, which is still a stochastic policy and still parametric, and do exactly the same trick. The gradients will look different — this slide is specific to the softmax preference version — but all we need is a stochastic policy with a well-defined log pi for each action, which any probability distribution gives us, so you can also apply this to, say, Gaussian actions in a high-dimensional space. The other thing you can do is something similar but different, where you have a deterministic policy that outputs a single continuous action; I won't explain exactly how that works, but there are also algorithms that use gradients to update those. Good question, thanks. This has actually been applied to balancing pendulums and things like that, for which you typically need continuous actions. You could do it by discretizing the action space and then using something like this, but that doesn't make a lot of sense in some cases, because you essentially want continuous actions that generalize across the action space, and this turns out to be very useful. I'll talk more about continuous actions later in the course.

We're almost out of time, so I'll be quick, but this is still important to cover. Note that the probabilities always sum to one — or integrate to one for continuous actions. What this means is that if you take any baseline b that does not depend on the action and does not depend on your policy parameters, then the sum over actions of b times the gradient of the policy is zero. The proof is kind of trivial and takes only a couple of steps: you pull b out, the sum of the probabilities is guaranteed to be one, the gradient cannot move that sum, so the gradient must be zero. Why does this matter? Because it means we can add a baseline to the reward in the update — typically you subtract it, replacing the reward R by R − b — without changing the expected update. One way to think about this baseline is as the expected value across actions; that's a common choice. Rich Sutton, in chapter two, proposes to just use the average of all the rewards you've seen so far as the baseline.
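As a sketch of the baseline version (again my own code, differing from the previous sketch only in the baseline): the reward in the update is replaced by reward minus a running average of all rewards seen so far, as suggested in chapter two.

```python
import numpy as np

def gradient_bandit_baseline(pull, num_actions, num_steps, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    h = np.zeros(num_actions)       # action preferences H(a)
    baseline = 0.0                  # running average of all rewards seen so far
    for t in range(1, num_steps + 1):
        pi = np.exp(h - h.max())
        pi /= pi.sum()
        action = rng.choice(num_actions, p=pi)
        reward = pull(action)
        baseline += (reward - baseline) / t          # average-reward baseline
        one_hot = np.zeros(num_actions)
        one_hot[action] = 1.0
        h += step_size * (reward - baseline) * (one_hot - pi)
    return h
```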
What is the purpose of this baseline? It doesn't change the expectation, because of the little proof we just did, so the expected update is still gradient ascent on your value; what it does do is potentially reduce the variance. Go back to the case where all the rewards are strictly positive: instead of always updating the preference of the selected action upwards, which would otherwise happen, the algorithm now moves it up if the reward was better than expected and down if it was worse than expected, where "expected" is now defined by the baseline. The baseline is just something you estimate, and it doesn't have to be particularly accurate — it doesn't have to be the true value of your policy — because you're only using it to reduce the variance of the updates.

The figure on the slide is from chapter two, so feel free to look at it in more detail there; it shows the effect this can have on performance: with a baseline, performance is much better than without, for this specific setup. The details of the setup are in chapter two; they're not on the slides.

Okay, I think we need to vacate the room. The one thing I wanted to say before we leave: if you go to the slides, there's a small bit at the end going back to the rat example — I hope I calculated it correctly — giving the probabilities for each of these algorithms. I'll see you next Tuesday.
to solve maybe a small mark of decision problem but in any case this these environments accepts actions from the agent the agent acts in these environments and environments in some sense responds by sending out a new observation or as I said before you could also interpret this as the agent basically putting this observation in if you prefer but in any case you get this interaction loop and then the idea is to optimize this interaction loop so in addition there will be a reward signal and then reinforcement learning is the science of how to learn to make decisions when you want to optimize this reward signal or maybe more generally when you want to learn about the world when you want when you want to learn to say predict future observations even if you're not optimizing anything and we talked about what could an agent then learn an agent could learn a policy a value function and or a model in fact will bump into each of these again in this lecture and the general problem involves taking into account time and consequences so especially this last bit is important because your actions might not just change your immediate reward but they might also change in some sense the world which means that then later decisions are affected because something has changed now it could be as simple as you moving somewhere means that the next state you are there rather than here which is a simple change in the environment it could also be that you change something structurally in the environment so that if later you come back to a certain situation it is now different than it was before so these decisions in general can affect the reward the agent state the internal state of the agents which includes for instance its memory and the state of the environment however in this lecture we're going to simplify things because in the simpler setting we can already meaningfully talk about exploration and exploitation how to make these decisions so what we're going to do is we're going to take away the sequential structure which means that past actions will now no longer influence the future states of the environment actually you'll have the opportunity to interact with a problem again and again without changing the problem formally what we'll have is well have a distribution of rewards where this is version will be identical for a given action across cross time so you'll basically be able to query you you can output an action you'll get a reward back you'll be able to query the system again and again and again without changing that reward distribution that's different from the general case and it's a definitely a simplification but it turns out that the problem that results from this is still rich enough to talk about many things and in fact the simplification also makes it possible to talk about some things that are harder to do in the fool case for instance we could talk about how to optimally solve the decision-making problem so this is a simple example of that just consider there's a rat and this rat has access to two levers a black one and a white one I realize now I was a little bit ambiguous in in this lecture when I say a black lever I think I always mean the one with the black background rather than the lever itself being black I'll get back to that whenever I actually call them a certain color just to make sure that we're we're on the same page so a Monday this rat pulls the black lever and gets a shock okay so maybe that's a bad leaf or maybe you just don't want to do that on Tuesday pulls the other lever 
and let's just assume there's just these two so Tuesday pulls two white leaves the one with a white background and some some latch opens and a piece of cheese falls out so the rats happy so then on Wednesday again the rat is given this choice you can pool either of these levers so what are you going to do does anybody have a suggestion for what the red should do yes I would also pull the white one if I were the rat but let's assume that the rat does pull the white one but now gets a shock so the question is okay what should you do next now maybe in this case you have a strong intuition for what your ass should do maybe some of you even disagree on what it what the rat should do maybe we can have a very quick poll could you raise your hand if you would pull the belief with the black background okay there's a small minority of course some people might just not raise their hand on either of these choices okay we'll get back to this example in depth later just keep it in mind so the trade-off between exploration and exploitation is essentially the trade-off between maximizing performance this is what we call exploitation and increasing your knowledge about the problem which we call exploration so this is a fundamental problem in online decision-making because you're actively collecting your data as I talked about in the last lecture this is different from getting a data set which you don't you have no means of changing anymore now we do we basically sample this is very related to research fields called active learning if you know about that where you're also in charge of collecting the samples but that's basically all I'm going to say about that these are very just be aware that these are very related if you happen to know anything about active learning if you don't that's perfectly fine it also means that the best long-term strategy may involve short-term sacrifices sometimes you want to try out try something that you think is not such a good idea you might not be completely sure you think it's probably a bad idea but you're going to try it anyway just to see or the alternative version of this which is similar but maybe feels a little bit different sometimes you might not pick the thing you think is optimal just to try something else even if you don't necessarily think that something else is bad so the main difference here is whether you think that the thing is bad that you're trying or not but in reinforcement learning is basically just means whether your rewards are high or low doesn't really matter you're going to pick something which has the highest reward you think right now or something that has a little bit of a lower reward and sometimes it makes sense to pick something that has a little bit of a lower reward at least according to your current estimates simply because you're not sure and it might be that you're wrong and therefore you want to learn more about the world so that means we want to gather enough information to in the end make the best overall decisions and we'll formalize what that means later so the setting to start with that is often called the multi-armed bandit setting and this is in an analogy to one-armed bandit if you happen to know that phrase if you don't it means it's basically a different way to talk about slot machines in a casino you have a slot machine there's a lever that you pull and then you either get some money or you don't this is sometimes called a one-eyed bat one-armed bandit because in the end in the long run it steals your money and then the 
multi-armed bandit you can basically think of has a long long row of these slot machines whereas each decision are corresponds to one of these machines and they may have different payoffs and in addition just to maybe be aware that we don't take the analogy too far in this case some of these slot machines might actually be a good idea so you don't necessarily lose money by playing playing these it's just an analogy but it's one that's very sticky so multi-armed bandit is a very common phrase for these problems and for me we just have a set of known actions it's just a discrete set at first there are exchanges in the literature two continuous sets where this is the real value but we are just going to consider consider a discrete set so there's just a limited set of these things you can you can consider a limited set of actions and whenever you try one of these actions and you can only try one on each time step you'll get a certain reward the reward here is assumed to be random so there's a distribution and what we're going to assume about this distribution in this case is that it's fixed but unknown so the fixed property of this distribution is basically what I meant when I said before we're going to get rid of sequential structure the distribution of the rewards will not be affected by your past action it's only affected by your current action so whatever you did in past doesn't matter if you're going to pick this action you'll get a certain reward with a certain probability and now we're going to formalize the goal as maximizing the cumulative reward but there is something that's importantly different from the previous lecture because in the previous lecture we talked about optimizing cumulative reward into the future but right now we're actually talking about not just within an episode but across the agents lifetime we want to accumulate all of the rewards and we basically want to maximize that cumulative sum what that means is that we're basically not going to give too much allowance for learning essentially we want the learning to be in some sense optimal that we don't want to lose too much in the beginning while we're learning to do better at the end in fact we're going to just fold all of these things into one and we're going to say you want to basically do as best as you can across your lifetime including the part where you're learning in fact we might assume that you'll learn the whole time you never really stop learning so this is a little bit different from your typical train tests split where you allow a train time to do anything basically you don't care what your losses at train time but a test time that's what it matters this is a little bit different where basically you're tested the whole time during training this is the common way to analyze these bandits settings and therefore we're going to stick to that here you could of course also imagine a different version where you have some a lot of time in which you can train and then you're tested but we're not going to consider that in this lecture so for those of you who are familiar with the game theory terminology sometimes this is called a game against nature because you're basically playing against the environment in some sense except the environment is indifferent so it'll just return the same distribution won't try to fool you or take advantage of you okay so this is the setting and now we can talk about as we did before about what the values of a certain action so in this case this is simply something that is dependent 
on the action there's no state you're always basically in the same state that's that's another way to think about it and then the true value of an action is just defined as the expected reward for taking that action now of course we can estimate this simply by taking the average whenever you pick an action you just average that reward in and then this will give you an estimate which is in some sense a pretty good one for the expected reward so I wrote that down down there using indicator functions the summations are over all of time that you've seen so far but you will not have taken this same action and every time step so there's an indicator function there that basically says I'm only counting the steps on which I selected this specific action it's just a little bit of notation and then we're dividing by the total number of times you selected that action and we're only counting the rewards for that action so it's a simple average but the notation maybe is a little bit verbose we could also of course compute this online basically storing the previous average and then just moving whenever you see new rewards you just move your average a little bit towards the new reward and if you do this with the step size down there where n now is the number of times you selected that action this will give you exactly the same result as before it will be the exact average of the rewards you've seen for this action the values the estimates this capital Q here subscript by T is no longer the true value it's the is your estimate at time T just to be aware so the true value is this small Q which has no subscript of T and then the big Q with a subscript T is our estimate now this formulation I I wanted to point out also because we'll see it much more often and it's a common way to write down more general update functions and one simple generalization that you can really consider here is you should you could consider a different step size for instance you could consider a constant step size in which case you won't have a flat average of all the things you've seen so far but the average will have it will be a weighted average with more weights to the recent things and this can be useful for instance if the assumption that your rewards distribution is fixed is not true if it slowly changes over time it might be much better to track rather than to get the flat average also this turns out to be much more useful when you do function approximation because then you can't really average things you have to just make small steps towards things as in typical when you do something with deep neural networks we typically have a step size there but we don't typically average we just have a step size that moves you a little bit towards whatever the answer is right now so again this is just a little bit of notation that will bump into more often in and then we can apply that for instance to this example so let's formalize the problem here by saying cheese is Aurora plus 1 getting a shock as a reward of minus 1 and then optimizing the cumulative reward basically means you want to get cheese as often as you can you don't want to get shocked and in this case then the action values after the experience is there on the left would be that the value of the white leaver with which I mean the leaver with the white background sorry for you and big ambiguity would be zero here because we had a plus 1 once and we had a minus 1 once and for the leaf with a black background would be minus 1 because we've only ever had a 1 minus 1 in this 
Now let's make the example a little bit worse. So far it made sense to select the lever with the white background again, because it is the only one that ever gave cheese. But what happens if you select it over and over again and it turns out to shock you every time? Say the rat has now pulled it four times in a row and each time got a −1. What should the rat do on the next time step? Anybody have a suggestion? Switch — it makes sense at some point to switch. However, if you only look at the estimated action values, the estimate for the lever with the black background is still −1, because the only data we have ever seen for that lever is a single −1. So a greedy algorithm, looking only at its estimates, would continue to select the lever with the white background. That obviously feels a little wrong: at some point our intuition says you should switch again. But how do you formalise that — why should you switch? That is what we'll talk about for basically the rest of the lecture, and we'll also ask whether we can devise an algorithm that switches between these options optimally. So how can we reason about this trade-off between exploration and exploitation? It seems natural to somehow take into account that estimates can be uncertain. In the example, the lever with the black background has an estimated value of −1, which is more or less the best estimate we can get from the data we have seen, because we have only ever observed a reward of −1 — but we have also only ever seen one sample, so we should be quite uncertain about that value. Can we take that into account, reason about it formally, and maybe even trade off exploration and exploitation optimally? To reason about this I'll introduce a little extra terminology, which is also very common in the bandit literature. First, we define the optimal value. In this case v* is not a function of anything — normally it would be a function of state, but there is only one state — so it is just a number: the maximum true action value. If you knew the true action values, you would just pick the maximum of those; that is the optimal value for the problem, the highest expected reward you can get. Next we introduce a new term, the regret, which is the opportunity loss on each step. Here q(A_t) is the true value of the action A_t you actually selected, so if you selected the optimal action the regret on that step is 0, and if you selected an action that is not optimal it is positive — it cannot be negative, by the definition of the optimal value. To give an intuitive example: in hindsight you might regret taking the tube rather than cycling, say, because there were many delays, but maybe you would have regretted taking the bus even more because it was completely gridlocked — and you might only know this after the fact. You could interpret each day as an independent sample: sometimes you try one, sometimes the other, it's a bit noisy and you don't know exactly which is best, but over time you learn that cycling actually gets you there fastest on average, even if it's not always exactly the same and the other options are perhaps much noisier.
An obvious property of this formulation — it's not really a problem — is that the agent cannot observe, or even sample, the real regret. So why introduce it? It turns out to be useful for analysing learning algorithms: we can look at an algorithm that trades off exploration and exploitation in a certain way, reason about the regret that algorithm incurs, and then reason about how that regret grows over time. The goal now becomes to trade off exploration and exploitation by minimising the total regret. Note that this is exactly the same goal as before — maximising the cumulative reward across the agent's lifetime — but the benefit is that we can now say things like: the minimum possible total regret is zero. We also know zero is not actually attainable, because that would mean you know the optimal action from the start and never select anything else; but at least we know what the ideal would be, which you don't know for the cumulative reward, since you don't know what the optimal achievable return is. And we know we want to minimise this quantity: the bigger it grows, and the faster it grows, the worse. That turns out to be useful to reason about. Again, note that the sum extends beyond episodes: it covers the whole lifetime of the agent, the whole learning process, so it factors in both the learning and the exploitation — you want to do well while you are learning. (Question.) Yes — the assumption is not that you can use this quantity in your algorithm directly; we only use it in the analysis. The agent never knows the optimal value: we know it when we analyse the algorithm, but the agent has no idea. It's a convenience for analysis. To see why it is convenient, consider two cases. In one, all rewards are zero or positive, which means your total cumulative reward grows unbounded, but you don't know towards what. In the other, maybe the maximum reward is zero and the others are negative, so at some point your total reward may stop growing and become stationary, because you have tried all the bad things and now know to select the action with zero reward. So sums of total reward might grow indefinitely, or might saturate at some point. For the regret, in contrast, we can talk about algorithms whose regret grows linearly with time versus — and this turns out to be the better case — algorithms whose regret grows sublinearly. Sublinear regret still grows, because you keep selecting actions that might not be optimal in order to explore, but it grows much more slowly than if you picked a wrong action every time. This distinction between linear and sublinear is harder to make in terms of cumulative reward than in terms of regret. So it is just for the analysis; the objectives are equivalent — we have only introduced some terminology that is more convenient to talk about later.
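To keep the quantities just described in one place, here they are written out explicitly (my transcription into LaTeX of the slide notation; A_τ denotes the action taken at step τ):

```latex
v_* = \max_a q(a), \qquad
\text{regret at step } t:\;\; v_* - q(A_t), \qquad
L_t = \mathbb{E}\!\left[\sum_{\tau=1}^{t} \big(v_* - q(A_\tau)\big)\right].
```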
On a related note, some of the algorithms I'll talk about later in this lecture do keep track of explicit uncertainty about what you believe the expected reward could be, and you can reason about how that evolves over time — I'll get back to that, so if your question remains afterwards, please ask it again; it's a good question. And always feel free to stop me whenever something is unclear: if it is unclear to you, it is probably unclear to more people, so please help me clarify. Okay. The regret will in principle grow unbounded, but what is more interesting is how fast it grows. To give an example: the greedy policy has linear regret, meaning that in expectation the regret grows as a function that is linear in the number of steps taken. To make that concrete, consider the example again. Suppose the actual probability of getting cheese when you pull the lever with the white background is 0.1 — only a 10% chance — and the rat just happened to have a lucky episode on that second time step. And suppose the probability of getting cheese when you pull the lever with the black background is actually 0.9, so the rat was particularly unlucky on the first time step. This could happen. The optimal value is then the expected reward for the lever with the black background: +1 with probability 0.9 and −1 with probability 0.1, so 0.8 in total. Because I flipped the probabilities for the lever with the white background, that one actually has a true value of −0.8. So the rat was very unlucky in its first two episodes and the value estimates are quite wrong. The estimate for the lever with the white background will keep getting better as it is observed more and more, but as noted before, the rat will never go back to pulling the lever with the black background: we have estimated that one as −1, and because we have seen cheese at least once for the white lever, its estimated value can never actually reach −1. So if you are greedy, you continue to select the lever with the white background even though, in this case, it is the suboptimal thing to do. The regret this rat incurs turns out to be 1.6 t — the gap between the optimal value and the value of the action it keeps selecting, accumulated over time — which indeed grows linearly in t (the arithmetic is written out below). This specific value is conditioned on the first two episodes being exactly as described, on the rat being a bit unlucky. In general you reason about this in expectation: there is a non-zero probability that this happens, so there is a non-zero probability that your regret grows linearly with time. In some cases the rat will be lucky and the greedy policy will lock onto the truly optimal action, so the regret won't keep growing; but because it can grow with non-zero probability, the expected total regret still grows linearly. Is that clear?
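To make the arithmetic in that example explicit (using the probabilities stated above):

```latex
v_* = 0.9(+1) + 0.1(-1) = 0.8, \qquad
q(\text{white}) = 0.1(+1) + 0.9(-1) = -0.8, \qquad
\Delta = v_* - q(\text{white}) = 1.6 \;\;\Rightarrow\;\; L_t \approx 1.6\,t .
```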
Now we can talk about the regret in a slightly different way, which turns out to be useful, by looking at the action regret. The action regret is the expected regret when you take a given action, and we introduce the notation Δ_a for it: it is the gap between the true value of that action and the optimal value. This Δ is 0 for the optimal action and positive for any action that is not optimal. Using this notation we can rewrite the total regret over time. The first sum, on the left, is over all time steps: we consider every action you ever selected, compare its true value to the optimal value, and add the corresponding gap to the regret each time. We can rewrite this as a sum over actions by taking into account how often each action was selected — this is clearly the same thing: for each action you count the number of times it has been selected up to time t and add up the gaps that way. The second sum is over a typically much smaller set, just the actions rather than the time steps, but the two are completely equivalent. Plugging in the definition of the gap, the total regret can then be written as a sum over actions of the number of times an action was selected times the regret for that action. That is probably quite intuitive, but it is good to be sure you are following, because it lets us reason as follows: a good algorithm should ensure small counts for any action with a large gap. If an action has a small gap — it is close to optimal, or optimal — then you don't particularly care that its count is large; for the optimal action you in fact want the count to be large, because that makes the total regret smaller. This is useful in the analysis too: for every action we want at least one of the two factors to be small — if the gap is big, we definitely want the count to be small — and the question is whether we can ensure that somehow. Again, the actual regrets are not known to the agent; this is for the analysis. But first, consider a much simpler approach, one we also saw in the first lecture: just add a little randomness to the policy, by picking a uniformly random action with a certain probability on each time step. This is maybe the most common exploration strategy in current-day reinforcement learning, including in big applications — it was used for the Atari results with the DQN algorithm — and it turns out to be quite effective even if it is not the most sophisticated exploration method. It is called epsilon-greedy and works very simply: with probability 1 − ε you select the greedy action, and with probability ε you select a uniformly random action. The random action can also turn out to be the greedy one, so the probability of selecting the greedy action is actually slightly higher than 1 − ε, because you also have to factor in the chance of picking it during the random selection. A quick sketch of this selection rule follows below.
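A minimal Python sketch of epsilon-greedy action selection (the function and variable names are mine, for illustration only):

```python
import numpy as np

def epsilon_greedy(q, epsilon, rng):
    """Select the greedy action w.p. 1 - epsilon, otherwise a uniformly random one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))  # the random draw can also hit the greedy action
    return int(np.argmax(q))              # note: np.argmax breaks ties by lowest index

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.0, -1.0]), epsilon=0.1, rng=rng)
```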
Now the question is: is this enough, and how do we pick ε? We already saw that the greedy policy can lock onto a suboptimal action forever — that was the rat example. The epsilon-greedy algorithm, with a fixed ε, continues to explore, which essentially means it will eventually find out everything there is to know about all the actions: all of your action-value estimates become accurate in the long run, which is good — at some point it will learn which action value is truly highest. However, if your ε really is constant, this also means you have linear expected regret, simply because there is a fixed probability, the same on every step, of selecting suboptimal actions. So the total regret again grows linearly, much as in the greedy case. That doesn't mean the two algorithms are equally good or bad — I would say epsilon-greedy is better than greedy in the typical case, because you do eventually learn and will select the optimal action with probability at least 1 − ε, which is not guaranteed for greedy — but the regret still grows linearly, so you still keep losing something. And at some point it seems unnecessary: if you essentially know the true values of all the actions, because you have selected each of them so many times, do you really need to keep trying the action you are pretty certain is bad? The answer is no — you can do better. A couple of decades ago people investigated this problem and asked what the best possible thing you could hope for is, and it turns out you can say something about that. It is related to the action regrets we just discussed, and to the similarity between the optimal action and the other actions: the hard problems are the ones where the actions have similar reward distributions but different means. The reason is that if the reward distributions of different actions are similar, they are hard to distinguish, so it takes longer to find out which action is truly optimal. This is described formally with the KL divergence — if you don't know what that is, that's fine (I think it was in the entry quiz); it is basically a measure of dissimilarity between distributions. Lai and Robbins proved that the total regret, asymptotically as t goes to infinity, is at least the quantity on the right-hand side of the slide (reproduced below). That quantity has a sum over actions with the action gaps in it, divided by the KL divergence between the reward distributions, but the most important part is the log t: the total expected regret of any algorithm grows at least on the order of the logarithm of time. So the regret still grows unbounded — which is expected, because you are never completely sure and may have to keep exploring — but it grows a whole lot more slowly than linearly: log t is much smaller than t for large t. The question then becomes: can we find an algorithm that attains this bound, or at least gets close to it?
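The Lai and Robbins result described above is usually written as follows, where R_a denotes the reward distribution of action a, R_{a*} that of the optimal action, and KL the Kullback–Leibler divergence (my transcription of the standard statement):

```latex
\lim_{t\to\infty} L_t \;\ge\; \log t \sum_{a\,:\,\Delta_a>0} \frac{\Delta_a}{\mathrm{KL}\!\left(\mathcal{R}_a \,\middle\|\, \mathcal{R}_{a_*}\right)} .
```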
Remember this is a lower bound: it doesn't tell you which algorithm to run; it says that whatever algorithm you run, you will incur at least this much regret. Before we get to a concrete algorithm that might attain it, let's talk about the intuition behind what you could do, and I have already talked a little about uncertainty. Assume we have some measure of uncertainty. Here are three different actions: we are fairly certain about the sharply peaked one in red, less certain about the one in the middle in blue, and really uncertain about the green one at the bottom. I haven't specified how to get these uncertainties — I'll talk about that later — but for now assume we have them. Which action should we pick? The argument is that if you are more uncertain about the value of an action, it may be more important to explore that action. But let's zoom in. Maybe at first we are a little greedy and try the action with the highest mean, because it looks quite promising — other actions might be optimal too, but this one has a good shot at being optimal as well. Say we select it and get a reward that is a little on the low side of what we might expect. We then update the distribution — I'm not saying how; this is just intuition — shifting it a little towards that outcome and perhaps shrinking it a little, because we get more certain about where the actual mean is the more samples we get. Now we are in a new situation: the red distribution has shifted a little, and maybe it now looks more promising to select the green one for once, because there is quite a bit of probability mass on its right-hand side — the green action's true value might be 4 or even higher, whereas for the red action that is now very unlikely. So we select the green action and observe something that is in fact higher than the green action's current mean, so maybe we were underestimating its value a little, and we shift that distribution. This is just to give you an intuition, but what we were actually doing — especially in the second step, when selecting the green action — was being a little optimistic about which actions might still be good. One way to formalise that is to define an exploration bonus, which we call an upper confidence, for each action value, chosen so that with high probability the true value is smaller than your current estimate plus that bonus. You take your current estimate, acknowledge that you are uncertain whether it is correct, and add a bonus large enough that you are fairly sure the true value lies below the total. If your current estimate is, say, zero, you might add a bonus of ten and claim: I don't know the true value, but I am pretty certain it is less than ten. Then the algorithm — I'll define the bonus in a moment — simply selects greedily among the actions, but using the estimate plus the bonus rather than the estimate itself. The intuition is that this bonus should be bigger for actions we are more uncertain about.
In the simple bandit case we can do that by keeping track of the number of times each action was selected. The intuition: if we haven't selected an action often — if its count N is small — the bonus should be big, so that we can still be fairly sure the true value lies below estimate-plus-bonus; if we have selected an action many, many times, a big bonus would overstate our uncertainty — we are not actually that uncertain about its value any more — and a small bonus is enough. What this means is that we will sometimes select an action even if its current estimate is low, just because we haven't selected it often. In particular, if some other action has a higher estimate, you may select that one greedily quite often, but its bonus shrinks as you do, until estimate-plus-bonus for the neglected action overtakes it and you select that other action, just to try it out — that's the exploration. For ordinary flat averages, as discussed before, the uncertainty about where the true value lies typically decreases as one over the square root of the number of times you selected the action; this is just the central limit theorem. One small caveat: this assumes the variance of the reward is bounded. That is typically the case, but there are distributions, like the Cauchy, whose variance is unbounded, and then you basically have no hope — but that is very rare. Using this intuition, can we derive an optimal algorithm? The idea is as follows; recall that we want to minimise the total regret, written as the number of times each action is selected times how bad that action is — its Δ. (Question: what should the upper bound be at t = 0?) Very good question. What is typically done is to treat the bound as infinite for every action at t = 0, which in practice means you select every action once, and only then does the formula kick in, because then you have an estimate for each of them. This is also relevant for the assignment, because you will implement something like this — rather than implementing infinities, it is easier to just select each action once first. So the idea of the algorithm is: we want to minimise the product of how often an action is selected and the regret of that action. If the gap is big — the action is really quite bad compared to the optimal one — we want the number of times we select it to be small; conversely, if the number of times we selected an action is big, we want its gap to be small. We want the algorithm to have these properties. Note that not all counts can be small: they must sum to the total time, since you select an action on every step, so the sum of the counts grows over time. All we can hope for is to select low-gap actions more often than high-gap actions. To reason about this — and this is what is used in the analysis of these algorithms — there is a result called Hoeffding's inequality. Some of you may know it; it belongs to a more general family of results called concentration bounds, and it lets us say something about how an estimate behaves without characterising the whole distribution.
In this case, we assume we have some random variables — think of them as your rewards — and we assume they are bounded. (There are other inequalities you could use when you only assume, say, bounded variance, but here we simply assume bounded rewards, between 0 and 1 for the simplicity of the theorem; they don't have to be between 0 and 1, you could have different bounds.) Say we average these, as we were doing before, so our current estimate of the expected reward is just the average reward seen so far. Then Hoeffding's inequality says something about the expected value of these random variables (they are i.i.d., so we can just pick any one of them): the probability that the true expectation is bigger than your current estimate plus some bonus u is bounded by the quantity on the right-hand side. What does that mean? It means that if you average a number of samples and add a bonus u, you can bound the probability that this mean-plus-u is still an underestimate — and the bigger you pick u, the smaller that probability. If you pick a really big bonus, the statement is almost trivially true: the rewards are bounded, so if you added 1 you would be certain, in some sense, to be overestimating. The bound on the right-hand side is a function of the number of samples n of this random variable and of the bonus, and it decreases in both. For a fixed bonus, the more often you have sampled, the less likely you are to be that far off — your estimate will be within u of the true value after a while. And for a fixed number of samples, the larger the discrepancy you consider between your current estimate and the true value, the smaller the probability — the higher you pick your bonus, the less likely it is that you are still underestimating. The idea is now to apply this inequality — which holds in general under the assumptions of the theorem — to the bandit case with bounded rewards. If, as I said, the rewards are between 0 and 1, we can bound how far off our estimate is: the estimate is the big Q_t, we add some bonus (which I haven't defined yet), and we bound the probability that the whole thing is still too low. How do we use this to define an algorithm? Say we want to pick a certain probability p that the true value exceeds the upper confidence bound — that is what we call the estimate plus the bonus. We can then invert the quantity on the right-hand side of Hoeffding's inequality: pick a p and solve for the upper confidence bound, which turns out to be the square root of −log p divided by 2 times the number of times you selected that action. (This is an intuitive derivation, not a full proof.)
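Written out, the inequality and the inversion just described look like this, for i.i.d. X_1, …, X_n in [0, 1] with sample mean X̄_n (my transcription of the standard statement):

```latex
P\big(\mathbb{E}[X] > \bar{X}_n + u\big) \;\le\; e^{-2nu^2},
\qquad\text{so setting}\quad e^{-2 N_t(a)\,U_t(a)^2} = p
\;\;\Rightarrow\;\; U_t(a) = \sqrt{\frac{-\log p}{2\,N_t(a)}} .
```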
Now let's pick that p so that it decreases over time: we want the probability that our estimate-plus-bonus is still an underestimate of the true value to decrease over time, for instance as 1/t. If you do that and plug it into the bound, you get the square root of log t divided by 2 times the number of times you selected that action. The factor 2 turns out not to be that important; the main things are the log t and the count, both inside the square root. What does choosing the exploration this way do? The probability that we are wrong — that estimate plus bonus is still below the true value — decreases over time but never actually reaches 0, so we continue to explore indefinitely; but eventually we almost lock onto the optimal action and select it more and more often, because our estimates get more and more accurate and we become more and more certain that they are close to the true values. This leads to a concrete algorithm, where the bonus is now defined and the 2, which wasn't important, is pulled out and turned into a parameter c: to your estimate Q you add c times the square root of log t (your time step) divided by the number of times you selected this specific action. The numerator, log t, doesn't depend on the action: it just grows, slowly, logarithmically, over time. What does this mean in practice? Consider an action you haven't selected in a long while. Its bonus keeps growing because of the log t, and if you never select it for a very long time the bonus grows and grows until estimate-plus-bonus for this action is higher than for all the other actions — at which point you select it. When you select it, its count goes up, so the bonus drops. At the same time you get a new sample, so your Q estimate for this action also changes: it might go up or it might go down. Maybe you were underestimating its value, the estimate goes up, and it becomes more likely that you select this action again in the future; or maybe your estimate was accurate, the reward is very much in line with it, nothing changes much, and the bonus — which crept up slowly with the logarithm and then got knocked back down when you selected the action — ensures that it again takes quite a long time before you select it again. So what the log t essentially does is slowly bubble up all of the actions: every action gets selected indefinitely often, because the bound keeps growing, but whenever you select one you whack its bonus back down, and only if its actual estimate is high do you continue to select it. This is an important property: you select an action either when you are uncertain about it — its count is small relative to the other actions — or when its estimated value is high.
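Here is a minimal Python sketch of this UCB rule (the function name, the tie-breaking, and the try-each-action-once initialisation are my own choices for illustration):

```python
import numpy as np

def ucb_action(q, n, t, c=1.0):
    """q: value estimates per action, n: selection counts, t: total steps so far,
    c: exploration parameter (e.g. around 1, or sqrt(2) as in Auer's UCB1)."""
    untried = np.flatnonzero(n == 0)
    if untried.size > 0:
        return int(untried[0])               # try each action once instead of using infinities
    bonus = c * np.sqrt(np.log(t) / n)       # grows with log t, shrinks with the count N_t(a)
    return int(np.argmax(q + bonus))
```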
That is a general property we want: select the actions that we know are good with pretty high probability, or that we are very uncertain about — and this algorithm has exactly that property. It turns out you can analyse what this algorithm actually does; Peter Auer did this in 2002 and proved a specific bound on the total regret. The previous bound we discussed was a lower bound on the regret; this is an upper bound on the regret of this specific algorithm, and interestingly it is also of order log t, which means that as far as the dependence on t is concerned this is optimal: you cannot do better than log t — that is what the other bound proved — and this algorithm attains it. In practice the quantity c can be treated as a hyperparameter. It ties into the shape of the reward distributions, and sometimes it is better to tune it a little and see whether, for your specific problem, you can find a better value than the default of, say, √2 that Auer suggested. One way to think about c is that it corresponds to whether you consider something like a 95% or a 99% confidence interval; the actual confidence changes over time because of how the bound is set up, but it is related to that. Typical values are somewhere around 1/2, 1 or 2; it is a little hard to intuit exactly what a given value means, but if you play with it you will find that certain values simply perform better on certain problems, and it is good to start around 1. (Question.) This is a bound: we are saying there is a certain probability of being far off, for each action. So what does p mean here? Either your estimate plus bonus is an overestimate of the true value, or it is an underestimate — there are only these two possibilities — and we are essentially reasoning about whether the true value falls within that confidence interval or not. We want the bound to be wide enough that we are pretty certain the bonus is meaningful, in the sense that estimate-plus-bonus lies above the true value, but we also don't want to be more uncertain than we need to be, so we don't want this probability to be too high or too low. In that sense it is properly normalised. (Another question: does the [0, 1] assumption matter?) Not really — the more general statement of Hoeffding's inequality has bounds a and b, which just show up in the formula; I left them out for simplicity, since this is the simplest statement of the theorem, but it generalises easily to other bounds, and there are similar concentration inequalities for other cases, for instance when your random variables aren't bounded but you know their variance is finite.
Then you can get a similar statement; this is just the simplest one, in a sense. (Question about c.) We typically don't change c over time. What we did is pick the probability p, which does change over time, but that turns into the constant c because the rest of the expression already changes over time: the change over time is captured in the log t and in the counts N, and the c parameter we typically just pick once and stick with. So you don't have to decay c. This is different from epsilon-greedy: as I mentioned, a constant ε — constant random exploration throughout learning — is maybe not a good idea, because you keep selecting things you eventually know are not good. You could also decay that probability over time, and people often do, but then you have to pick a decay schedule and be careful how you pick it; it turns out that if you are careful about how you decay ε you can also get logarithmic regret, though I won't cover that in this lecture. This algorithm does it automatically, because of the log t and the division by the count; the c just regulates how quickly you learn — for essentially any c you eventually get there, but the constants, how quickly you get there, may change, and I'll show an example of that a little later. The algorithm is not hard to implement: you only need the number of steps so far, a count for each action, and the average reward for each action. Interestingly, the algorithm then picks deterministically — assuming the argmax doesn't break ties randomly but, say, picks the first action — so on each step it deterministically picks the action that is, in some sense, the optimal choice for the exploration/exploitation trade-off given all the information so far. (There was a question at the back.) Yes — the lower bound I showed before is a very general statement that no algorithm can do better than logarithmic regret; I won't go into how it is derived. For the upper bound here, essentially what you can do is take this algorithm, plug it into Hoeffding's inequality, and use that to reason about how the counts change over time. It turns out that the number of times you select an action under this algorithm depends on how suboptimal that action is — on its action regret — in such a way that the product of count and gap is always at most logarithmic in t, for every action. Either the gap is very small (the optimal action has a gap of zero, so you don't mind selecting it indefinitely often), or, if the gap is big, the number of times you select that action turns out to be small — only logarithmic in the total number of selections — and this holds for all actions with this specific choice of bound. If you want to see the proof, which is not actually that complex once you have Hoeffding's inequality, you can go to Auer's paper on UCB.
There he proves exactly the statement at the bottom of the slide, using Hoeffding's inequality. It is a bit technical, with a lot of steps involved, which is why I didn't want to put it in the slides, but given Hoeffding's inequality it is not that hard to step through. There have been many follow-ups with different assumptions — going back to the question about why rewards in [0, 1]: you can make many other assumptions about the distributions, and for specific assumptions you can prove slightly stronger things — but it never gets better than logarithmic in time unless you make very strong assumptions about your problem. This was a landmark result because it was the first algorithm that actually achieves that bound. In the previous lecture I talked about what an agent can represent internally: you can represent values and use them to come up with actions, which may be a good way to find good policies; you can also represent policies directly, which I'll talk about a little later; and you can learn a model. Learning a model in general is something we won't touch on much in this course, partly because it can be quite tricky to learn a true model — the dynamics of the world, the way the world changes, can be very hard to learn, and it can also be very hard to know what to focus on. But in this case, because there is no state and there are no observations, only rewards, learning a model is perhaps much simpler. We could have a reward model that predicts the reward for each action — but that looks very similar to the value-based approach; in fact the two might be indistinguishable here, since you can interpret the action-value estimates as a reward model: they are just predictions of the expected reward for each action. That doesn't mean we have exhausted what a model-based approach can do, though. In particular, we could model more than just the expectation — we could model the distribution — and this gives us Bayesian bandits. The idea is to model a parametric distribution, where the probability is a Bayesian probability: a belief about where the true expected reward lies. (Sorry — I should have written it the other way around on the slide: we are going to model a belief over where the true value is, not over where the rewards are.) We update this belief using the probability of the observed reward under the model. Here θ represents the parameters of your parametric model: for a Gaussian, this could be the mean and the variance, two numbers that together define the Gaussian, which can serve as a model of where you think the true value lies (for a multivariate Gaussian it would be vectors and matrices, but in the simple case it is just two numbers). You can then use Bayes' rule to update these distributions, basically starting from a prior p_0 and updating the model parameters each time you see a reward for the action you took.
We keep things separate for each action — everything is conditioned on the action — so we simply have a separate model per action and update it each time we see a data point for that action. (Question.) Yes — what we are actually modelling, and this will be in later slides as well, is the probability we assign to the true value being at each point. Using the data, we update this belief distribution over where the true value q is; we are not trying to match the distribution of the rewards themselves. What does this give us? One thing is that it allows us to inject rich prior knowledge. Say you have a Gaussian: the parameters θ are the mean and variance, and you might place the mean somewhere reasonable — maybe a different initial mean for different actions, if you have prior knowledge about them — and pick the variance wide enough that you are fairly confident the true value lies within the distribution, but not too wide. We can then also use these distributions for exploration, and I'll discuss basically two ways: one is again upper confidence bounds, but now using the explicit representation of the distributions, and the other is probability matching, with Thompson sampling falling under that header. Let's make this concrete with an example: a bandit with a Bernoulli reward distribution, which simply means the rewards are either 1 or 0, with an unknown probability. The expected reward is then exactly the probability of getting a 1, and we want to model a belief over that probability. For instance, we could pick the prior for each action to be uniform on [0, 1], which basically means we have no specific preference: we consider every value of this probability equally plausible in advance — we are not saying that a success probability of, say, 0.6 is any more likely to be the truth than, say, 0.2. If your rewards are Bernoulli, it is natural to model this belief with Beta distributions, and the uniform prior is equivalent to a Beta distribution with parameters 1 and 1. If you don't know what a Beta distribution is, it is just the probability density I'll show on the next slide; its two parameters essentially count how often you have seen a 0 and how often you have seen a 1 — confusingly, each offset by one — so the uniform case, Beta(1, 1), corresponds to no observations yet. Updating the posterior is then very simple: whenever the reward is 0 you increment one parameter, and whenever it is 1 you increment the other. You are basically counting, for each action, how often you saw a 0 and how often you saw a 1. That is nice, because Bayesian updates can in general be quite involved.
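A minimal sketch of this counting update in Python (the array names are mine; alpha[a] − 1 and beta[a] − 1 count the 1-rewards and 0-rewards seen for action a, so Beta(1, 1) is the uniform prior):

```python
import numpy as np

num_actions = 3
alpha = np.ones(num_actions)   # Beta parameter: 1 + number of 1-rewards per action
beta = np.ones(num_actions)    # Beta parameter: 1 + number of 0-rewards per action

def update_posterior(action, reward):
    """Bayesian update for a Bernoulli bandit with a Beta prior on each action."""
    if reward == 1:
        alpha[action] += 1
    else:
        beta[action] += 1

posterior_mean = alpha / (alpha + beta)   # current point estimate of each true value
```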
In general you might have to compute integrals, but in this case, because we have a simple, fixed family of distributions, it is particularly easy to work with: we update the posterior simply by updating the parameters of the distribution in this way. How does that look? Suppose we start all the way on the left: we have no information and make no judgment calls, so we say all of the possible true values are equally likely — that is what the leftmost picture means. The y-axis is the probability mass we assign to each possible true value, and the x-axis spans the possible true values, from 0 to 1; we know the true value lies between 0 and 1, because the rewards are 0 or 1, so the expected reward must be in that interval and there is no probability mass outside it — but within the interval we make no judgments yet. Now suppose, for a single action, we see the reward sequence +1, +1, 0, 0, and consider how this belief distribution updates. After the first +1, the distribution shifts towards 1 — not all the way, unlike the greedy case; we are not committing and saying the value must be 1, just that values closer to 1 have become more probable. The density is exactly zero at 0, and the probability that the value is close to 0 is also fairly small. After the second +1 the distribution becomes even more skewed, because we now have some evidence that the value is more likely high than low — but it is still fairly spread out: there is still non-negligible probability mass on values below 1/2, much smaller than on values above 1/2, but still quite some mass. If at that point we see a reward of 0, the distribution shifts: the true value can no longer be 1, so the density drops to zero at that end (we already knew it couldn't be 0, since we have now seen rewards of both 0 and +1), and the mean is somewhere in the middle, still a little closer to 1 because we have seen more ones than zeros. After another 0 the distribution becomes symmetric. You can see there is still quite a bit of uncertainty: we don't yet commit strongly to where the true value is; we just know it is not 0, not 1, quite unlikely to be very close to either, and somewhere in the middle — we really just don't know — and that is exactly what is captured in this Beta distribution when you update the parameters as on the previous slide. What can we then do, given such a distribution for each action? One option is very similar to what we did before with UCB, except that instead of deriving the bound from a concentration inequality, you look at the distribution itself and define your confidence interval on that. For instance, a simple way to do it is to define an upper bound related to the standard deviation of your posterior (a sketch of this idea follows below).
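A hedged sketch of that idea for the Beta posteriors above (the constant c and the use of the posterior standard deviation as the bonus are my assumptions for illustration; one could equally use a posterior quantile):

```python
import numpy as np

def bayes_ucb_action(alpha, beta, c=2.0):
    """Greedy with respect to posterior mean plus c posterior standard deviations."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
    return int(np.argmax(mean + c * np.sqrt(var)))
```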
This works especially cleanly if your posteriors are all Gaussian, because then the standard deviation is enough in a sense — you don't have to capture the shape, which in general might be lopsided; for other distributions there are different ways to pick these quantities. Then you do the same thing as before: take the mean and add a bonus, which is here related to your uncertainty about where the true value is. That is one algorithm you could use. More common, when we model these distributions explicitly, is something called probability matching. Here we do something that is maybe a little unintuitive — or very intuitive, depending on how you look at it: we select an action according to the probability that it is optimal. That means we define a policy, and we make this policy equal to the probability that the action's true value is the maximum true value given the history, where the history has been used to update these Bayesian distributions. This probability is again a Bayesian probability — in truth each action either is the optimal action or it isn't — but under your beliefs about where the true values lie you can reason about it, and in principle solve for the probability that each action is optimal under the distributions you have updated so far. It can be difficult in general to compute this analytically. Note, by the way, that uncertain actions will tend to have a higher probability of being the maximum, because a lot of their probability mass may sit in the tail, so you are more likely to select them — which is exactly the property we are after. Because this can be unwieldy, there is another thing we can do, called Thompson sampling, which — if I remember correctly — dates from 1933, so it is pretty old. We sample from each of these posteriors: for each action we independently sample one value from its belief distribution over the true value, and then we select greedily among those samples. This means that if you are very uncertain about an action, there is a decent probability that you sample a candidate true value that is really high, and also a decent probability that you sample one that is really low — so you won't select that action all the time, but you will select it every so often. Again: either your uncertainty must be high, or your current mean value must be high, for an action to get selected. It turns out that doing this is probability matching in expectation, and since you have to randomly commit to a single action on each time step anyway, you lose nothing by doing it this way. And it turns out that for the Bernoulli bandit case we just discussed, Thompson sampling also achieves the optimal (logarithmic) lower bound on the regret, so in that sense it is an optimal thing to do.
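For the Bernoulli/Beta case above, Thompson sampling is only a few lines; a minimal sketch, reusing the alpha and beta arrays from the earlier sketch:

```python
import numpy as np

def thompson_action(alpha, beta, rng):
    """Sample one candidate true value per action from its posterior, then act greedily."""
    samples = rng.beta(alpha, beta)   # one independent draw per action
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
# action = thompson_action(alpha, beta, rng)
```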
Why is this a little surprising? Because we are selecting actions according to our belief that their true value is the optimal value, but that does not mean we explicitly take into account how the regret changes over time, or how much we learn from picking an action. So in some ways it is a little surprising that this algorithm works so well — and in practice it does work quite well — but it does require you to get these posterior distributions from somewhere, which can be a lot of overhead in complex cases and might be hard to do in general. If you can get them — and for Bernoulli bandits it is quite easy — then you can use Thompson sampling and it will be essentially as good as you can be, in a sense. So what does Thompson sampling not do? What it does not explicitly reason about is the value of information. We could go further and reason about the learning process itself, because we know what that process is: we are going to get a reward, we are going to update certain internal estimates of the agent — whether a model or a value — and we can reason about what we learn from each interaction with the world. Potentially we could reason all the way through this, taking into account the knowledge we have so far. This is sometimes called the value of information: we can plan through the whole tree of potential future lifetimes the agent might go through, and maybe plan through it optimally under the assumptions we can make about the problem, quantifying how useful it is to select a certain action by planning through how much we would learn from it — depending on the outcome — and how that would change our later actions. If you want to quantify the value of information, it makes sense that you gain more information by trying something you are uncertain about; if you are already very certain about an action's value, you gain basically no information by selecting it again — you already knew — so that is sometimes not the optimal thing to do. According to the value of information it can make more sense to try uncertain things more, and if we know the value of information we can trade off exploration and exploitation by quantifying these values in terms of future expected reward. What is the value of information, essentially? It is how much I can gain later if I gather this information now: I try something uncertain, it gives me some knowledge, and I can reason about how much I can exploit that knowledge later and how much that gains me. So far we have viewed bandits as one-step decision problems — every episode lasts exactly one step and there is no interaction between episodes — but we can also view the bandit as a sequential decision problem, by reasoning across the possible futures while taking our own learning process into account. Rather than planning through what the environment might do, we are now planning through what might happen inside the head of the agent, in a sense. How do you do this? You would define an information state summarising all the information accumulated so far.
Then, for each action, selecting it causes a transition to a new information state — you add some information — with a probability you can define. If you have a probability model of your rewards, you can reason through this without actually executing an action: for each action you can say, if I were to select it, I believe with such-and-such probability I would get a +1 and with such-and-such probability a 0; then I can reason through the resulting internal states of the agent and through what the agent would do next in each of those cases. In each case you might select a different action, which again might give you a +1 or a 0, and so on — you get an expanding tree. In addition to the random outcomes of the actions, you can also reason through how likely you are to select each action if you are using a probabilistic method: with UCB the action selection is deterministic but the outcomes are still random, so you still get an expanding tree of possibilities, but if your policy is also stochastic, as it is for Thompson sampling, then each step additionally branches over the number of actions. This becomes a huge tree at some point and is very hard to reason through, but if you can make certain assumptions, you can reason through the whole thing. Essentially, what we have here is a Markov decision process, which may or may not be known: for each internal model, and for each way of selecting actions, you get a different Markov decision process, in a sense, and you can then try to solve it. So even in bandits, previous actions do affect the future — not because they change anything in the world, not because they change the state of the environment, but because they change your knowledge. For instance, for a Bernoulli bandit, with μ denoting the success probability of the Bernoulli distribution (which here is also just the value), the probability of getting a 0 is 1 − μ — think of this as winning or losing a game with that probability — and the information state can simply be, for each action, how many times you got a 0 and how many times you got a 1. The prior knowledge we are putting in is essentially that the rewards are 0 or 1, and then we can reason all the way through. As I said, we can formulate this as an infinite MDP that extends indefinitely into the future, where we reason about: if I do this, then this might happen, and that changes my mind for the next step, and so on. You could also solve this with model-free reinforcement learning — you don't have to plan through everything; you could just sample and reason through it that way, which has been done in the past — or you could use model-based reinforcement learning to reason through how these information states change over time. This becomes unwieldy because the tree can be quite big, but in the next lecture we will talk about things like dynamic programming techniques, where you don't actually build the whole tree
The latter, model-based approach is known as Bayes-adaptive reinforcement learning. Of course you still need to define a prior distribution: you need to put in the prior knowledge that you have about the problem, what you assume at the beginning about where the true values lie, and your solution will be conditioned on that. If you pick the right prior this can be quite good, in some sense optimal given your prior, but it can be very unwieldy and it is unclear how well it scales, especially later when we consider problems which are not bandit problems, where there is also an environment state and there are observations. So we won't go much more in depth into this; the main thing to take away is that you can reason through your learning process if you want to. Whether it is a good idea is more of an open question.

Okay, now I want to change gears again. We have talked about values and about models; now I want to talk about policies. This will be important for several reasons. One is that it will show up in the assignments, which may or may not be important to you. Another is that this is an approach that does scale: whereas the previous one, reasoning through your whole learning lifetime, might not scale as easily, this is something that can be applied, and has been applied, to problems at scale. So what is the idea? We want to learn a policy directly. We are not necessarily going to learn a value function, and we are not going to learn a model; we are just going to parameterize a policy and try to learn that, and the question is how to do that. For instance, we can define a probability distribution over the actions as shown; this is called a softmax, or sometimes Boltzmann, distribution. The lower part is just a normalizer, and it means that your probability of selecting an action is proportional to the exponentiated preference. These H (I don't particularly like the letter H for this, because we also use it for history, but it is the letter used in Sutton and Barto, so I wanted to be consistent with that) just mean preference, not value. These H's do not have to correspond or converge to the value of selecting that action; each is just a number that tells you how much you prefer this action over the others. So we have parameterized the policy by assigning a number to each action and defining the policy this way. This is a stochastic policy: the probability of selecting a certain action is between 0 and 1, and if the preference for one action is much higher than for all the other actions, the probability of selecting that action goes to 1 in the limit while the probability of selecting the others goes to 0. We are going to view these preferences as learnable parameters, and then reason about how we can optimize them.
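Written out explicitly, the softmax policy described here is the following, with H(a) the preference for action a and the sum in the denominator running over all actions:

```latex
\pi(a) \;=\; \frac{e^{H(a)}}{\sum_{b} e^{H(b)}}
```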
Yes, so the question was whether these preferences have the same relative relationship to each other as the values would have, and the answer is no. One way to think about that: you might know that a certain action needs to be the optimal action. There are multiple ways to achieve that here, but the only thing you need is that the preference for that action is much, much higher than the preferences for all the other actions. The relative ordering of the other preferences does not even have to be the same as for the values; all we need is that their preferences are much lower than for this one other action. So even the ordering of the preferences might be different, which would be slightly wrong, but if one action completely dominates because its preference is much, much higher, that ordering does not actually show up in your policy, and the policy doesn't care. One other thing to note is that you can add any constant to these preferences and it disappears from the probabilities: you can shift them all up or down and nothing changes. This is fairly easy to show, and I won't show it here, but it again illustrates that these things do not have the semantics of a value function.

Sorry, what? Do you mean how we can learn the preferences, or what they mean? So, what influences the preference of an action if it is not the value? I am going to show you an algorithm that updates these preferences directly, and basically what I am saying is that it will not be equivalent to learning the values: these preferences won't necessarily go to the values, and we won't even learn the values. What we will do is update the preferences so that the resulting policy attains higher value. So in some sense the values do influence these preferences: we are going to sample rewards, those rewards will be higher or lower, and the preferences for certain actions will go up or down depending on them. So the rewards definitely still influence your preferences, and the true values still influence your preferences in the sense that, eventually, we want the preference of the optimal action, the action with the highest value, to also become the highest preference. But there is no direct tie in the sense that we could write these preferences down as a simple function of the values; it also depends on the algorithm you run to update them. Here I am just parameterizing: I haven't yet given you an algorithm to update these preferences, I am just saying this is one way to write down a parametric policy, parameterized through these preferences.

So this is a concrete instantiation of a parametric policy, and the question is how we can learn these things, both in general and for this one specifically. I will talk about both, but first the general idea. The idea is to update the policy parameters, which might for instance be those preferences, so that the expected value increases. For instance, we can consider gradient methods, because we have a lot of knowledge about how to do gradient descent on things, so this seems like a valid and good algorithm to try. And what do we then want to do? We want to do gradient ascent on the expected value; ascent rather than descent, because we are not reducing a loss, we are increasing a value. In the bandit case this is the algorithm we want: we have some policy parameters, which I have now called theta (just think of this as a vector, which might for instance be all your action preferences in the case we discussed before, although I am talking about a slightly more general case here), and we add to these parameters a step size times the gradient of the expected reward, conditioned on those parameters.
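As an equation, a sketch of the update just described, with theta the policy parameters and alpha the step size:

```latex
\theta \;\leftarrow\; \theta + \alpha \, \nabla_{\theta}\, \mathbb{E}\!\left[\, R \mid \theta \,\right]
```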
I did not put it on the slide explicitly, but these parameters of course influence your policy; the previous slide gave a concrete example of that, and here I have just written it generically. The expectation of the reward is conditioned on the policy parameters because those parameters define your policy, and therefore the expectation changes if you change them. So these are the policy parameters we want to update, but now the question becomes: can we even compute this gradient? We have the gradient of an expectation, but we do not have that expectation in general. So here is an important trick, which I will show again in a later lecture, but here for the bandit case. It is sometimes called the log-likelihood trick, and in reinforcement learning it is also known as the REINFORCE trick, because Ronald Williams had a paper that used it to do policy gradients. I will step through it, because it is kind of cool for one, and it is important to understand what is happening.

What I am doing here is taking the thing we want, at the top left, the gradient of the expectation of the reward, and writing it out. First, I am explicit about the policy. Why is this an expectation? For two reasons: we have a stochastic policy, which gives us an action, and given that action the reward itself might be stochastic. So first we pull out the action: we write down the policy, which gives the probability of selecting each action, and what remains is the expectation of the reward conditioned on that action. Note that this remainder no longer depends on your policy parameters, because we have pulled out all of that dependency by being explicit about the probability of selecting the action. Also note that the expected reward conditioned on an action is just the true action value, so we can write it as Q; that is just plugging in the definition. Another thing we have done is pull the gradient sign, the nabla, inside, which is allowed because Q does not depend on your policy parameters: it is already conditioned on the action, and since we are in the bandit case there is no sequential structure, so given the action, the expected reward no longer depends on your policy. On the next line we do something a little bit weird: we multiply by one, the probability of the action divided by the probability of the action. This is valid because it equals one, and we do it because we can then rewrite: we pull the probability of selecting the action all the way to the front again, and the reason is that the sum then becomes an expectation again. So I pulled the division out and put it next to the gradient of the policy, and pulled the probability of the action to the front, which means by definition we can write it as the expectation at the bottom. One other thing I did down there, which you can do: I could have just kept Q, because this expectation is over everything again, but this expectation of the reward (sorry, I should have conditioned it on theta, by the way, since it is an expectation over your policy) is by definition the same as the expectation of the true value, so I wrote the reward again rather than the true value.
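Here is the derivation just walked through, reconstructed as one chain; I write q(a) for the true action value E[R | A = a] (the slides use Q):

```latex
\begin{aligned}
\nabla_{\theta}\, \mathbb{E}\left[ R \mid \theta \right]
  &= \nabla_{\theta} \sum_{a} \pi_{\theta}(a)\, \mathbb{E}\left[ R \mid A = a \right]
   = \sum_{a} q(a)\, \nabla_{\theta}\, \pi_{\theta}(a) \\
  &= \sum_{a} \pi_{\theta}(a)\, q(a)\, \frac{\nabla_{\theta}\, \pi_{\theta}(a)}{\pi_{\theta}(a)}
   = \mathbb{E}\!\left[ R \, \frac{\nabla_{\theta}\, \pi_{\theta}(A)}{\pi_{\theta}(A)} \right]
   = \mathbb{E}\!\left[ R \, \nabla_{\theta} \log \pi_{\theta}(A) \right]
\end{aligned}
```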
We also get rid of the policy probability at the front, because it now defines the expectation, which is an expectation over your policy, and the other part remains: the gradient of your policy divided by the probability of selecting that action. It turns out, and this is the final line, that this is a general truth: the derivative of something divided by that something itself is the same as the derivative of the logarithm of that something. That is just the definition of the gradient of a logarithm: the derivative of log x with respect to x is 1 over x, and therefore the equality at the end holds. People typically write it this way, with the gradient of the log probabilities; Rich Sutton in his book prefers the other way, keeping the explicit division by the policy, but these are the same.

So why are we going through all of this? Does anybody have an idea? Yes, exactly: now we have an expectation on the outside, very good. So now we can sample this thing. We could not sample before, because the expectation was not all the way on the outside: we had the gradient of an expectation, and sampling the thing inside that does not even make sense. We could sample the reward, but what is the gradient of the reward with respect to the policy parameters? That is not a thing; the reward is just a scalar. Now that we have an expectation around something, we can sample the thing inside, and that is an unbiased estimate of the quantity we had at the very beginning. So this is pretty cool: we can get an unbiased sample of the gradient of the value. Summarizing the result from the previous slide: the gradient of the expected reward is equal to the expectation of the reward times the gradient of the log probability of selecting the action, and we can sample this and turn it into an algorithm. The stochastic gradient ascent algorithm simply updates the policy parameters by a step size times the sampled reward times the gradient, with respect to the policy parameters, of the log probability of the action that you actually took. It is important that this is the action you actually took, because the reward is for that action; you sampled a certain reward for that action. This is stochastic gradient ascent on the true value of the policy. I was talking about plain gradient ascent first; this is the stochastic version of it, and we know that in practice we often do stochastic gradient descent and it works just fine, in a sense. You just have to make sure your steps are not too big, and in the end you will hopefully find a point where this gradient is zero. In simple cases, if your function is convex, or concave since we are doing ascent, you will find an optimum; in general you might not find the global optimum and may get stuck in a local optimum. But that is the nature of gradient descent and ascent, it cannot really be prevented, and you get all the benefits of gradient methods.
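For reference, the sampled update described a moment ago, written out:

```latex
\theta_{t+1} \;=\; \theta_{t} + \alpha \, R_{t} \, \nabla_{\theta} \log \pi_{\theta_{t}}(A_{t})
```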
In particular, and I will come back to this later, what is nice about this is that your policy can now be a very complex thing. It can be a deep network, because we know how to do gradients on deep networks, and that is something you could not get with the more involved algorithms we covered before. Note that we do not need a value function estimate; we can just use the sampled rewards, so we do not necessarily have an explicit notion of value, only a parametric policy.

Now we can instantiate this to make it a little more concrete. Let us go back to the policy parameterized with preferences and see what the algorithm looks like. It turns out, for the softmax (I encourage you to try to derive this yourself; I think Rich Sutton has a derivation in the book), that the partial derivative of the log probability of the selected action with respect to the preference of a potentially different action a (note that A_t and a are not necessarily the same) is the indicator of whether a equals A_t, minus the probability of selecting action a. I find that a little opaque, so I find it much clearer written out as at the bottom. Essentially, you selected a certain action and you update all the preferences. Interestingly, because the policy is parameterized with these preferences, there are two ways to make a given action more likely: you can increase the preference for that action, or you can decrease the preferences for all the other actions. It turns out the algorithm does a little bit of both: it pushes the other actions down and pushes this action up. This again shows that these preferences are not values: you have learned nothing about the values of the other actions, but you can still learn to prefer them a little more or a little less. The way it works is that you update your preferences depending on the reward. Consider a simple case where rewards are either plus 1 or minus 1, you selected a certain action, and the reward was plus 1. The preference for that action increases, and it increases in proportion to one minus the probability with which you selected it. If you were already selecting that action with probability close to 1, there is little change: it was a good idea, you got a reward of plus 1, but you were already selecting it all the time, so you do not change much. If your probability of selecting that action was very low, you make a bigger adjustment, because something good emerged that you were unlikely to pick. The other preferences, when you get a reward of plus 1, go down: if what you selected was a good idea, you might want to select the other actions a little less. Conversely, if the reward was minus 1, the preference for the selected action goes down, and the preferences for the actions you did not select go up. In total this means you will select actions that were a good idea more often, and actions that were a bad idea less often. This gives you a nice intuitive algorithm, but remember that it is just derived: we parameterized the policy in a certain way and we are simply doing gradient ascent on the value.
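To make this concrete, here is a minimal runnable sketch of the gradient-bandit update just described, H(a) <- H(a) + alpha * R * (indicator(a == A_t) - pi(a)), on a made-up three-armed Bernoulli bandit with rewards in {-1, +1}; the arm probabilities, step size, and number of steps are assumptions for illustration, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
true_success = np.array([0.2, 0.5, 0.8])   # assumed arm success probabilities
H = np.zeros(len(true_success))            # action preferences, start equal
alpha = 0.1                                # step size

def softmax(h):
    z = np.exp(h - h.max())                # subtract the max for numerical stability
    return z / z.sum()

for t in range(10_000):
    pi = softmax(H)
    a = rng.choice(len(H), p=pi)                           # sample an action from the policy
    r = 1.0 if rng.random() < true_success[a] else -1.0    # reward in {-1, +1}
    onehot = np.zeros(len(H))
    onehot[a] = 1.0
    # the selected action's preference moves with the reward; the others move the opposite way
    H += alpha * r * (onehot - pi)

print(softmax(H))   # most of the probability mass should end up on the best arm
```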
There is an intuitive algorithm here that tells you to shift your preferences this way or that depending on the outcome, but it simply emerges from deriving the update. There are slightly less intuitive cases. For instance, if all of your rewards are strictly positive, the update will always push up the preference of the action you selected, by the nature of the update. However, it will push up the preferences of the high-value actions more than the preference of a low-value action, and in total it again ends up selecting only the action with the highest value, even though it pushes all the preferences up. So sometimes this is a bit less intuitive, but it is still the same result in the end: it just shifts your preferences. Okay, is this clear? Yes, so if I paraphrase correctly: suppose you want to solve a very hard problem, say with pixels coming at you, where it is hard to say what actually needs to be captured about the problem; then you want to use a deep neural network and just learn the structure. One way to set it up is that the input is your pixels, you go through several layers, and the output is a vector whose semantics is the preference for each action. Then you can use exactly this algorithm to learn those preferences. The gradient will flow all the way into the network: right now we are parameterizing with the preferences themselves, but the general statement holds for the full weight vector of your neural network, so you can update all of the weights. You just push the gradient into the network with backprop, update all the parameters appropriately, and it will slowly change these preferences towards higher rewards. That is exactly what happens, and this is not just a theoretical possibility; it is what is actually used a lot in practice.

Yes, very good question: how would you deal with continuous action spaces? The action-preference version maybe does not apply as easily; you can still define action preferences, but you have to somehow define a probability distribution over the actions. That is perfectly possible, and this approach applies more generally. To give two concrete examples, which I will also come back to later in the course: first, you could define a Gaussian probability distribution, which is still a stochastic policy and still parametric, and do exactly the same trick. Your gradients will look different, since this slide is specific to the softmax preference version, but all we need is a stochastic policy with a well-defined log pi for each action, which it will have because it is a probability distribution; then you can apply this to, say, Gaussian actions in a possibly high-dimensional space. The other thing you can do is something similar but different, where a deterministic policy suggests one continuous action. I will not explain exactly how that works, but there are also algorithms that use gradients to update such policies. It is a good question, thanks. And this has actually been applied to balancing pendulums and things like that, for which you typically need continuous actions.
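As a hedged illustration of the Gaussian option (not the lecturer's code, just a sketch under simple assumptions: a one-dimensional action, a learnable mean, a fixed standard deviation, and a made-up reward function peaked around 2.0), the same sampled-reward-times-score update works because log pi is well defined:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, alpha = 0.0, 0.5, 0.01          # policy mean (learned), std (fixed), step size

def reward(action):
    # hypothetical continuous bandit: the best action is around 2.0
    return -(action - 2.0) ** 2 + rng.normal(scale=0.1)

for t in range(20_000):
    a = rng.normal(mu, sigma)              # sample an action from the Gaussian policy
    r = reward(a)
    grad_log_pi = (a - mu) / sigma ** 2    # d/dmu of log N(a; mu, sigma^2)
    mu += alpha * r * grad_log_pi          # same sampled-reward-times-score update as before

print(mu)   # should drift towards the highest-reward action, here roughly 2.0
```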
You can also handle such problems by discretizing the action space and then using something like the approach above, but in many cases that does not make a lot of sense, because where you want continuous actions you typically also want to generalize across the action space, and that turns out to be very useful. I will talk more about continuous actions later in the course.

We are almost out of time, so I will go through this quickly, but it is still important to cover. Note that the sum of the probabilities is always one, or the integral if the actions are continuous. What this means is that you can take any b which does not depend on the action (and, in this case, not on your policy parameters), and the sum over actions of b times the gradient of the policy probabilities will be zero. The proof is almost trivial and takes a couple of steps: we pull out the sum of the probabilities, which is guaranteed to be one, and the gradient cannot move that sum, so the gradient must be zero. This matters because it means we can subtract a baseline from the reward in the update. One way to think about the baseline is as the expected value across actions; that is a common choice, and in fact Rich Sutton, in chapter two, proposes to simply use the average of all rewards seen so far as the baseline. What is the purpose of the baseline? It does not change the expectation, because of the small proof above, so the expected update is still gradient ascent on your value, but it can reduce the variance. Go back to the case where all rewards are strictly positive: instead of always pushing the preference of the selected action up, which would otherwise be the case, the algorithm now moves it up if the outcome was better than expected and down if it was worse than expected, where "expected" is defined by the baseline. The baseline is just something you estimate, and it does not have to be very accurate, because you are only using it to reduce the variance of the updates; it does not have to be the true value of your policy. The figure, which is from chapter two (feel free to look at it in more detail there), shows the effect this can have on performance: with a baseline it is much better than without, for this specific setup. The details of the setup are in chapter two and not on the slides.

Okay, I think we need to vacate the room. One thing I wanted to say before we leave: if you go to the slides, there is a small part going back to the earlier example, where, I hope I calculated this correctly, the probabilities for each of these algorithms are given. I will see you next Tuesday.