DeepMind x UCL RL Lecture Series - Deep Reinforcement Learning #2 [13_13]

**Learning Return Distributions: A Distinction from Expected Values**

In recent years, there has been a significant amount of research focused on learning return distributions in reinforcement learning. This is in contrast to traditional reinforcement learning methods that rely solely on expected values. While we can reuse algorithms and techniques from previous lectures by simply applying them to different cumulants (rather than the main task reward), the problem of learning return distributions requires us to expand and extend our temporal difference algorithms in interesting ways.

**The Categorical DQN Agent**

One approach that has been introduced is the categorical DQN agent. The objective of this agent is to learn a categorical approximation of the true return distribution. To do this, we need to define a fixed comb-like distribution, i.e. a set of evenly spaced values, to act as a support for expressing the categorical approximation of the returns. For instance, we might allow the return to take any of a fixed set of evenly spaced values between -10 and +10 (e.g., -10, -9.9, -9.8, and so on, up to +10).

**Learning Probabilities over a Fixed Support**

To learn these values, we use a neural network to output a vector of probabilities, one for each element of this support. Instead of learning an expected value, as we would traditionally do, we learn a probability distribution over a fixed categorical support. It's essential to note that this is a strict generalization, because we can still recover the expected value by computing the dot product between the fixed support and the network's predicted probabilities.
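As a minimal illustration of this last point (not from the lecture itself, and assuming a hypothetical support of 201 evenly spaced atoms between -10 and +10), recovering the expected return is just a dot product:

```python
import numpy as np

# Hypothetical fixed support: 201 evenly spaced atoms between -10 and +10.
support = np.linspace(-10.0, 10.0, 201)

def expected_value(probabilities: np.ndarray) -> float:
    """Recover the expected return as the dot product between the fixed
    support and the predicted probabilities (e.g. a softmax over 201 logits)."""
    return float(np.dot(support, probabilities))

# Example: putting all probability mass on the atom at 0.0 gives an expected return of 0.0.
probs = np.zeros(201)
probs[100] = 1.0
print(expected_value(probs))  # 0.0
```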

**Extending Temporal Difference Algorithms**

Our temporal difference algorithms can be extended to this distributional setting in a relatively clean way. We consider a transition tuple consisting of a state `s_t`, a reward `r_{t+1}`, a discount `γ`, and the next state `s_{t+1}`. The network's predictions at `s_{t+1}` provide our bootstrapping target, but we first need to shrink the support of that target distribution by the discount factor `γ` and shift it by the reward `r_{t+1}`.

**Projecting the New Support**

The issue is that, after this shrink-and-shift, the support of the target distribution no longer matches the fixed support on which we make predictions in the previous state `s_t`. We therefore need an additional step: projecting the target onto the support that we make predictions for, which essentially requires reallocating probability mass from the shifted atoms to the nearest fixed atoms.
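A minimal sketch of this shrink-shift-and-project step is shown below. It is an illustrative reconstruction under the same hypothetical 201-atom support, not code from the lecture; `next_probs` stands for the probability vector predicted at `s_{t+1}`.

```python
import numpy as np

V_MIN, V_MAX, N_ATOMS = -10.0, 10.0, 201
support = np.linspace(V_MIN, V_MAX, N_ATOMS)
delta_z = (V_MAX - V_MIN) / (N_ATOMS - 1)

def project_target(next_probs: np.ndarray, reward: float, discount: float) -> np.ndarray:
    """Shrink the support by the discount, shift it by the reward, and project the
    resulting distribution back onto the fixed support by splitting each atom's
    probability mass between its two nearest fixed atoms."""
    target = np.zeros(N_ATOMS)
    shifted = np.clip(reward + discount * support, V_MIN, V_MAX)
    b = (shifted - V_MIN) / delta_z                        # fractional index of each shifted atom
    lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
    for j in range(N_ATOMS):
        if lower[j] == upper[j]:                           # lands exactly on a fixed atom
            target[lower[j]] += next_probs[j]
        else:                                              # split mass proportionally to proximity
            target[lower[j]] += next_probs[j] * (upper[j] - b[j])
            target[upper[j]] += next_probs[j] * (b[j] - lower[j])
    return target
```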

**Minimizing KL Divergence**

To update the probabilities, we can minimize the KL (Kullback-Leibler) divergence between the projected, shrink-and-shifted target distribution (constructed from the predictions at `s_{t+1}`) and the predicted distribution at `s_t`. Minimizing this loss adjusts the predictions at `s_t` towards the bootstrapped target.
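Since the projected target is held fixed (no gradient flows through it), minimizing this KL divergence amounts to a cross-entropy loss on the predicted distribution at `s_t`. A minimal sketch, continuing the hypothetical setup above:

```python
import numpy as np

def distributional_loss(pred_logits: np.ndarray, target_probs: np.ndarray) -> float:
    """Cross-entropy between the projected target distribution (e.g. the output of
    project_target above) and the predicted distribution at s_t; with a fixed target
    this differs from the KL divergence only by a constant."""
    log_probs = pred_logits - np.log(np.sum(np.exp(pred_logits)))  # log-softmax
    return float(-np.sum(target_probs * log_probs))
```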

**Recent Advances: Quantile Regression**

More recently, quantile regression has been proposed as a way to transpose the parameterization of categorical DQN: instead of adjusting the probabilities associated with a fixed support, we adjust the support locations associated with fixed probabilities (quantiles). This provides several advantages, including that the support can move around to approximate the distribution as well as possible, without being constrained by an arbitrarily defined range.
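A minimal sketch of the core idea follows, assuming a hypothetical set of 51 quantile midpoints and using the plain (pinball) quantile regression loss; agents in the literature typically add a Huber smoothing term and further refinements.

```python
import numpy as np

N_QUANTILES = 51
# Fixed quantile midpoints tau_i; the network now outputs one support location
# (a predicted return) per quantile, instead of one probability per fixed atom.
taus = (np.arange(N_QUANTILES) + 0.5) / N_QUANTILES

def quantile_regression_loss(pred_quantiles: np.ndarray, target_samples: np.ndarray) -> float:
    """Asymmetric (pinball) loss: errors are weighted by tau when the target lies
    above the predicted quantile and by (1 - tau) when it lies below, which drives
    each output towards the corresponding quantile of the target return distribution."""
    u = target_samples[None, :] - pred_quantiles[:, None]      # pairwise TD errors, shape (N, M)
    weight = np.abs(taus[:, None] - (u < 0.0).astype(float))   # tau or (1 - tau)
    return float(np.mean(weight * np.abs(u)))
```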

**Conclusion**

The field of learning return distributions is a rapidly evolving area of research, with ongoing work focused on extending traditional temporal difference algorithms and exploring new representations and learning objectives. By leveraging techniques like categorical DQN and quantile regression, we can develop reinforcement learning methods that capture the full distribution of returns rather than only their expected value.

"WEBVTTKind: captionsLanguage: enwelcome back to the second part of our introduction to the parallel in the first section we started by discussing how deep neural networks can be used for function approximation in rl and how automatic differentiation specifically can support doing so with relative ease and then we delved into the issues that arise when doing so when using deep learning for function approximation so for instance we discussed how the different choices that we make on the rail side affect the learning dynamics of approximate value functions through phenomena such like the deadly triad or literature propagation several of the issues that we discussed though were ultimately issues of inappropriate generalization so today i want to talk about some of the ideas that can be used to help with this problem by tackling directly the fundamental problem of representation learning in rl i want to stress that this is far from a solved problem so what i will discuss today is you can see it as a partial snapshot of recent research on this challenging issue but not an ultimate answer but the main insight that underlies many of the things that we will discuss today i think is quite important and is that so far our agents optimized the representation for a very narrow objective purely the prediction or maximization of a single scalar reward and this narrow objective is the only thing that is driving the entire representation learning in our d parallel agents and this has some advantages so the ability of building flexible rich representations that are kind of tailored to a specific task at hand is after all the main reason we use deep learning in the first place but it does come with some disadvantages because such a narrow objective can induce an overly specific overfitted state representation that might not support good generalization and this in turn can make agents even more susceptible to issues like the deadly triad if you agree with this premise then maybe the natural step is to ask our agents uh to learn about more than just a single task reward have them strive to build the richer knowledge about the world but of course this is you know in a sense it's easier said than done because we in order to do so we need to think of you know what other knowledge should they learn about and there are many many possible choices and since representation learning is not a problem that is exclusively url we can also tap into supervised learning literature for some inspiration but among the many possible ideas that can help build these representation i want to focus on two families of ideas that have attracted very very a lot of interest among our elder researchers in the past years and these are general value functions and distributional value predictions so let's start with the first general value functions if you recall from the beginning of this course rl is based on the so-called reward hypothesis that states that any goal can be represented as the maximization of a suitable scalar award and this hypothesis was originally discussed to argue that maybe arel as a whole is a sufficiently formalism for intelligent gold-oriented behavior but today i want to use this hypothesis to make a different point and argue that if this is the case maybe then all useful knowledge that agents should or could collect in order to support learning can also take the form of predictions about suitable cumulative scalar signals but importantly this predictive knowledge does not need to refer to the single main task 
circle reward instead the agent could make predictions about many different other scalar quantities and these predictions would still look very much like value functions but it would refer to a different scalar signal and are therefore typically called general value functions the general value function is in a sense very much like a standard value function in that it is a prediction about the expected cumulative discounted sum of a suitable scalar signal under a given policy but in general many functions we make explicit this dependency on the scalar signal c and discount factor gamma and the policy pi because we are open to make different choices for each of these so in the gvf language the scalar signal c that we choose to predict will be called the cumulant while the discount gamma associated with etf we'll still define a horizon for the predictions that we make about the death cumulant and the target policy pi is going to be an arbitrary behavior under which we compute expectations this will not necessarily be the agent policy of course this may still feel a bit abstract so let's be even more concrete let's let's consider some examples so if the cumulant is the main task reward and we compute the expected discounted cumulative sum of this signal under the agent policy and under the age and discount then the gdf prediction problem just reduces to canonical policy elevation that is the problem we have dedicated so much space in previous lectures but if instead for instance the cumulative is the still the main task reward but and we still predict an under agent policy but with this kind of zero well then this becomes an immediate reward prediction problem and we are not supposed to have only one prediction we're making so we could have n equivalents each corresponding for instance to one of the state variables or one of the features and if we predict these under the agent policy with a discount of zero then this becomes the next state prediction problem and of course we can you know we can go on and on we can consider many different cumulatives many different horizons many different hypothetical behaviors on which to predict these cumulatives and the beautiful thing about the gbf frameworks is that it allows us to just represent this very rich knowledge about the world but learning all of these predictions with the same mechanism so for instance we could use the same td algorithms that we reserved to to predict make value predictions for the main task reward to predict any of these and the main problem then becomes not how to learn such knowledge which sometimes we can address with the standard rl but how do we use these predictions this rich knowledge about the world to provide our agents with good representations that can support fast learning effective generalization and all the things that we are after one one beautifully simple approach would be to use the gbf's predictions directly as representations and this is what is called predictive state representation or psr and it's based on the argument that for a large sufficiently large and specially diverse set of gbf predictions these will be sufficient statistics for any other predictions that we might want to make including for instance the value predictions for the main task award therefore we can use the gdf predictions themselves as state features and then learn values or policies uh for the main task as linear functions of these predictions and this actually has a number of appealing properties and so i would encourage you to read 
the paper linked at the top to learn more about this but it's also not the only way of using gbf predictions another option for how to use gdfs for learning state representations is to use them as auxiliary tasks so this use of gps resembles a number of techniques from supervised learning um where forms of auxiliary tasks have been introduced to help with representation learning so i think for instance to the self-supervised learning objectives that are common in computer vision and this approach has the advantage of being especially well suited i think to be rl agents because of the because the compositional nature of neural networks allows to combine auxiliary predictions with the main task predictions with relative ease when we use gbfs as auxiliary tasks the way we typically do so is by sharing the bottom part of a neural network that we're using as function approximator between the main task prediction so for instance a policy prediction or a value prediction and all of the auxiliary gpf predictions and by doing so what what happens is that both the main tasks and auxiliary predictions become a function of a single shared hidden representation and then both the shared and unshared parameters can be optimized by minimizing jointly the losses associated with both types of predictions and the result is that the hidden share representation is forced to become more robust and more general and encode more about the world so this this specific way of using gbfs as auxiliary tasks was for instance implemented in the unreal agent it was introduced by jadabarg and a few others a few years ago so in unreal a neural network is used to map input observations to both a value prediction and a policy prediction because it's based on a on a fairly standard act operating system but additionally a 400 distinct cumulants are constructed as the average change in intensity of pixels between consecutive observations and then an additional head is connected to the hidden representation from which we make value and policy predictions up just after the the end of the convolutional stack and this auxiliary head is then trained to predict gdfs for each of these 400 humans and these auxiliary losses are then referred to as pixel control losses and are just summed to the standard policy and value losses and everything is optimized end-to-end the same kind of system can also be applied if we if the observations are not images or if the observations themselves were too big to constitute to use them directly to construct this this large number of cumulatives so for instance in the same paper they introduce another related gpf based auxiliary task which is what they called feature control again it works by constructing a large number of cumulatives but these are computed as the differences between the activations in the network itself between consecutive steps instead of being differences in the intensity of pixels but similar to pixel control once we have this large set of cumulatives however we derived them we then they they just learn the gbfs associated to each of these kilometers by having an auxiliary prediction head again that shares the bottom convolutional stack with the main task policy and the main task values but then is is forced to also support this additional auxiliary tasks and therefore learn a richer more effective representation basically it may seem as a small change but actually making these gbf predictions either in the pixel control type or the feature control type did make a real difference so 
for instance in this plot taken from the unreal paper you can see how using these auxiliary predictions so the blue line labeled unreal in this plot delivered a huge improvement in the raw performance of an actor critic system on a suite of challenging navigation tasks and this is despite these auxiliary predictions only affecting the rest of the system through the improvement in representational learning so for instance we're not using them to implement pcr for instance or for anything else except propagating gradients and updates into this shared representation and it might seem surprising at first that how can make all this disturbance difference after all the cumulatives that we are predicting uh these gps for don't seem to encode a particularly useful type of knowledge it may actually seem somewhat arbitrary and contrived and particularly in the pixel control setting and and even worse we're not actually making use of the resulting predictions in any way but but if you consider for instance pixel control making these predictions regarding the variation in intensity of pixels between consecutive observations actually requires the agent to understand many non-trivial aspects about the environment for instance how the agent actions affect the location of objects in the field of view of the agent so even though they may seem you know a bit contrived that these representation actually are forced their representation to encode a lot of useful knowledge and and that is why they're likely to then make their implementation more useful even for other tasks and this i want to stress is particularly important and particularly particularly useful in settings where maybe the main task reward itself is actually sparse and would therefore provide very little signal at least early in training while in contrast these auxiliary cumulatives that were constructed are different features or from observations can provide a very dense learning signal and then it can help kind of bootstrap representation learning even before the agent has had the chance to see the first reward for instance and then when it does actually see these rewards it can pick up on these more quickly and then learn much more effectively to understand though why using gbfs as your predictions is actually useful it's it's maybe worth thinking uh what happens to the feature representation to try to understand what is the actual effect of this so let's start with the linear function approximation case so this is the plot on the left and here we see that what happens in our unfortunate learning updates or to some value functions when we're doing linear function approximation is that we have some fixed feature representation and we construct a target value function using a suitable operator for instance a one step temporal difference operator and then this target value is projected onto the space that you can actually represent under the fixed speaker representation and the parameters are updated accordingly in deep reform learning this is the plot in the middle we have a more complex phenomenon again we construct some sort of target value function using a suitable operator but then we project on the space of value that we can represent not under the original feature representation but a new one that is updated to habit support as well as possible this new value target so we have both we're both changing the the final value predictions but also we're changing the representation itself to support these value predictions so what happens when we 
add auxiliary gbf predictions like the ones that we discussed with pixel control or feature control but what happens is that we're regularizing the second step so we are preventing the representation from becoming overly specific to the current value predictions and what we find at least empirically is that this regularization does seem to help quite a lot but by itself this interpretation while it helps understand what happens when when we use gvs as a clear task in a way it maybe raises a bit more questions than answers because after all isn't it desirable for the representation to be updated to support the value predictions as well as possible so why should regularizing the representation help in the first place and if it does which gdfs would provide the best regularization so let's try to answer these uh one at a time so the first thing to keep in mind to understand why using gbfs to regularize representation is useful in the first place is that over the course of learning we will actually need to approximate many value functions this is because if we're doing our job properly in rl the agents will get better over time therefore both our data distribution and our value predictions even in the same states that we have seen in the past change as the agent's behavior changes and this means that we want a representation that can support not just good value predictions right now but it can support approximating all all of the value functions in a way that that on this path in in value space that goes from the initial values of the initial policy all the way to the values of the optimal policy regulation can help us achieve this by preventing the representation from overfitting this where about as useful as effective so consider the space will be depicted on this slide that corresponds to the values of all the various policies that agents pass through over the course of training so so to understand how the different choices of gps will affect us will will affect the learning and it will make the representation more or less effective we need to look at how the choices of the target policies and humans affects the representation and the and how this interacts with all of the elements that we just defined so starting from left to the first step here representation learning is only driven by accurately predicting the current value so in this case there is nothing to force the representation to be well aligned with any other function except for a parameter so here we have a vector but these correspond to humans that are different from the main parts of the world this means that their value actually lives outside again also in this case there's actually no strong reason to believe that regularization will for instance and these actually correspond exactly to this second plot but it's good to realize there is a stronger report it might be sometimes higher the third platform captures instead of the case where we use gdf absolutely to predict the main tax reward over a fifteen set of different target policies that are different foundations so now the representation is forced to support a range of values within the polytope and given the geometric structure of this space of value function it actually can be shown that for a suitable set and choice of policies the news representation will capture the principal components of the value polytop and therefore we provide good supports to approximating values in the polysome and including the ones in the valley improvement path but unfortunately the exact 
solution to construct the right set of policy is computationally interactable so in the final plot we show a concrete approach to picking these policies in a way that is instead tractable and the intuition here is that the actual value improvement path so the set of values that we will care to predict during the course of learning is actually much smaller than whole polytope of all the possible value functions for all possible policies so maybe we should just target the values on this path and at each point in time while the future policies on the path are known we do have already at least passed through some sequence of policies and associated values during training this is basically a sequence of policies and values that we have been predicting so far so rather than picking the policies arbitrarily we could take advantage of the trajectory to pick a selective of a selection of policies as least aligned with the value improvement path up to the current moment and by picking these uh past policies as the the policies to use as targets in our auxiliary gps then we don't guarantee that these policies will induce a representation that optimally supports future values but at least it must support well um the values on a subset of the value improvement path and it provides us both an informed choice that is at least reasonable and in a choice that is computational attractable because we have access to these because we went through these policies during training and indeed this choice of gdf's auxiliary tasks was actually found to perform the best among all the choices that we discussed in a recent empirical study learning about multiple gps as an auxiliary task is basically turning agent learning into a multitask problem because we're now jointly training a shared representation to support many predictions that we can see as different tasks this is great for all the reasons that we discussed so far but you can also introduce a few challenges so when we want to our agents to learn as much as possible about the world and make all of these additional predictions we we need to face the fact that we only have limited resources so we have limited memory we have limited representation capacity computation and so on so different tasks will always find themselves competing with each other for these shared resources so any concrete system will actually need to define some way of trading off these competing domains and the important thing to realize is that there is always a trade-off so even if you don't make it explicit even if you don't do anything fancy about it then the system will make some trade-off so for instance the magnitude of the predictions and the induced gradients for for different tasks will be different for different gbf predictions and this magnitude will scale linearly with the frequency and the size of the individual cumulant so it will be quite different across predictions this means that the updates from the different predictions will basically be re-weighted accordingly in terms of how much they contribute to the to shaping the shared parameters so if we actually want these trade-offs to be sensible we need to think about them because otherwise we'll just be making some trade-offs but these might not be the trade-offs that we actually want for our agents understand how important and also how difficult the task is and how much the magnitude of the gradient can actually differ when making different types of predictions i think it's good to consider the graph in this slide so these 
three plots were generated by showing the gradient norms during training of a value-based agent on different atari games for three different types of agents so the different atari games here constitute different tasks that you might make value predictions for and in all the three plots the lines correspond to different percentiles of the magnitude of the gradient norm so this means that the width of these distribution gives you an idea of how diverse gradient magnets can be across different tasks of different predictions in this case the values of predicting the values in different entire games and what you see is that on the on the left and this is vanilla q learning the magnitude of the gradients actually spans eight orders of magnitude depending on which task you're in with basically gradient norms ranging from ten to the minus two to greater norms in the order of the millions and in the second plot we show what happens if the individual rewards are clipped to a small minus one to plus one range and then again vanilla q learning is applied so this reduces the range but is it is important to see that the grain norms actually still span almost four orders of magnitude and this is because it's not just the size of the individual cumulants that you're predicting in a different task that counts even if the individual rewards are of a similar size the frequency of these rewards will be different between tasks and the gradient magnitude actually scales with the value magnesium not the magnesium individual reward and furthermore if we look at even at the individual tasks um the grading magnitude actually changes during the course of training because they as the agent's behavior changes and the number and the size of rewards that the agent collects changes so does the magnitude of the updates and this is already a problem if you're training our leap rl agents on individual tasks because you can imagine how hard for instance it can be to tune hyper parameters so the learning dynamics can be so different across tasks but it also means that any naive multitask prediction problem such as predicting auxiliary gps will will be really hard to get right unless you do something to con to control how you're trading off the demands from different tasks because ideally what we would want is that across the different tasks that we're making gradients look like in the third plot on the slides the green plot here you can see that across all of these prediction tests the gradient magnets are actually confined within a reasonably small range and this means that we can then be explicit since we can assume that the gradients themselves have a similar magnitude then we can choose explicitly how we trade off between the different tasks we could just assign an equal weight in which case given that the gradients have been equals an equal magnitude then they would equally shape the representation or we can choose for instance to you know put a bigger weight on some tasks which we consider our main task and treat the others as auxiliary tests and maybe contributes to shaping their presentation but with a smaller weight the problem is how do we get there how do we get to the point where our gradient updates have a comparable magnitude across all the many different predictions and all the many different tasks that we their agent could be trained to to to learn so the way we get there is by using what is called an algorithm and we're using an algorithm that is called pop art which so that those plots are actually generated 
by running a vanilla learning algorithm but with pop art on top so but before i delve into exactly how this algorithm works um it's good to discuss another thing which is if the issues can be so dramatic when training the parallel systems to make different predictions why isn't it usually discussed in supervised learning because it also in supervised learning we sometimes use a multitask system and the reason is that in supervised learning we typically assume fixed data sets and this means that we can easily normalize both inputs and targets across the entire dataset for any number of target variables that we want to predict and everything will be always be well behaved and this is actually what we do in supervised learning we just don't even think much about it but we where we normalize variables because before feeding it into a deep learning system because it's such a trivial preprocessing that it doesn't require much thought but the problem is that in reinforcement learning we do not have access to a full dataset and the scale of prediction is even known stationary so it even changes over time which means that any normalization scheme will need to be adaptive to ch and to to always normalize appropriately across the duration of training and this is a much more complicated system and if that requires to actually think deeply about what we're doing luckily this problem was already addressed so for instance there are a few different ideas that we propose in the literature but one that i want to discuss today is is the pop art algorithm from the plot in a few slides ago so this was introduced by hado and me a few years ago and the algorithm works in two steps so the first step is what is called adaptive target normalization so consider any one prediction that you're making so for instance one of the gvs on each update you will typically observe some targets for that prediction could be for instance a q learning target which you construct you can construct for whatever cumulative you're learning a gbf for then what you what you can do with pop art is to normalize this target adaptively by keeping track of the first moment mu and the second moment new of the targets for that prediction so for instance by doing some exponential moving average of the targets associated to one of the gps and then you can update the network outputs to match not the original target but on each step use a normalized target that is constructed from the target for instance a q learning target by subtracting the first moment and dividing by the variance sigma which is estimated from the first and second moment by just subtracting the square of mu from nu and this will basically provide you a gradient update that is much better behaved in terms of magnitudes irrespectively of what is the size of the rewards what is the frequency of the rewards and so on and this means that if you apply this normalization independently to each gdf the gradients that apply that you will apply to the shared parameters of the of the network will contribute equally to the shared representation instead of having what one or more of the of the auxiliary predictions uh dominates the entire learning process importantly when doing this kind of normalization you can still recover the unnormalized q values by just multiplying the network outputs which are trained in this normalized space by these statistics sigma and mu and this is important because you actually need the unnormalized values in certain circumstances for instance to construct the 
targets via bootstrapping the problem with the adaptive normalization as we just discussed it here is that every time you update the normalization statistics which we typically do on each step because we're just keeping try a moving average you're actually normalizing the update in the current state but you're inadvertently changing it on normalized agent predictions in all other states and this doesn't seem good because there is no reason for indiscriminately changing the value of all other totally unrelated states and also it's not seen not only seems you know a bit fishy but it's also completely ad for instance non-stationary which we we know can make life harder for our prediction algorithms but luckily we can actually prevent this from happening at all with a very simple trick which i'll i'll discuss in this slide this is a based on the on the observation that most neural networks typically have a final fully connected or otherwise linear layer at the very end so you can effectively write the network output as a linear transform of the activations of the last hidden layer so the normalized q values will typically be some matrix w times v plus a bytes vector b for a suitable wnb and a suitable hidden representation v which in general will include any number of non-linear layers for instance some convolution layers with some reload activations and so on the insight from pop art is that every time you change normalization statistics you can actually undo this change in terms of the unnormalized predictions by making it the reverse in a reverse update to the weights and biases of the last layer and this can actually be done with a very simple formally formula in an exact way the way the way popar does it is by multiplying the weight matrix by the ratio between the old and the new scale factor and updating the bias as well with this slightly more complicated expression we just showed on this slide if you do this then we get the best of both worlds because on each step we can still normalize the targets in our gradient updates as in the previous slides but the unnormalized predictions that we use for instance for bootstrapping are not affected by this continuous change of this normalization statistics and this prevents any instabilities this merge has been actually very successful in the past so for instance in this plot we you can see what happens if you train a single agent to make value and policy predictions for 57 different atari games with all these predictions sharing the bottom layer of the network the version with pop art is the one shown in orange you can see how it performs really much better than any naive baseline that does not normalize the updates for the different tasks so the orange line actually gets to above human performance in aggregate acro across the 57 games while the other baselines actually struggle to reach even 60 defense 60 of human performance but the well this this plot shows specifically for the case of tyree it's important to notice that the approach is in no way specific to atari or the specific multitask setting and can be used whenever you want multiple predictions to use a shared representation but you want to trade off in in a sensible way their relative contributions as for instance in a gbf based auxiliary task scenario that we discussed in the previous slide in this section i want to discuss a few more advanced topics in gdf learning that we don't quite know yet how to tackle in order for still a very active arab research and i won't go into much 
detail of the specific solutions as the specific methods used to address these issues will likely change as our understanding of these problems improves instead i will give you a brief overview of what these problems are and just this neat peak of what the frontier of research looks like in this area the first important topic that we need to make progress on to really scale these ideas up is of policy learning so in something we discussed so far we already have multiple auxiliary gpfs that are only used as auxiliary and are learned from prediction experience that is generated by a different main task policy since the gps might refer to a different target policy in a different cube and a different discount learning these auxiliary tasks already will require of policy learning and as you know from the previous lecture we we have some tools to deal with of pulse learning but the reason i consider this still an open problem is that the degree of policiness that you might face in this setting where you might you're striving to learn like this rich knowledge about the world and many many different predictions from a single stream of experience the degree of policiness that you face here might be quite extreme and so i really think we will need fundamental improvements to our policy methods to really succeed in learning this fully diverse predictions about the world as auxiliary tasks and the another reason of policy learning is interesting is that in the context of gdf learning it's not only a challenge an obstacle to overcome but it's actually also potentially an opportunity because if we are predicting values for many different humans policy and discounts we could for instance use the additional predictions not just as auxiliary tasks but to generate experience that is more diverse and provide an excellent form of exploration even for learning some main task policy but how to best do so is still an open problem and again will require improvements also in in the how well our methods can cope with wildly of policy data still even though it is still a problem for we have at least some proof of concepts that this ideas can work and can provide meaningful improvements for instance in a unicorn paper from a couple of years ago we showed how a multi-task system learning about many tasks of varying difficulty but sharing experience between all these tasks so that each each prediction was learned off policies from data generated from all the from behavior that was induced by all other predictions then this sharing could be allowed to solve certain heart problems that were impossible if you are only striving to optimize for the hardest task another important problem that might need to be revisited in the context of gps learning is generalization so far we have treated gbfs as discrete sets of predictions that potentially share a hidden representation but are otherwise learned independently and but how do we scale this to thousands or millions of predictions learning about all of them independently might not actually be that effective so just like learning about the value of each state in mdp was not very effective so several people and are quite excited about investigating whether we can use a similar approach used to learn values in large state spaces and try to generalize what we learn about one gbc to other related gvc in some large space of predictions and problems one concrete approach to doing this is to feed the representation some representation of accumulant or discounts that we wish to make a 
prediction for as additional inputs to a single network that makes predictions for all gbs we're interested in so instead of having a network that only takes one state and outputs multiple independent predictions this would be a network that takes the the representation of which predictions you are required to make as inputs and then basically exposes do generalization both states and goals and tasks and cumulatives and discounts by using the same function approximations uh two techniques that we have used to generalize across states so this kind of gbfs where we attempt to generalize across different predictions are referred as universal value functions that are actually a very exciting arab research and deeper enforcement learning the third important open problem they want to mention is discovery so this is the problem where do gbfs come from even if we know how to learn about all of these of policy even if we know how to generalize across many different predictions where do the predictions themselves come from how do we pick which gbs to learn about so the previous section we discussed that there are many different ways we can construct gps right we can construct pistol-based human lens we can build feature based humans we can predict the main task reward under previous policies of the agents but while many of these work in practice and while specifically at least the value improvement path interpretation actually gives us at least a relatively principled way of picking gps the research in how to choose what to learn about is really really far from concluded so among the recent approaches are quite different from what we discussed supplier i want to briefly mention at least one that we introduced with a few colleagues including haddo in the paper discovery of useful questions as auxiliary tasks and here we proposed that maybe we should learn from experience what are the useful gbs that our agents should learn about and specifically we propose to do so by parameterizing the cumulus and the discounts that we want to learn about as neural networks and then use a form of metal learning called metagradients to kind of discover online what questions our agents should ask about the world and then try to learn um online while it's learning about the main task for instance and this actually resulted in quite a nice performance gains in atari for instance the final topic for today is what is referred to as distributional reinforcement already no are discussions so far if you think about it gbfs were still representing predictive knowledge but in the form of expectations so expectations of the cumulative discounted sum of some scholar quantity as usual another approach that has been proposed is to instead move towards learning distribution returns instead of expected value so this generalizes the usual prediction problem in a different direction so instead of changing the cumulative or the discount or the target policy that we're making predictions about it changes the type of prediction that we make so that we predict not expected values but full distributions of return so while we generalize in a different direction though it's good to realize that similarly to how predicting many gdfs can help with representation learning by providing some auxiliary task effect learning distributions instead of expected values could also provide a richer signal that could result in better and more robust learning however there is an important distinction between these two approaches while we when we for 
instance are learning gbfs as in the methods from the as in the previous slides we can reuse the same algorithms from you know the previous lectures from since the beginning of the course just apply them to different humans the problem of learning return distributions actually requires to expand extend our temporal difference algorithms in quite interesting ways and there's several concrete approaches that have been introduced in recent years but i'll discuss just a couple to give you at least a feel of how you can change temporal difference algorithm studio with distributions instead of expectations so the first instance that i want to talk about is what is called the categorical djin agent and the objective of this agent is to learn a categorical approximation of the true return distribution so how do we do this well first the agent needs to define some some fixed combo distribution to act as a support for expressing the categorical approximation of the returns so for instance we might allow the return to assume any fixed value between minus 10 minus 9.9 minus 9 plus 8 9.7 all the way up to plus 10. and then what we might do is we use a neural network to output now not the expected value as we would do traditionally but a vector of probabilities associated to each element of this support so that you can still recover the expected value by for instance computing the dot products between the fixed supports this com distribution that we have defined and the network probabilities through predictions important that you can still recover the expected value because it means that this is a strict generalization so we could for instance still do um what we do traditionally for selecting actions so we could for instance still select actions according to 3d policy with respect by choosing the action with the highest expected value but importantly the way we learn these values has now changed because instead of learning an expected value we have to learn a suitable probability predictions over a fixed categorical support so how do we do this how do we update probabilities that the probabilities that we associate to each possible value of the return well turns out that our temporal difference algorithms can actually be extended to this distributional setting in a relatively clean way so let's look into that this way so as usual we consider a transition so a tuple consisting of a status t or a word rt plus one and this kept gamma in the next state as t plus one what we can do then is to take the predict the network predictions in sd plus one this will provide in some sense our bootstrapping and but take the support of these predictions and shrink it by the discount gamma and shift it by the reward r two plus one then this transform the distributions will co will be a reasonable target for our predicted probabilities in the previous state as t this is a really vanilla transposition of how bootstrapping works for expected value but with an important cabinet and the cavity is that when we shrink and shift to the support of the distribution that we are bootstrapping from well the support doesn't match anymore the support of the distribution that we want to update the one in the previous state st so how do we update the probabilities to match this these two distributions well what we need is an additional step which is to project the new support onto the support that we are making predictions for allo reallocating basically a probability has to minimize this projector error projection error and then at 
that point we can just take the kl between the two distributions so the predicted distribution in sd and the shrink and the shifted distribution in the next state and just minimizes kl to update the probabilities in sd so this actually was shown to work really well in practice and has as a result it's kind of sparked a whole lot of new research in this area because um focusing either on how to use these distributions so can we do more than just provide a richer learning signal and also a lot of research on how we could alternatively represent and learn distributions of research so ideally to go beyond the somewhat crude categorical approximation that i just described so for instance just recently quantile regression was proposed as a way to kind of transpose the parameterization from categorical dqm so that instead of adjusting the probabilities of fixed to ports we instead adjust the support associated to fixed setup probabilities so this provides often better result because the support can now move around to approximate the distribution as well as possible and it's not constrained to a to a fixed range uh that is arbitrarily defined at the beginning and the and this means that it is strictly more flexible because the categorical approximation could instead be quite sensitive to the choice of the bounds of the fixed support and then there you of course there are more extensions you could think of you could maybe adjust both probabilities and the support and there is a lot of ongoing research on this problem that i think is quite exciting and interestingwelcome back to the second part of our introduction to the parallel in the first section we started by discussing how deep neural networks can be used for function approximation in rl and how automatic differentiation specifically can support doing so with relative ease and then we delved into the issues that arise when doing so when using deep learning for function approximation so for instance we discussed how the different choices that we make on the rail side affect the learning dynamics of approximate value functions through phenomena such like the deadly triad or literature propagation several of the issues that we discussed though were ultimately issues of inappropriate generalization so today i want to talk about some of the ideas that can be used to help with this problem by tackling directly the fundamental problem of representation learning in rl i want to stress that this is far from a solved problem so what i will discuss today is you can see it as a partial snapshot of recent research on this challenging issue but not an ultimate answer but the main insight that underlies many of the things that we will discuss today i think is quite important and is that so far our agents optimized the representation for a very narrow objective purely the prediction or maximization of a single scalar reward and this narrow objective is the only thing that is driving the entire representation learning in our d parallel agents and this has some advantages so the ability of building flexible rich representations that are kind of tailored to a specific task at hand is after all the main reason we use deep learning in the first place but it does come with some disadvantages because such a narrow objective can induce an overly specific overfitted state representation that might not support good generalization and this in turn can make agents even more susceptible to issues like the deadly triad if you agree with this premise then maybe the natural 
step is to ask our agents uh to learn about more than just a single task reward have them strive to build the richer knowledge about the world but of course this is you know in a sense it's easier said than done because we in order to do so we need to think of you know what other knowledge should they learn about and there are many many possible choices and since representation learning is not a problem that is exclusively url we can also tap into supervised learning literature for some inspiration but among the many possible ideas that can help build these representation i want to focus on two families of ideas that have attracted very very a lot of interest among our elder researchers in the past years and these are general value functions and distributional value predictions so let's start with the first general value functions if you recall from the beginning of this course rl is based on the so-called reward hypothesis that states that any goal can be represented as the maximization of a suitable scalar award and this hypothesis was originally discussed to argue that maybe arel as a whole is a sufficiently formalism for intelligent gold-oriented behavior but today i want to use this hypothesis to make a different point and argue that if this is the case maybe then all useful knowledge that agents should or could collect in order to support learning can also take the form of predictions about suitable cumulative scalar signals but importantly this predictive knowledge does not need to refer to the single main task circle reward instead the agent could make predictions about many different other scalar quantities and these predictions would still look very much like value functions but it would refer to a different scalar signal and are therefore typically called general value functions the general value function is in a sense very much like a standard value function in that it is a prediction about the expected cumulative discounted sum of a suitable scalar signal under a given policy but in general many functions we make explicit this dependency on the scalar signal c and discount factor gamma and the policy pi because we are open to make different choices for each of these so in the gvf language the scalar signal c that we choose to predict will be called the cumulant while the discount gamma associated with etf we'll still define a horizon for the predictions that we make about the death cumulant and the target policy pi is going to be an arbitrary behavior under which we compute expectations this will not necessarily be the agent policy of course this may still feel a bit abstract so let's be even more concrete let's let's consider some examples so if the cumulant is the main task reward and we compute the expected discounted cumulative sum of this signal under the agent policy and under the age and discount then the gdf prediction problem just reduces to canonical policy elevation that is the problem we have dedicated so much space in previous lectures but if instead for instance the cumulative is the still the main task reward but and we still predict an under agent policy but with this kind of zero well then this becomes an immediate reward prediction problem and we are not supposed to have only one prediction we're making so we could have n equivalents each corresponding for instance to one of the state variables or one of the features and if we predict these under the agent policy with a discount of zero then this becomes the next state prediction problem and of course we can 
you know we can go on and on we can consider many different cumulatives many different horizons many different hypothetical behaviors on which to predict these cumulatives and the beautiful thing about the gbf frameworks is that it allows us to just represent this very rich knowledge about the world but learning all of these predictions with the same mechanism so for instance we could use the same td algorithms that we reserved to to predict make value predictions for the main task reward to predict any of these and the main problem then becomes not how to learn such knowledge which sometimes we can address with the standard rl but how do we use these predictions this rich knowledge about the world to provide our agents with good representations that can support fast learning effective generalization and all the things that we are after one one beautifully simple approach would be to use the gbf's predictions directly as representations and this is what is called predictive state representation or psr and it's based on the argument that for a large sufficiently large and specially diverse set of gbf predictions these will be sufficient statistics for any other predictions that we might want to make including for instance the value predictions for the main task award therefore we can use the gdf predictions themselves as state features and then learn values or policies uh for the main task as linear functions of these predictions and this actually has a number of appealing properties and so i would encourage you to read the paper linked at the top to learn more about this but it's also not the only way of using gbf predictions another option for how to use gdfs for learning state representations is to use them as auxiliary tasks so this use of gps resembles a number of techniques from supervised learning um where forms of auxiliary tasks have been introduced to help with representation learning so i think for instance to the self-supervised learning objectives that are common in computer vision and this approach has the advantage of being especially well suited i think to be rl agents because of the because the compositional nature of neural networks allows to combine auxiliary predictions with the main task predictions with relative ease when we use gbfs as auxiliary tasks the way we typically do so is by sharing the bottom part of a neural network that we're using as function approximator between the main task prediction so for instance a policy prediction or a value prediction and all of the auxiliary gpf predictions and by doing so what what happens is that both the main tasks and auxiliary predictions become a function of a single shared hidden representation and then both the shared and unshared parameters can be optimized by minimizing jointly the losses associated with both types of predictions and the result is that the hidden share representation is forced to become more robust and more general and encode more about the world so this this specific way of using gbfs as auxiliary tasks was for instance implemented in the unreal agent it was introduced by jadabarg and a few others a few years ago so in unreal a neural network is used to map input observations to both a value prediction and a policy prediction because it's based on a on a fairly standard act operating system but additionally a 400 distinct cumulants are constructed as the average change in intensity of pixels between consecutive observations and then an additional head is connected to the hidden representation from 
One well-known instance of this approach is the UNREAL agent, introduced by Jaderberg and colleagues a few years ago. In UNREAL, a neural network maps input observations to both a value prediction and a policy prediction, because it is based on a fairly standard actor-critic system. Additionally, 400 distinct cumulants are constructed as the average change in intensity of the pixels between consecutive observations, and an additional head, connected to the hidden representation from which we make the value and policy predictions, just after the end of the convolutional stack, is trained to predict GVFs for each of these 400 cumulants. These auxiliary losses are referred to as pixel control losses; they are simply summed with the standard policy and value losses, and everything is optimized end to end.

The same kind of system can also be applied when the observations are not images, or when the observations are too large to be used directly to construct this many cumulants. For instance, in the same paper they introduce another, related GVF-based auxiliary task called feature control. It again works by constructing a large number of cumulants, but these are computed as the differences between the activations of the network itself on consecutive steps, rather than as differences in pixel intensities. As with pixel control, once we have this large set of cumulants, however we derived them, we learn the GVFs associated with each of them through an auxiliary prediction head that shares the bottom convolutional stack with the main task policy and value heads; the shared representation is thereby forced to also support these additional auxiliary tasks, and so to become richer and more effective.

This may seem like a small change, but making these GVF predictions, whether of the pixel control or the feature control type, made a real difference. In a plot taken from the UNREAL paper, for instance, you can see how adding these auxiliary predictions (the blue line labelled UNREAL) delivered a huge improvement in the raw performance of an actor-critic system on a suite of challenging navigation tasks, even though the auxiliary predictions only affect the rest of the system through improved representation learning: we are not using them to implement a PSR, or for anything else except propagating gradients and updates into the shared representation.

It might seem surprising at first that this can make such a difference. After all, the cumulants we are predicting do not seem to encode a particularly useful type of knowledge; they may even seem somewhat arbitrary and contrived, particularly in the pixel control setting, and, even worse, we are not actually making use of the resulting predictions in any way. But consider pixel control: making predictions about the variation in intensity of pixels between consecutive observations requires the agent to understand many non-trivial aspects of the environment, for instance how its actions affect the locations of objects in its field of view. So even though the cumulants may seem a bit contrived, they force the representation to encode a lot of useful knowledge, and that is why they are likely to make the representation more useful even for other tasks. This is particularly important, and particularly useful, in settings where the main task reward is sparse and therefore provides very little signal, at least early in training. In contrast, auxiliary cumulants constructed from features or observations can provide a very dense learning signal, which helps bootstrap representation learning even before the agent has had the chance to see its first reward.
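As a rough sketch of how such cumulants might be constructed, the function below computes the average absolute pixel-intensity change over a coarse spatial grid between two consecutive observations; the grid size and the exact definition are illustrative assumptions rather than the precise recipe used in UNREAL.

```python
import numpy as np

def pixel_control_cumulants(obs_prev, obs_next, grid=(20, 20)):
    """Average absolute change in pixel intensity, one cumulant per spatial cell.

    obs_prev, obs_next: arrays of shape (H, W, C), with H and W divisible
    by the grid dimensions. Returns a flat vector of grid[0] * grid[1] cumulants.
    """
    diff = np.abs(obs_next.astype(np.float32) - obs_prev.astype(np.float32))
    diff = diff.mean(axis=-1)                        # average over colour channels
    h, w = diff.shape
    gh, gw = grid
    cells = diff.reshape(gh, h // gh, gw, w // gw)   # carve the image into cells
    return cells.mean(axis=(1, 3)).reshape(-1)       # one scalar cumulant per cell

# Example: a 20x20 grid over 80x80 observations yields 400 cumulants.
c_t = pixel_control_cumulants(np.zeros((80, 80, 3)), np.ones((80, 80, 3)))
assert c_t.shape == (400,)
```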
When the agent does eventually see those rewards, it can then pick up on them more quickly and learn much more effectively.

To understand why using GVFs as auxiliary predictions is actually useful, it is worth thinking about what happens to the feature representation. Let's start with the linear function approximation case, the plot on the left. What happens in our value learning updates under linear function approximation is that we have some fixed feature representation, we construct a target value function using a suitable operator, for instance a one-step temporal difference operator, and this target value is then projected onto the space of functions that we can actually represent under the fixed feature representation; the parameters are updated accordingly. In deep reinforcement learning, the plot in the middle, we have a more complex phenomenon. Again we construct some target value function using a suitable operator, but we then project onto the space of values that we can represent not under the original feature representation, but under a new one that has itself been updated to support this new value target as well as possible. So we are both changing the final value predictions and changing the representation itself to support those predictions. What happens when we add auxiliary GVF predictions, like the ones we discussed with pixel control or feature control, is that we regularize this second step: we prevent the representation from becoming overly specific to the current value predictions, and what we find, at least empirically, is that this regularization does seem to help quite a lot.
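Schematically, and purely as a sketch using notation from earlier lectures ($\Phi$ a fixed feature matrix, $T^{\pi}$ a one-step TD operator, $\Pi_{\Phi}$ the projection onto functions representable with those features, and $\hat v_{\phi,w}$ a network with representation parameters $\phi$ and head parameters $w$), the two regimes can be contrasted as

$$
\text{linear FA:}\;\; v_{k+1} = \Pi_{\Phi}\, T^{\pi} v_k,
\qquad
\text{deep RL:}\;\; (\phi_{k+1}, w_{k+1}) \approx \arg\min_{\phi,\,w} \big\lVert \hat v_{\phi,w} - T^{\pi}\hat v_{\phi_k,w_k} \big\rVert^2 .
$$

In the deep case the representation $\phi$ itself is adapted toward the current bootstrapped target; adding auxiliary GVF losses to this minimization is what regularizes that second step.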
By itself, though, this interpretation perhaps raises more questions than it answers. After all, isn't it desirable for the representation to be updated to support the value predictions as well as possible? Why should regularizing the representation help in the first place? And if it does, which GVFs provide the best regularization? Let's try to answer these one at a time.

The first thing to keep in mind, if we want to understand why using GVFs to regularize the representation is useful at all, is that over the course of learning we will need to approximate many value functions. If we are doing our job properly in RL, the agent will get better over time, so both our data distribution and our value predictions, even in states we have seen in the past, change as the agent's behaviour changes. This means that we want a representation that can support not just good value predictions right now, but the approximation of all the value functions along the path in value space that goes from the values of the initial policy all the way to the values of the optimal policy, the so-called value improvement path. Regularization can help us achieve this by preventing the representation from overfitting to the current values. So consider the space depicted on this slide, which corresponds to the values of all possible policies (the value polytope), together with the path of values that the agent passes through over the course of training. To understand how different choices of GVFs make the representation more or less effective, we need to look at how the choice of target policies and cumulants affects the representation, and how this interacts with the elements we just defined.

Starting from the left: in the first plot, representation learning is driven only by accurately predicting the current value, so there is nothing forcing the representation to align well with any function other than the current value estimate itself. In the second plot we add auxiliary GVFs whose cumulants differ from the main task reward, which means that their values actually live outside the value polytope of the main task; in this case there is no strong theoretical reason to expect the resulting regularization to support the value improvement path, even though, as we saw, it can still help in practice. It is worth realizing that the pixel control and feature control cumulants discussed earlier correspond exactly to this second case. The third plot instead captures the case where we use GVFs to predict the main task reward under a fixed set of different target policies. Now the representation is forced to support a range of values within the polytope, and, given the geometric structure of this space of value functions, it can be shown that for a suitable set and choice of policies the induced representation will capture the principal components of the value polytope, and will therefore provide good support for approximating values in the polytope, including the ones on the value improvement path. Unfortunately, exactly constructing the right set of policies is computationally intractable. So in the final plot we show a concrete approach to picking these policies that is tractable instead. The intuition is that the actual value improvement path, the set of values that we will care to predict during the course of learning, is much smaller than the whole polytope of value functions of all possible policies, so maybe we should just target the values on this path. At any point in time the future policies on the path are of course unknown, but we have already passed through some sequence of policies and associated values during training. So rather than picking target policies arbitrarily, we can take advantage of this trajectory and use a selection of past policies, aligned with the value improvement path up to the current moment, as the target policies of our auxiliary GVFs. This does not guarantee a representation that optimally supports future values, but it must at least support well the values on a subset of the value improvement path, and it gives us a choice that is both reasonably informed and computationally tractable, because we went through these policies during training and therefore have access to them. Indeed, this choice of auxiliary GVFs was found to perform best among all the choices we discussed in a recent empirical study.

**Multi-Task Learning and PopArt**

Learning about multiple GVFs as auxiliary tasks effectively turns agent learning into a multi-task problem, because we are now jointly training a shared representation to support many predictions that we can see as different tasks. This is great for all the reasons we discussed so far, but it also introduces a few challenges.
When we want our agents to learn as much as possible about the world and to make all of these additional predictions, we need to face the fact that we only have limited resources: limited memory, limited representation capacity, limited computation, and so on. Different tasks will therefore always find themselves competing with each other for these shared resources, and any concrete system will need to define some way of trading off these competing demands. The important thing to realize is that there is always a trade-off: even if you do not make it explicit, even if you do not do anything fancy about it, the system will still make some trade-off. For instance, the magnitude of the predictions, and of the induced gradients, will differ between GVF predictions, and this magnitude scales with the frequency and the size of the individual cumulants, so it will be quite different across predictions. This means that the updates from the different predictions will effectively be re-weighted in terms of how much they contribute to shaping the shared parameters. If we want these trade-offs to be sensible, we need to think about them, because otherwise we will still be making trade-offs, just perhaps not the ones we actually want.

To appreciate how important, and also how difficult, this is, and how much gradient magnitudes can differ across different types of predictions, consider the plots on this slide. These three plots were generated by logging the gradient norms during training of a value-based agent on different Atari games, for three different types of agents. The different Atari games constitute different tasks for which you might make value predictions, and in all three plots the lines correspond to different percentiles of the gradient-norm distribution, so the width of this distribution gives you an idea of how diverse gradient magnitudes can be across different prediction tasks, here the values of different Atari games. What you see on the left, for vanilla Q-learning, is that the magnitude of the gradients spans eight orders of magnitude depending on the task, with gradient norms ranging from about 10^-2 up to norms in the order of millions. The second plot shows what happens if the individual rewards are clipped to the small range [-1, +1] and vanilla Q-learning is applied again: this reduces the spread, but the gradient norms still span almost four orders of magnitude. This is because it is not just the size of the individual cumulants in each task that counts: even if the individual rewards are of similar size, the frequency of these rewards differs between tasks, and the gradient magnitude scales with the magnitude of the values, not of the individual rewards. Furthermore, even within an individual task, the gradient magnitude changes during training, because as the agent's behaviour changes, the number and size of the rewards it collects change, and so does the magnitude of the updates. This is already a problem when training deep RL agents on individual tasks; you can imagine, for instance, how hard it can be to tune hyperparameters when the learning dynamics are so different across tasks.
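As a rough, illustrative calculation (the numbers are mine, not taken from the slide), reward clipping alone cannot equalize scales because values compound reward frequency: with a discount of $\gamma = 0.99$, a task that yields a clipped reward of $+1$ on every step has values around

$$ \sum_{k=0}^{\infty} \gamma^{k} \cdot 1 \;=\; \frac{1}{1-\gamma} \;=\; 100, $$

while a task that yields a single $+1$ reward only after roughly 200 steps has values no larger than $\gamma^{200} \approx 0.13$. TD errors, and hence gradient norms, scale with these value magnitudes, so updates can still differ by orders of magnitude even when the individual rewards have identical sizes.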
It also means that any naive multi-task prediction problem, such as predicting auxiliary GVFs, will be really hard to get right unless you do something to control how you trade off the demands of the different tasks. Ideally, we would want the gradients across the different tasks to look like those in the third plot on the slide, the green one: across all of these prediction tasks, the gradient magnitudes are confined within a reasonably small range. This means that we can then be explicit: since we can assume that the gradients themselves have similar magnitude, we can choose explicitly how to trade off between the different tasks. We could assign every task an equal weight, in which case, given that the gradients have roughly equal magnitude, they would shape the representation equally; or we could choose to put a bigger weight on the tasks we consider the main task, and treat the others as auxiliary tasks that still contribute to shaping the representation, but with a smaller weight. The problem is how to get there: how do we reach the point where gradient updates have comparable magnitude across all the many different predictions and tasks our agent could be trained on? The way we get there is by using an algorithm called PopArt; the plots I just showed were in fact generated by running a vanilla Q-learning algorithm with PopArt on top.

Before delving into exactly how this algorithm works, it is worth discussing another question: if these issues can be so dramatic when training deep RL systems to make different predictions, why are they not usually discussed in supervised learning, where we also sometimes use multi-task systems? The reason is that in supervised learning we typically assume a fixed dataset, which means that we can easily normalize both inputs and targets across the entire dataset, for any number of target variables we want to predict, and everything will always be well behaved. This is indeed what we do in supervised learning; we normalize variables before feeding them into a deep learning system without thinking much about it, because it is such a trivial preprocessing step. The problem is that in reinforcement learning we do not have access to a full dataset, and the scale of the predictions is even non-stationary, changing over time. Any normalization scheme therefore needs to be adaptive, so that it keeps normalizing appropriately across the whole duration of training, and this is a much more complicated setting that requires us to think carefully about what we are doing. Luckily, this problem has already been addressed: a few different ideas have been proposed in the literature, but the one I want to discuss today is the PopArt algorithm behind the plots from a few slides ago. This was introduced by Hado and me a few years ago, and it works in two steps. The first step is what is called adaptive target normalization. Consider any one prediction you are making, for instance one of the GVFs: on each update you will typically observe some target for that prediction, for instance a Q-learning target, constructed for whatever cumulant you are learning a GVF for.
What you can then do with PopArt is to normalize this target adaptively, by keeping track of the first moment mu and the second moment nu of the targets for that prediction, for instance with an exponential moving average of the targets associated with one of the GVFs. You then update the network outputs to match, on each step, not the original target but a normalized target, constructed from the original target (for instance a Q-learning target) by subtracting the first moment mu and dividing by the scale sigma, where sigma is estimated from the first and second moments by subtracting the square of mu from nu and taking the square root. This gives you a gradient update that is much better behaved in terms of magnitude, irrespective of the size of the rewards, the frequency of the rewards, and so on. If you apply this normalization independently to each GVF, the gradients applied to the shared parameters of the network will contribute comparably to the shared representation, instead of one or a few of the auxiliary predictions dominating the entire learning process. Importantly, when doing this kind of normalization, you can still recover the unnormalized values by transforming the network outputs, which are trained in the normalized space, back through the statistics sigma and mu; this matters because you do need the unnormalized values in certain circumstances, for instance to construct targets via bootstrapping.

The problem with adaptive normalization as described so far is that every time you update the normalization statistics, which we typically do on each step because we are keeping a moving average, you normalize the update in the current state but inadvertently change the unnormalized predictions in all other states. This does not seem right, because there is no reason to indiscriminately change the values of totally unrelated states; and besides seeming a bit fishy, it also adds non-stationarity, which we know can make life harder for our prediction algorithms. Luckily, we can prevent this from happening at all with a very simple trick. It is based on the observation that most neural networks have a final fully connected, or otherwise linear, layer at the very end, so you can write the network output as a linear transform of the activations of the last hidden layer: the normalized values are some matrix W times those activations, plus a bias vector b, for a suitable W and b and a suitable hidden representation, which in general is produced by any number of non-linear layers, for instance convolutions with ReLU activations. The insight of PopArt is that every time you change the normalization statistics, you can exactly undo this change, in terms of the unnormalized predictions, by making a reverse update to the weights and biases of this last layer. PopArt does so by multiplying the weight matrix by the ratio between the old and the new scale, and by updating the bias with the slightly more elaborate, but still exact, expression shown on this slide.
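Here is a minimal numpy sketch of both steps for a single scalar prediction; the moving-average rate, the epsilon floor on the scale, and the plain SGD step are illustrative choices, not the exact ones from the PopArt paper.

```python
import numpy as np

class PopArtHead:
    """Last linear layer with PopArt-style adaptive target normalization."""

    def __init__(self, hidden_size, beta=1e-3, eps=1e-4):
        self.w = np.zeros(hidden_size)   # weights of the final linear layer
        self.b = 0.0                     # bias of the final linear layer
        self.mu, self.nu = 0.0, 1.0      # running first and second moments
        self.beta, self.eps = beta, eps

    @property
    def sigma(self):
        return np.sqrt(max(self.nu - self.mu ** 2, self.eps))

    def normalized(self, h):
        return self.w @ h + self.b

    def unnormalized(self, h):
        return self.sigma * self.normalized(h) + self.mu   # e.g. for bootstrapping

    def update_stats(self, target):
        """Update mu/nu, then rescale w and b so unnormalized outputs are preserved."""
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
        new_sigma = self.sigma
        self.w *= old_sigma / new_sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma

    def learn(self, h, target, lr=1e-2):
        """One SGD step on the squared error in normalized space."""
        self.update_stats(target)
        norm_target = (target - self.mu) / self.sigma
        err = self.normalized(h) - norm_target
        self.w -= lr * err * h
        self.b -= lr * err
```

The rescaling inside `update_stats` is exactly what keeps `unnormalized(h)` unchanged when mu and sigma move.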
If we do this, we get the best of both worlds: on each step we can still normalize the targets in our gradient updates, as on the previous slides, but the unnormalized predictions that we use, for instance, for bootstrapping are not affected by the continuous change of the normalization statistics, and this prevents instabilities. This method has actually been very successful. For instance, in this plot you can see what happens if you train a single agent to make value and policy predictions for 57 different Atari games, with all of these predictions sharing the bottom layers of the network. The version with PopArt is shown in orange, and it performs much better than a naive baseline that does not normalize the updates for the different tasks: the orange line reaches above human performance in aggregate across the 57 games, while the other baselines struggle to reach even 60 percent of human performance. And while this plot is specifically about Atari, it is important to note that the approach is in no way specific to Atari or to this particular multi-task setting; it can be used whenever you want multiple predictions to share a representation while trading off their relative contributions in a sensible way, as for instance in the GVF-based auxiliary-task scenario we discussed on the previous slides.

**Open Problems in GVF Learning**

In this section I want to discuss a few more advanced topics in GVF learning that we do not yet quite know how to tackle, and that are therefore still a very active area of research. I will not go into much detail on specific solutions, as the methods used to address these issues will likely change as our understanding of the problems improves; instead I will give you a brief overview of what the problems are, and a sneak peek at what the frontier of research looks like in this area.

The first important topic we need to make progress on, in order to really scale these ideas up, is off-policy learning. In the settings we discussed so far, we already have multiple GVFs that are only used as auxiliary tasks and are learned from experience generated by a different, main-task policy. Since the GVFs might refer to a different target policy, a different cumulant, and a different discount, learning these auxiliary tasks already requires off-policy learning, and as you know from the previous lecture we do have some tools to deal with that. The reason I still consider this an open problem is that the degree of off-policyness you might face in this setting, where you are striving to learn rich knowledge about the world, many different predictions, from a single stream of experience, might be quite extreme, and I think we will need fundamental improvements to our off-policy methods to really succeed in learning such diverse predictions about the world as auxiliary tasks. Another reason off-policy learning is interesting in the context of GVF learning is that it is not only a challenge, an obstacle to overcome, but potentially also an opportunity: if we are predicting values for many different cumulants, policies and discounts, we could use these additional predictions not just as auxiliary tasks, but also to generate experience that is more diverse, providing an excellent form of exploration even for learning the main task policy. How best to do so is still an open problem, and again it will require improvements in how well our methods can cope with wildly off-policy data.
Still, even though this remains an open problem, we have at least some proofs of concept that these ideas can work and can provide meaningful improvements. For instance, in the Unicorn paper from a couple of years ago, we showed a multi-task system that learned about many tasks of varying difficulty while sharing experience between all of them, so that each prediction was learned off-policy from data generated by the behaviour induced by all the other predictions; this sharing allowed the system to solve certain hard problems that were impossible to solve when optimizing only for the hardest task.

Another important problem that might need to be revisited in the context of GVF learning is generalization. So far we have treated GVFs as discrete sets of predictions that potentially share a hidden representation but are otherwise learned independently. But how do we scale this to thousands, or millions, of predictions? Learning about all of them independently might not be that effective, just as learning the value of each state independently in an MDP was not very effective. Several people are therefore quite excited about investigating whether we can take an approach similar to the one we used to learn values in large state spaces, and try to generalize what we learn about one GVF to other, related GVFs in some large space of predictions and problems. One concrete approach is to feed some representation of the cumulant and discount that we wish to make a prediction for as an additional input to a single network that makes predictions for all the GVFs we are interested in. Instead of a network that takes one state and outputs multiple independent predictions, this is a network that also takes, as input, a representation of which prediction it is required to make, and it can then generalize across states, goals, tasks, cumulants and discounts using the same function approximation techniques that we have used to generalize across states. GVFs of this kind, where we attempt to generalize across different predictions, are referred to as universal value functions, and they are a very exciting area of research in deep reinforcement learning.
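One simple way to realize this idea is a network with two input streams, one for the state and one for an embedding of the "question" (the cumulant and discount, or a goal); the sketch below is in the spirit of universal value function approximators, and all sizes and the form of the question embedding are assumptions for illustration.

```python
import torch
from torch import nn

class UniversalValueFunction(nn.Module):
    """A single network whose prediction is conditioned on which GVF is asked."""

    def __init__(self, state_dim: int, question_dim: int, hidden: int = 128):
        super().__init__()
        self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.question_enc = nn.Sequential(nn.Linear(question_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # Generalization happens across states AND across questions (cumulant,
        # discount, goal, ...), using the same function approximator for both.
        h = torch.cat([self.state_enc(state), self.question_enc(question)], dim=-1)
        return self.head(h)
```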
The third important open problem I want to mention is discovery, that is, the problem of where the GVFs come from. Even if we know how to learn about all of them off-policy, and even if we know how to generalize across many different predictions, where do the predictions themselves come from? How do we pick which GVFs to learn about? In the previous section we discussed several ways of constructing GVFs: we can build pixel-based cumulants, we can build feature-based cumulants, and we can predict the main task reward under previous policies of the agent. While many of these work in practice, and while at least the value-improvement-path interpretation gives us a relatively principled way of picking GVFs, the research on how to choose what to learn about is really far from concluded. Among the recent approaches, which are quite different from what we discussed so far, I want to briefly mention at least one that we introduced with a few colleagues, including Hado, in the paper "Discovery of Useful Questions as Auxiliary Tasks". There we proposed that agents should perhaps learn from experience which GVFs are useful to learn about, and specifically we proposed to do so by parameterizing the cumulants and the discounts as neural networks, and then using a form of meta-learning called meta-gradients to discover online, while the agent is learning about the main task, which questions it should ask about the world and try to learn. This actually resulted in quite nice performance gains in Atari.

**Distributional Reinforcement Learning**

The final topic for today is what is referred to as distributional reinforcement learning. In all our discussions so far, if you think about it, GVFs were still representing predictive knowledge in the form of expectations: expectations of the cumulative discounted sum of some scalar quantity, as usual. Another approach that has been proposed is to move instead towards learning distributions of returns rather than expected values. This generalizes the usual prediction problem in a different direction: instead of changing the cumulant, the discount, or the target policy that we make predictions about, it changes the type of prediction that we make, so that we predict not expected values but full return distributions. Even though we are generalizing in a different direction, it is good to realize that, similarly to how predicting many GVFs can help representation learning by providing an auxiliary-task effect, learning distributions instead of expected values can also provide a richer signal that results in better and more robust learning. There is, however, an important distinction between the two approaches: when we learn GVFs, as in the methods from the previous slides, we can reuse the same algorithms from earlier in the course and simply apply them to different cumulants, whereas the problem of learning return distributions requires us to expand and extend our temporal difference algorithms in quite interesting ways. Several concrete approaches have been introduced in recent years, but I will discuss just a couple, to give you a feel for how temporal difference algorithms can be adapted to deal with distributions instead of expectations.

The first is what is called the categorical DQN agent, whose objective is to learn a categorical approximation of the true return distribution. How do we do this? First, the agent needs to define some fixed comb distribution to act as the support for expressing the categorical approximation of the returns: for instance, we might allow the return to take any of the fixed values -10, -9.9, -9.8, -9.7, and so on, all the way up to +10.
We then use a neural network to output, not the expected value as we would traditionally do, but a vector of probabilities, one associated with each element of this support. You can still recover the expected value, for instance by computing the dot product between the fixed support, the comb distribution that we defined, and the probabilities that the network predicts. The fact that the expected value can still be recovered is important, because it means this is a strict generalization: we could, for instance, still select actions greedily, by choosing the action with the highest expected value. What has changed, however, is how we learn these values: instead of learning an expected value, we have to learn a suitable probability distribution over a fixed categorical support. So how do we update the probabilities that we associate with each possible value of the return? It turns out that our temporal difference algorithms can be extended to this distributional setting in a relatively clean way, so let's look at how.

As usual, we consider a transition: a tuple consisting of a state s_t, a reward r_{t+1}, a discount gamma, and the next state s_{t+1}. What we can do is take the network's predictions in s_{t+1}, which will in some sense provide our bootstrapping target, but take the support of these predictions and shrink it by the discount gamma and shift it by the reward r_{t+1}. This transformed distribution is then a reasonable target for the predicted probabilities in the previous state s_t. This is a fairly direct transposition of how bootstrapping works for expected values, but with an important caveat: when we shrink and shift the support of the distribution that we are bootstrapping from, that support no longer matches the support of the distribution we want to update, the one in the previous state s_t. So how do we update the probabilities to match these two distributions? We need one additional step, which is to project the new support onto the support that we make predictions for, reallocating the probability mass so as to minimize the projection error. At that point we can simply take the KL divergence between the two distributions, the predicted distribution in s_t and the shrunk-and-shifted distribution from the next state, and minimize this KL to update the probabilities in s_t. This was shown to work really well in practice, and as a result it has sparked a whole lot of new research in this area, focusing both on how to use these distributions (can we do more with them than just provide a richer learning signal?) and on how we could alternatively represent and learn return distributions, ideally going beyond the somewhat crude categorical approximation that I just described.
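Here is a compact numpy sketch of the shrink-shift-project step and the resulting loss, using the same atom range as the example above; the explicit loop is for clarity rather than efficiency, and in a real agent the cross-entropy would be minimized with respect to the network logits.

```python
import numpy as np

V_MIN, V_MAX, NUM_ATOMS = -10.0, 10.0, 201
support = np.linspace(V_MIN, V_MAX, NUM_ATOMS)      # -10.0, -9.9, ..., +10.0

def expected_value(probs):
    """Recover a scalar value from a categorical return distribution."""
    return np.dot(probs, support)

def categorical_td_target(reward, discount, next_probs):
    """Shrink-and-shift the support, then project back onto the fixed atoms."""
    shifted = np.clip(reward + discount * support, V_MIN, V_MAX)
    dz = (V_MAX - V_MIN) / (NUM_ATOMS - 1)
    target = np.zeros(NUM_ATOMS)
    for prob, z in zip(next_probs, shifted):
        b = (z - V_MIN) / dz                  # fractional index of the shifted atom
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:                          # lands exactly on an atom
            target[lo] += prob
        else:                                 # split mass between the two neighbours
            target[lo] += prob * (hi - b)
            target[hi] += prob * (b - lo)
    return target

def distributional_loss(predicted_probs, target_probs):
    """Cross-entropy between the projected target and the prediction in s_t.

    Minimising this matches the KL minimisation described above, up to a
    constant that does not depend on the predicted probabilities.
    """
    return -np.sum(target_probs * np.log(predicted_probs + 1e-8))
```

For action values, the same projection is applied per action, and a greedy policy can still be derived from `expected_value`.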
For instance, quantile regression was recently proposed as a way to transpose the parameterization of categorical DQN: instead of adjusting the probabilities associated with a fixed support, we adjust the support values associated with fixed probabilities. This often gives better results, because the support can now move around to approximate the distribution as well as possible, and it is not constrained to a fixed range that must be defined, somewhat arbitrarily, at the beginning. This makes it strictly more flexible, since the categorical approximation can be quite sensitive to the choice of the bounds of the fixed support. And then, of course, there are further extensions you could think of; you could, for instance, adjust both the probabilities and the support. There is a lot of ongoing research on this problem that I think is quite exciting and interesting.