Reinforcement Learning with Unsupervised Auxiliary Tasks

**Exploring Auxiliary Tasks for an A3C Reinforcement Learning Agent**

Reinforcement learning has become a central research area in artificial intelligence, with applications in robotics, game playing, and autonomous vehicles. This article looks at "Reinforcement Learning with Unsupervised Auxiliary Tasks", a paper from Google that extends an A3C (asynchronous advantage actor-critic) agent with auxiliary tasks. The running example is a 3D maze that the agent must navigate from pixel inputs, collecting apples and reaching a goal. Rewards in this setting are sparse, so the agent can cover a lot of maze before it receives any signal about which actions were good.

**The Task of Learning Auxiliary Tasks**

The authors extend the base A3C agent with auxiliary control tasks, each with its own reward that the agent tries to maximize alongside the main reward. Two tasks are proposed: "pixel changes" and "network features". In the pixel-change task, the agent is rewarded for changing the pixels of its input, which pushes it to move around rather than walk into a wall or a dead end (although spinning in a circle also changes many pixels). In the network-features task, the agent is rewarded for changing the activations of its own internal network units. Neither task is tied directly to the main reward.
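To make the pixel-change idea concrete, here is a minimal sketch of how such a reward could be computed from two consecutive frames. The function name `pixel_change_reward`, the `cell` parameter, and the choice to average absolute differences over a grid of cells are assumptions made for illustration; the text above only says that the agent is rewarded for changing pixels.

```python
import numpy as np


def pixel_change_reward(prev_frame, next_frame, cell=4):
    """Mean absolute pixel change in each cell x cell region of the frame.

    Hypothetical helper: one reward value per grid cell, so the agent can
    be rewarded for changing pixels in a particular part of the image.
    """
    prev = np.asarray(prev_frame, dtype=np.float32)
    nxt = np.asarray(next_frame, dtype=np.float32)
    diff = np.abs(nxt - prev)
    if diff.ndim == 3:
        # Average over colour channels so each pixel has a single value.
        diff = diff.mean(axis=-1)
    h, w = diff.shape
    # Crop so that the frame divides evenly into cells.
    h, w = h - h % cell, w - w % cell
    grid = diff[:h, :w].reshape(h // cell, cell, w // cell, cell)
    return grid.mean(axis=(1, 3))


# Walking into a wall: the frame does not change, so every reward is zero.
frame = np.zeros((8, 8), dtype=np.float32)
wall_reward = pixel_change_reward(frame, frame)

# Moving: the top-left region changes, so only that cell is rewarded.
moved = frame.copy()
moved[:4, :4] = 1.0
move_reward = pixel_change_reward(frame, moved)
```

Under these assumptions, standing still against a wall yields no reward, while any action that changes what the agent sees is rewarded in the cells that changed.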

**The Role of Auxiliary Tasks in Learning**

The authors argue that auxiliary tasks help the agent learn more useful features and policies, which should improve performance on the main task. The mechanism is parameter sharing: the networks for the main task and for the auxiliary tasks share part of their parameters, so learning to do one thing can also help with another. This differs from autoencoder-style approaches that only predict the next frame, because here each auxiliary task is coupled to a reward. The auxiliary tasks are trained from a replay buffer of the agent's experience. The same buffer is used for additional training of the value function, which the authors call off-policy learning, and for reward prediction.
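One way to write down this idea, following the description in the video: the agent maximizes its main return plus a weighted sum of auxiliary returns, all over the same shared parameters. The notation below ($\theta$ for the shared parameters, $\mathcal{C}$ for the set of auxiliary control tasks, $\lambda_c$ for the task weights, $r^{(c)}_t$ for the auxiliary rewards, $\gamma$ for a discount factor) is chosen for this sketch and is not quoted from the paper.

$$
\max_{\theta} \; \mathbb{E}_{\pi_\theta}\Big[\sum_{t} \gamma^{t} r_t\Big] \;+\; \sum_{c \in \mathcal{C}} \lambda_c \, \mathbb{E}_{\pi^{(c)}_\theta}\Big[\sum_{t} \gamma^{t} r^{(c)}_t\Big]
$$

The first term is the ordinary reinforcement learning objective. Each term in the sum is the return of one auxiliary task under its own policy $\pi^{(c)}_\theta$. Because every term depends on the same $\theta$, progress on an auxiliary task can change the features that the main policy uses.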

**The Design of the Auxiliary Tasks**

The design is deliberately simple. Pixel changes and network features need no labels and no information beyond the agent's own observations and activations. The tasks are general enough to apply to many problems of this kind, yet they still correspond to what matters in a visual game that is about moving around.

**The Choice of Auxiliary Tasks**

One criticism is that the choice of auxiliary tasks is left entirely to the implementer. Pixel changes and network features seem fairly general, but the choice always comes down to how much knowledge about the task is built into the agent, and it is questionable how much domain knowledge even the pixel-change task already encodes. The choices made in the paper look reasonable: they are not too domain-specific, yet they do match visual games that involve moving around.

**The Evaluation of the Auxiliary Tasks**

To evaluate the approach, the authors run a large number of experiments on Labyrinth tasks and also evaluate on Atari. The agent reaches state-of-the-art results, and the comparison with earlier methods appears fair. Because the full agent combines several techniques, however, the results alone do not show how much each component contributes.

**The Criticisms**

The criticisms are twofold. First, as noted above, the auxiliary tasks must be chosen by a human, and that choice encodes some domain knowledge. Second, and more a remark than a criticism, the paper does many things at once: besides the auxiliary tasks, it adds skewed sampling for reward prediction and extra value-function learning from the replay buffer. This makes it hard to tell whether the improvement comes from the new idea, from better hyperparameters, or from interactions between the components.

**The Future Directions**

Future work could focus on auxiliary tasks that are tailored to specific problems, for example by using domain-specific knowledge or by incorporating additional feedback signals into training. It would also help to report results for the base agent with only the auxiliary tasks added, so that their contribution can be measured in isolation.

**The Philosophical Question**

There is also a philosophical question here. If you take a basic A3C agent and augment it with nothing but the auxiliary tasks, you can be relatively sure that any improvement is due to the new idea, but you will not reach state-of-the-art numbers, because everyone else who uses A3C also applies additional tricks. If you add reward prediction, skewed sampling, and extra value learning as well, you reach the highest performance but can no longer tell where the improvement comes from. The speaker in the video leans toward leaving the tricks out, or reporting both.

**The Conclusion**

In conclusion, adding auxiliary tasks to an A3C agent shows promising results. By learning to control pixel changes and network features with shared parameters, the agent picks up features that also help on the main task. The approach has limitations: the auxiliary tasks are chosen by hand, and the gains are hard to separate from the other techniques in the full agent. Experiments that isolate each component would make the contribution of the auxiliary tasks clearer.

**Transcript**

Hi there. Today we're looking at "Reinforcement Learning with Unsupervised Auxiliary Tasks" by Google. In this paper the authors consider a reinforcement learning task, and I can show you what it looks like. It looks like this kind of maze. This is an example they give where you have to navigate the maze: it's 3D, you have to navigate from pixel inputs, you have to collect apples and reach the goal, and this gives you rewards. On the left you can see what the agent actually sees, and on the right you can see it from a top-down view.

The problem is of course that the reward is very sparse, meaning that you have to navigate a lot of maze before you even get a single point. Reinforcement learning has big trouble with this, because it relies on constant reward to notice which actions are good and which actions are bad. So what the authors propose is that, in addition to the regular loss that you would have, so your reward, which is this thing, you would also have an additional set of auxiliary tasks. Here c goes over the auxiliary control tasks that you specify. Each of those has a reward, and you're also trying to maximize these, each with some kind of weight. The thing is that the parameters you maximize over control all of the different tasks, so they are partly shared between the tasks. What you're hoping is that by learning to do one thing, you also learn to do another thing.

We've seen work like this before, where you do it in more of an autoencoder setting. For example, the agent sees the input on the left here and tries to predict what the next input, the next frame, will be. The thought behind this is that if you can accurately predict what the next frame will be, maybe you have learned something useful about the environment. In this work it's different, because now we couple a reward to these tasks.

I can show you what the authors propose as additional rewards. Specifically, they consider these two auxiliary control tasks. The first is pixel changes, which means that the agent actively tries to change pixels: it gets a reward for changing the pixels in the input, so it tries to maximize this. It needs to learn what to do to maximize its pixel changes, and probably that will be moving around. So it will learn to move around and not move against the wall, because if it moves against the wall, the pixels won't change. It will learn to move the way a regular human agent would also move, that is, not into a wall and not into a dead end, such that the pixels always change. Of course it's not perfect: you can also change your pixels quite a bit by simply spinning around in a circle. But this is one of the auxiliary tasks that they augment the agent with.

The other one is network features. It's kind of meta-learning: here you actually reward the agent for changing its own internal activations. The hope is that it learns something about itself, namely how it can activate its internal neural network units, and it gets rewarded for that. So it might want to activate a lot of them and learn how they are activated. With this kind of self-introspection, you hope that it leads to a network that does more sophisticated tasks, or that by trying to get the most pixel changes and the most network features activated, you also learn something useful for the actual task. So these are the two tasks they propose.

In addition, and they have a drawing of this over here, they also do a lot of other things. On the top left you can see the base agent. This is an A3C agent, meaning that it's an actor-critic, so you learn a policy and you learn a value network. We might go over this in a future video. For now, just consider this a standard reinforcement learning agent. You feed its experience into a replay buffer, and out of the replay buffer you do many things. For one, you try to learn these auxiliary tasks. Note that there are shared parameters between all these networks. That's why the auxiliary tasks actually help. You also try to better learn your value function. They call this off-policy learning, because you kind of pause the regular training for a while and then train the value function some more, just because that helps.

You also have reward prediction in here, and the way they do it is with a kind of skewed sampling. Out of all the situations you can be in, the agent will have a reward very, very few times. So what they do is sample out of the replay buffer, out of all the experiences they have had so far, and they sample more frequently the experiences where they actually got a reward. If you zoom in here and look at the experience where you actually get an apple, the hope is that the agent learns a lot faster: "there's some kind of apple there, and if I move towards it, I get a reward." So the hope is that you instantly recognize high-reward situations and are not so interested in non-reward situations. Of course, it does introduce bias in your sampling, and you might decide for yourself whether that's good or bad. Here it seems to work.

There are a lot of experiments on these Labyrinth tasks, and of course, as with all research, they reach state of the art and are much better than anything else. No, I mean, they don't boast that much, so it's actually a fair comparison. They also evaluate on Atari games.

The criticisms that I have are twofold. First of all, the choice of auxiliary tasks is completely up to the implementer, which means that I have to decide, as an implementer of this algorithm, what my auxiliary tasks will be. Here, pixel changes and network features seem like fairly general tasks that you could apply to a lot of these kinds of problems. But it always comes down to how much knowledge about the task you would like to put into the agent. You can see that it makes sense to have at least the pixel changes as an auxiliary task, but it's questionable how much domain knowledge this already encodes. So the choice of these is certainly something that you have to decide as a human. I think these are good choices: they're not too domain-specific, but they do correspond to visual, moving-around game tasks.

The other criticism, which is not really a criticism but just a remark, is that they do a lot of things. The paper is about the auxiliary tasks, but they also do this skewed sampling and the off-policy value learning and so on. Of course you can argue that this is all done by the other reinforcement learning methods too, and that's why it's a fair comparison. I guess it's a philosophical question. If you want to reach state of the art, you first have to get a better method, here the auxiliary tasks, which is the new idea, and then implement all the tricks that other people have discovered. That is good, because you reach the highest performance you can get. But the problem is that you make it harder to compare. You make it harder to see where the improvement is coming from. Have you simply chosen better hyperparameters for the reward prediction? Are there interactions between the auxiliary tasks and the skewed sampling part? All these things wash out, and it's not really clear where the improvement is coming from.

On the other hand, if you simply take a basic algorithm, like just A3C here on the top left, and you augment it with nothing but these auxiliary tasks on the bottom left, and then you see an improvement, you can be relatively sure it's due to your new idea. But of course you won't reach any state-of-the-art numbers, because everyone that does A3C also does these tricks. It's a philosophical question. I am standing more on the side of not doing the tricks, or maybe doing both. Decide for yourself, and have a nice day.
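The skewed sampling described in the transcript can be sketched with a few lines of standard-library Python. The function name `sample_skewed`, the dictionary layout of a transition, and the default `p_rewarding=0.5` are assumptions made for illustration; the transcript only says that rewarding experiences are sampled more frequently than they occur.

```python
import random


def sample_skewed(buffer, rng, p_rewarding=0.5):
    """Draw one transition, favouring those with a non-zero reward.

    Hypothetical helper: with probability p_rewarding the draw comes from
    the rewarding transitions, otherwise from the zero-reward ones.
    """
    rewarding = [t for t in buffer if t["reward"] != 0]
    zero = [t for t in buffer if t["reward"] == 0]
    if not rewarding or not zero:
        # Nothing to skew towards (or away from): fall back to uniform.
        return rng.choice(buffer)
    pool = rewarding if rng.random() < p_rewarding else zero
    return rng.choice(pool)


# Hypothetical replay buffer: 1 rewarding transition out of 100.
buffer = [{"step": i, "reward": 1.0 if i == 42 else 0.0} for i in range(100)]
rng = random.Random(0)
draws = [sample_skewed(buffer, rng) for _ in range(1000)]
rewarding_fraction = sum(t["reward"] != 0 for t in draws) / len(draws)
```

In this toy buffer, rewarding transitions make up 1% of the stored experience but a far larger share of the draws, close to `p_rewarding`. That gap is the sampling bias the speaker mentions: the training data no longer reflects how rarely rewards occur.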