The Challenge of Offline Reinforcement Learning
Offline reinforcement learning is a rapidly growing field in which an agent is trained entirely from a pre-collected dataset of experience, without any online interaction with the environment. This approach has clear advantages: it can exploit large logged datasets, and it applies in settings where online exploration is expensive or unsafe, such as robotic surgery or serving ads from search logs. It also presents a distinctive challenge: the data was collected by other agents or by humans, so its distribution may cover only a narrow slice of the environment and may differ sharply from what the learned policy would do. The agent cannot simply try an unseen action and observe the consequence; it has to reason from someone else's experience.
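To make the setting concrete, here is a minimal sketch of the offline loop: the learner is handed a fixed buffer of (state, action, reward, next state) transitions collected earlier by some other behavior policy, and never queries the environment again. The chain MDP, the random behavior policy, and the plain tabular Q-learning update are all illustrative choices for this sketch, not anything prescribed by the benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.95

def step(s, a):
    """Toy chain MDP: action 1 moves right, action 0 moves left;
    entering the last state gives reward 1 and ends the episode."""
    s2 = int(np.clip(s + (1 if a == 1 else -1), 0, n_states - 1))
    done = s2 == n_states - 1
    return s2, float(done), done

# --- Data collection happened earlier, by someone else's (random) policy ---
buffer = []
for _ in range(200):
    s = 0
    for _ in range(20):
        a = int(rng.integers(n_actions))      # behavior policy, unknown to the learner
        s2, r, done = step(s, a)
        buffer.append((s, a, r, s2, done))
        if done:
            break
        s = s2

# --- Offline learning: only the buffer is used; step() is never called again ---
Q = np.zeros((n_states, n_actions))
for _ in range(100):                          # repeated sweeps over the fixed data
    for s, a, r, s2, done in buffer:
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += 0.1 * (target - Q[s, a])

# States 0-3 should prefer action 1 (move right); state 4 never appears
# as a source state in the buffer, so its row stays at zero.
print("greedy action per state:", Q.argmax(axis=1))
```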
One of the benchmark's core tasks is maze navigation: learn a policy that travels as quickly as possible from any start point to any goal point. The agent is given a database of experience collected by another agent, with no real-time feedback or guidance, and that database will almost never contain the exact route it now needs. Instead it must stitch together pieces of other trajectories: if one logged episode goes from point 1 to point 2 and another goes from point 2 to point 3, an agent asked to go from 1 to 3 can combine them at the shared location. Ideally this stitching happens implicitly inside the learning algorithm rather than by hand.
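A very explicit way to see stitching is to treat the logged transitions as edges of a graph over visited states and search that graph for a route between a new start and goal. This is only a toy illustration of the idea (the benchmark's maze tasks are continuous, so a real algorithm has to do this implicitly); the grid coordinates, trajectories, and helper names here are invented for the example.

```python
from collections import deque

# Two logged trajectories on a discrete grid that share the waypoint (2, 2):
traj_a = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]   # roughly "point 1 -> point 2"
traj_b = [(2, 2), (2, 3), (3, 3), (4, 3), (4, 4)]   # roughly "point 2 -> point 3"

# Build a graph whose edges are exactly the transitions that appear in the data.
edges = {}
for traj in (traj_a, traj_b):
    for s, s_next in zip(traj, traj[1:]):
        edges.setdefault(s, set()).add(s_next)
        edges.setdefault(s_next, set()).add(s)      # assume moves are reversible

def stitch(start, goal):
    """Breadth-first search restricted to transitions seen in the data."""
    queue, parent = deque([start]), {start: None}
    while queue:
        s = queue.popleft()
        if s == goal:
            path = []
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1]
        for s_next in edges.get(s, ()):
            if s_next not in parent:
                parent[s_next] = s
                queue.append(s_next)
    return None

# Neither trajectory goes from (0, 0) to (4, 4), but their combination does.
print(stitch((0, 0), (4, 4)))
```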
The maze task sits alongside a range of others in the benchmark: a grid world in which the agent must reach a target while avoiding walls, MuJoCo locomotion environments, the AntMaze environment, where a quadruped "ant" must control its legs rather than simply move in a chosen direction, and a robotic arm. For each task the benchmark supplies both the offline dataset and an evaluation environment, so different offline RL algorithms can be compared on a common score. The diversity of tasks, and of the ways the data was collected, is precisely what makes the benchmark hard: no single dataset looks like the experience an online learner would generate for itself.
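The released code for the paper (the d4rl package) exposes each task as a Gym environment bundled with its dataset. A loading sketch might look like the following; the package name, the environment IDs, the version suffixes, and the get_dataset call are quoted from memory of that release, so check the repository for the exact names.

```python
# Hedged sketch of loading the benchmark data via the authors' released
# d4rl package; environment IDs and versions may differ from the current release.
import gym
import d4rl  # noqa: F401  (importing registers the offline-RL environments with gym)

for env_id in ["maze2d-umaze-v1",        # 2D maze navigation
               "antmaze-umaze-v0",       # quadruped "ant" navigating a maze
               "halfcheetah-medium-v0",  # MuJoCo locomotion
               "pen-human-v0"]:          # robotic hand, human demonstrations
    env = gym.make(env_id)
    data = env.get_dataset()             # dict of numpy arrays
    print(env_id, data["observations"].shape, data["actions"].shape)
```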
In some cases the replay buffer is built from human demonstrations, which are a valuable but narrow source of data: they capture what a human happened to do, not the full range of actions or behaviors an agent might explore. In the robotic hand manipulation tasks, for example, the human dataset contains only about 5,000 samples, far fewer than an online algorithm doing random exploration would need, and those samples come nowhere near covering the hand's many degrees of freedom evenly, which makes it hard to learn the dynamics and control policies the task demands.
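Because the benchmark ships its data as flat arrays of transitions, a natural first step is to split a dataset into episodes and see how much behavior it actually contains. The sketch below works on a hypothetical dataset dictionary with the array layout described above ('observations', 'actions', 'rewards', 'terminals'); random arrays stand in for a real download, and the sizes are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, obs_dim, act_dim = 5000, 45, 24          # sizes chosen only for illustration

# Hypothetical stand-in for a downloaded demonstration dataset.
dataset = {
    "observations": rng.normal(size=(n, obs_dim)),
    "actions": rng.normal(size=(n, act_dim)),
    "rewards": rng.normal(size=n),
    "terminals": rng.random(n) < 0.01,      # ~1% of steps end an episode
}

# Split the flat arrays into episodes at the terminal flags.
ends = np.flatnonzero(dataset["terminals"])
starts = np.concatenate(([0], ends + 1))
lengths = np.diff(np.concatenate((starts, [n])))
lengths = lengths[lengths > 0]

print(f"{n} transitions, {act_dim}-dimensional actions")
print(f"{len(lengths)} episodes, mean length {lengths.mean():.1f}")
```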
Several families of techniques are relevant here, both as learning methods and as ways the benchmark's data was generated: inverse reinforcement learning, behavior cloning, and classical planning. Inverse reinforcement learning infers a reward function from expert demonstrations, but it usually assumes the expert was optimizing the same objective the learner cares about, which need not hold here, where the reward function is given and the task is simply to learn from the logged data. Behavior cloning treats demonstrations from humans or other agents as supervised data, regressing actions directly onto observations. Planning algorithms use a model of the environment to compute a sequence of actions that reaches a target, with no learning at all. Several of the benchmark's datasets were in fact generated by behavior-cloned policies, by partially trained online RL policies, or by planners.
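Behavior cloning is the easiest of these to make concrete: it ignores rewards entirely and fits a function from observations to the demonstrated actions. Below is a deliberately minimal version, a linear policy fit by ridge-regularized least squares on synthetic stand-in data; a practical implementation would use a neural network, but the structure of the problem is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n, obs_dim, act_dim = 5000, 45, 24

# Stand-in demonstration data: observations plus the actions the expert took.
obs = rng.normal(size=(n, obs_dim))
true_w = rng.normal(size=(obs_dim, act_dim))
acts = obs @ true_w + 0.1 * rng.normal(size=(n, act_dim))   # "expert" actions

# Behavior cloning = supervised regression of actions onto observations.
lam = 1e-3
W = np.linalg.solve(obs.T @ obs + lam * np.eye(obs_dim), obs.T @ acts)

def policy(o):
    """Cloned policy: predict the action the demonstrator would have taken."""
    return o @ W

print("training MSE:", float(np.mean((policy(obs) - acts) ** 2)))
```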
One example of a planning algorithm is A* search, which can compute shortest paths through the maze tasks directly, with no machine learning involved. The important consequence for the benchmark is not that planners outperform reinforcement learning, but that data collected by a planner, like data collected by humans, has a very different distribution from the experience an online RL algorithm would generate for itself, and offline methods have to cope with that mismatch.
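For completeness, here is a compact A* implementation on a small grid maze with a Manhattan-distance heuristic. The maze layout is invented for the example; the point is only that such a planner can fill a replay buffer with shortest-path trajectories whose distribution looks nothing like random exploration.

```python
import heapq

# 0 = free cell, 1 = wall; a small invented maze for illustration.
maze = [[0, 0, 0, 1, 0],
        [1, 1, 0, 1, 0],
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]

def astar(start, goal):
    """A* shortest path with a Manhattan-distance heuristic."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0, start, [start])]
    seen = set()
    while frontier:
        _, cost, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        if pos in seen:
            continue
        seen.add(pos)
        r, c = pos
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(maze) and 0 <= nc < len(maze[0]) and maze[nr][nc] == 0:
                heapq.heappush(frontier, (cost + 1 + h((nr, nc)), cost + 1,
                                          (nr, nc), path + [(nr, nc)]))
    return None

print(astar((0, 0), (4, 4)))   # a shortest path through the invented maze
```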
In the authors' benchmark experiments, most existing offline reinforcement learning algorithms do not perform well on these datasets. They tend to succeed mainly when the replay buffer was generated by an online RL policy during its own training, and they struggle when the data was collected by humans, planners, or other sources whose distribution differs sharply from that of a learning agent. The benchmark, with its datasets, evaluation environments, and released code, therefore poses a challenging open problem: develop offline algorithms that handle these tasks and data distributions with high success rates.
"WEBVTTKind: captionsLanguage: enhi there today we're looking at data sets for data driven reinforcement learning by Justin foo aviral Kumar Bhave our knock'em George Tucker and Sergey Lavine so this is a what you would call a data set paper or a benchmark paper and the main point or the main area of the paper is was called offline reinforcement learning so offline reinforcement learning you usually in reinforcement learning you have this task right you have the agent and you have the environment and the agent gets some sort of observation and it has to come up with an action in response to that observation and then it gets back a reward and another observation and again has to come up with an action and the goal is to maximize the rewards over time that the agent gets while interacting with the environment so usually this is organized in what are called episodes which basically means if you have some sort of environment right and here is the agent and here is the goal well if the goal is a inverted triangle and there are a bunch of walls right here right so it's it looks kind of a maze that the agent has to navigate then one episode areas could be the agent moving round until it either finds the target or hits a wall or it just kind of goes around and around and then at some point you say all right that's enough over game over and usually in reinforcement learning you perform many of these episodes and then you learn from them so you perform episodes and each episode gets into usually some sort of replay buffer right let's call this replay buffer and you do this many times and at the same time that you're doing this you're using the things that you stored here in order to learn write to the agent learns from these things right so it acts with the environment in this loop in this fashion then once it has done an episode it puts it into the reef replay buffer and then it learns from the actions it has performed this is what is usually called online reinforcement learning right so this loop is online online means because the agent learns from its own actions right now in contrast to this there is offline reinforcement learning so in offline reinforcement learning the agent has to learn from someone else's actions right so this connection here is severed instead you have other agents let's call this agent one agent to multiple agents agent three they all have their own interaction with the environment right environment environment interactions and they feed their experience into they perform these these episodes they feed their experience into the replay buffer and then the agent just has to learn from that so whatever happened here this was previous right and now the agent has to learn how to maximize its reward just from the experience that is in the replay buffer from these other agents this is what's called offline reinforcement learning means the agent learns from someone else's actions usually the the power of reinforcement learning of course comes from the fact that you learn from your own actions it means that for example if you already have some successful trajectories here right you found the target you can try to replicate that because you know which actions you performed and if you don't you know change anything you're probably going to find the target again just by randomness alright because you've done it already once and so on so you kinda know all the intrinsics of your own algorithm that led you to reach the target now this is an entirely different case with all of these 
other agents you have no clue how they were acting why they were acting right you just know okay they did a series of actions and that gave them some kind of reward and you have no idea what their reasoning was or anything all you really can learn from is their sequence of actions now why is that problematic right so if all of the agents for example if this is if this is a an actual platform and is really right steep here or this is all of here is really steep cliffs right and you you can actually fall off but the agents they are they're humans right so they don't want to fall off so what they're gonna do is they're just gonna take steps that are maybe like this or maybe like this but they're they're humans they're smart they're never they're never gonna they're never gonna fall off here right why is this a problem if you're not trying to learn from this experience and your policy by some chance because you might have some entropy in there or something you you know what happens if you make a move like this and you also know what happens if you make a move like this right already two humans have done these moves but what happens if you'd make a move like this you just don't know right in classic reinforcement learning you would get a negative reward and you could learn from that to not do this action anymore but in in this case you simply don't have any data to to to tell you what happens when you go off there so you see that there's a there's a problem if you are not able to learn from your own experience but you have to learn from some thing or someone else's experience the distribution of experience that you have available to you might be on like not not fully specific of the of the environment it might be very different from what you would do and it might be very not conducive to what you want to do with it so the task of offline reinforcement learning is harder than online reinforcement learning but it also has many many applications sometimes it's just not possible to do online reinforcement learning when for example in medical field right think of think of the medical field where you want a robot to perform a surgery you can't just do reinforcement learning in with our online techniques because they're just gonna try a bunch of things and see what works and you don't you don't like maybe you want that I don't want that so necessarily you're going to be left with let's have this robot learn from human experts right so that's a task for offline reinforcement learning there are many more tasks for example if you think of search engine you will have many many many logs from human searching things and you just simply store them you simply have them in a buffer now you wanna maybe train a reinforcement learning agent that I don't know serves the best possible ads or something like this you want to do this in a way that you can use all of that data even though that data wasn't collected by that particular agent right there's the crucial difference to supervised learning again is that you have this this interactive structure right this multi-step interactive structure because in a supervised learning you also have you also have this buffer here right in supervised learning you simply have your labeled data set but the difference is in supervised learning you always know what the right action is currently right because you have the labels in offline reinforcement learning you don't know you right you might be here and there are three actions available right and if you all you know is that do 
the the demonstrator a these actors here one of them has done this and then this and then this and then got a - you have you have no clue what happens if you do this and then this and then this right all you know is that this action here might eventually lead to a - right and you also can't try it out because you can't try out this path because you don't get a reward here you have to find and this is the the task here you'll have to find some other example or stitch together they make a good example here so this paper basically proposes a benchmark for offline RL algorithms so what they do is they have a bunch of datasets right they have a bunch of these replay buffers around for different tasks they a collection of this that they collected with various techniques so there is human demonstration there is other agents and so on they have that and you're supposed to take one of them learn something learn an agent and then evaluate it on an environment right and they propose which ones are suitable for this they give you the data and they give you the environment to evaluate it on in the end you'll get a score and you can compare your offline RL algorithm with others they also provide some benchmark implementations for algorithms that already do this and they show that they don't really work well so one of the one of the tasks is this maze here in this maze you'll the the task is you are somewhere let's say here and you need to go somewhere let's say here and you need to find your way right and the demonstrations you have the data in your replay buffer or is such that this is the same task but never the same start and end points like you are tasked so you might have one in your replay buffer you might have one trajectory one episode that went like this from one to two right and you'll be able to see the reward of that and you might have one trajectory that was from two to three like this right so both of these things actually give you really high reward so if you were an agent right and you had to learn and now the task is please go from one to three what you could do is you could simply say ah I know I know the green thing gave a pretty high reward and the yellow thing grave a pretty high reward I know the green sting things started at one and I know the yellow thing ended at three and I know they both have this common location so what I might do just is I might go to that common location and then go on on the different path right see you have to somehow stitch together experience from other agents in order to make your task work now this is a very explicit example of course what we want to do is we want to do this in a more implicit deep burning way ideally and not manually stitch together other trajectories so I'm pretty sure that would not that would not be so so dumb right I'm pretty sure there's a lot of data augmentation you could do during training simply by stitching together other other trajectories right so from this trajectory you could actually not only could you make other gold conditioned ways for example from here to here or from here to here you could make from here to here anywhere where you have shared points you could go um you could train a policy that goes there and then goes further or something like this I'm pretty sure there's already an algorithm that does things like this but I'm just thinking aloud here all right so this is one of the tasks and you see that the the that that you will have to learn a policy to go as fast as possible from any point to any other point 
and you're all you're given is a database of experience that already exists from some other agent but never will probably never the exact route that you need to learn right now all right so the goal is how fast or how efficiently can you do this this is one task in this data set the next task is very similar is this grid world here where there is this red square a red triangle that's your agent and then there is the green square that's your goal or vice versa and so you're basically tasked to not hit the walls here and and to go about your your way finding the target there are more elaborate things like this mu Joko environment here or the Aunt May's where you have this little ant with you know they're the spider legs so this is no longer you can just move in an either direction you have to actually control the legs and there's also this arm this robotic arm so you see there is a wide diversity of tasks and also there is a wide diversity of how the replay buffer was constructed so in some cases the replay buffer is actually constructed by a human performing in this environment so in this hand in this hand manipulation tasks you'll have demonstrations from humans you see it's not particularly many samples here it's a 5,000 samples which I guess are are is a chopped-up version of I'm not really sure how the human things are constructed but you can clearly guess that the degrees of freedom that you have in a robotic hand is much much higher than you could learn just from these 5,000 samples if you were to you know an online RL algorithm that just does random exploration will need much more than these 5,000 samples and the 5,000 samples won't be iid distributed with all the degrees of freedom it will just be here's what a human does right and so so you can think of algorithms like inverse reinforcement learning or something like this but here in inverse reinforcement learning usually you assume that the expert the expert is kind of trying to achieve the same reward as you do but this is not necessarily the case here you have a given reward structure but you are tasked to simply learn from these demonstrations you can see it's also possible that there is this is constructed by a policy and that usually means that they so either it's um it's constructed by let's say a reinforcement learning algorithm that was trained in an online fashion but maybe not as well but also I think they have behavior cloning policy that they got from human demonstration I think so that there are many ways also sometimes you have a planner which is can you imagine it's it's a it's an algorithm that wasn't machine learned so I know almost unthinkable but in these in these kind of mazes you can actually do planning algorithms that can can sort of so I know this is crazy and crazy talk the niche topic but there exist things like a star search where where you can construct the kind of shortest path through these mazes and things like this so yeah that's I know I know that that is that's very niche but you can construct policies like this and then you can use those as your replay buffer filling and you can already see that this also will be a massively different distribution of data then you would get with an online RL algorithm right so in conclusion they do test other they do test other algorithms on this in conclusion they say that most offline our algorithms nowadays they don't work well on these on these datasets the only datasets where they do work well is where the replay buffer was generated by some sort of like 
here by some sort of policy by some sort of reinforcement learning policy so what they would do is they would train an online policy and the experience generated by that online policy while it learns will make up the replay buffer and if you use that replay buffer for offline learning then they say it tends to work okay but if you have other methods of collecting the data that are very different from this offline sorry from an from a reinforcement learning collection approach then it tends not to work as well all right so if you are interested in offline RL please check out this paper all their code is available right here note that the link in the paper doesn't seem to work the true link is here I'll also put it in the description and with that I wish you good day byehi there today we're looking at data sets for data driven reinforcement learning by Justin foo aviral Kumar Bhave our knock'em George Tucker and Sergey Lavine so this is a what you would call a data set paper or a benchmark paper and the main point or the main area of the paper is was called offline reinforcement learning so offline reinforcement learning you usually in reinforcement learning you have this task right you have the agent and you have the environment and the agent gets some sort of observation and it has to come up with an action in response to that observation and then it gets back a reward and another observation and again has to come up with an action and the goal is to maximize the rewards over time that the agent gets while interacting with the environment so usually this is organized in what are called episodes which basically means if you have some sort of environment right and here is the agent and here is the goal well if the goal is a inverted triangle and there are a bunch of walls right here right so it's it looks kind of a maze that the agent has to navigate then one episode areas could be the agent moving round until it either finds the target or hits a wall or it just kind of goes around and around and then at some point you say all right that's enough over game over and usually in reinforcement learning you perform many of these episodes and then you learn from them so you perform episodes and each episode gets into usually some sort of replay buffer right let's call this replay buffer and you do this many times and at the same time that you're doing this you're using the things that you stored here in order to learn write to the agent learns from these things right so it acts with the environment in this loop in this fashion then once it has done an episode it puts it into the reef replay buffer and then it learns from the actions it has performed this is what is usually called online reinforcement learning right so this loop is online online means because the agent learns from its own actions right now in contrast to this there is offline reinforcement learning so in offline reinforcement learning the agent has to learn from someone else's actions right so this connection here is severed instead you have other agents let's call this agent one agent to multiple agents agent three they all have their own interaction with the environment right environment environment interactions and they feed their experience into they perform these these episodes they feed their experience into the replay buffer and then the agent just has to learn from that so whatever happened here this was previous right and now the agent has to learn how to maximize its reward just from the experience that is in the replay 
buffer from these other agents this is what's called offline reinforcement learning means the agent learns from someone else's actions usually the the power of reinforcement learning of course comes from the fact that you learn from your own actions it means that for example if you already have some successful trajectories here right you found the target you can try to replicate that because you know which actions you performed and if you don't you know change anything you're probably going to find the target again just by randomness alright because you've done it already once and so on so you kinda know all the intrinsics of your own algorithm that led you to reach the target now this is an entirely different case with all of these other agents you have no clue how they were acting why they were acting right you just know okay they did a series of actions and that gave them some kind of reward and you have no idea what their reasoning was or anything all you really can learn from is their sequence of actions now why is that problematic right so if all of the agents for example if this is if this is a an actual platform and is really right steep here or this is all of here is really steep cliffs right and you you can actually fall off but the agents they are they're humans right so they don't want to fall off so what they're gonna do is they're just gonna take steps that are maybe like this or maybe like this but they're they're humans they're smart they're never they're never gonna they're never gonna fall off here right why is this a problem if you're not trying to learn from this experience and your policy by some chance because you might have some entropy in there or something you you know what happens if you make a move like this and you also know what happens if you make a move like this right already two humans have done these moves but what happens if you'd make a move like this you just don't know right in classic reinforcement learning you would get a negative reward and you could learn from that to not do this action anymore but in in this case you simply don't have any data to to to tell you what happens when you go off there so you see that there's a there's a problem if you are not able to learn from your own experience but you have to learn from some thing or someone else's experience the distribution of experience that you have available to you might be on like not not fully specific of the of the environment it might be very different from what you would do and it might be very not conducive to what you want to do with it so the task of offline reinforcement learning is harder than online reinforcement learning but it also has many many applications sometimes it's just not possible to do online reinforcement learning when for example in medical field right think of think of the medical field where you want a robot to perform a surgery you can't just do reinforcement learning in with our online techniques because they're just gonna try a bunch of things and see what works and you don't you don't like maybe you want that I don't want that so necessarily you're going to be left with let's have this robot learn from human experts right so that's a task for offline reinforcement learning there are many more tasks for example if you think of search engine you will have many many many logs from human searching things and you just simply store them you simply have them in a buffer now you wanna maybe train a reinforcement learning agent that I don't know serves the best possible 
ads or something like this you want to do this in a way that you can use all of that data even though that data wasn't collected by that particular agent right there's the crucial difference to supervised learning again is that you have this this interactive structure right this multi-step interactive structure because in a supervised learning you also have you also have this buffer here right in supervised learning you simply have your labeled data set but the difference is in supervised learning you always know what the right action is currently right because you have the labels in offline reinforcement learning you don't know you right you might be here and there are three actions available right and if you all you know is that do the the demonstrator a these actors here one of them has done this and then this and then this and then got a - you have you have no clue what happens if you do this and then this and then this right all you know is that this action here might eventually lead to a - right and you also can't try it out because you can't try out this path because you don't get a reward here you have to find and this is the the task here you'll have to find some other example or stitch together they make a good example here so this paper basically proposes a benchmark for offline RL algorithms so what they do is they have a bunch of datasets right they have a bunch of these replay buffers around for different tasks they a collection of this that they collected with various techniques so there is human demonstration there is other agents and so on they have that and you're supposed to take one of them learn something learn an agent and then evaluate it on an environment right and they propose which ones are suitable for this they give you the data and they give you the environment to evaluate it on in the end you'll get a score and you can compare your offline RL algorithm with others they also provide some benchmark implementations for algorithms that already do this and they show that they don't really work well so one of the one of the tasks is this maze here in this maze you'll the the task is you are somewhere let's say here and you need to go somewhere let's say here and you need to find your way right and the demonstrations you have the data in your replay buffer or is such that this is the same task but never the same start and end points like you are tasked so you might have one in your replay buffer you might have one trajectory one episode that went like this from one to two right and you'll be able to see the reward of that and you might have one trajectory that was from two to three like this right so both of these things actually give you really high reward so if you were an agent right and you had to learn and now the task is please go from one to three what you could do is you could simply say ah I know I know the green thing gave a pretty high reward and the yellow thing grave a pretty high reward I know the green sting things started at one and I know the yellow thing ended at three and I know they both have this common location so what I might do just is I might go to that common location and then go on on the different path right see you have to somehow stitch together experience from other agents in order to make your task work now this is a very explicit example of course what we want to do is we want to do this in a more implicit deep burning way ideally and not manually stitch together other trajectories so I'm pretty sure that would not that would not be 
so so dumb right I'm pretty sure there's a lot of data augmentation you could do during training simply by stitching together other other trajectories right so from this trajectory you could actually not only could you make other gold conditioned ways for example from here to here or from here to here you could make from here to here anywhere where you have shared points you could go um you could train a policy that goes there and then goes further or something like this I'm pretty sure there's already an algorithm that does things like this but I'm just thinking aloud here all right so this is one of the tasks and you see that the the that that you will have to learn a policy to go as fast as possible from any point to any other point and you're all you're given is a database of experience that already exists from some other agent but never will probably never the exact route that you need to learn right now all right so the goal is how fast or how efficiently can you do this this is one task in this data set the next task is very similar is this grid world here where there is this red square a red triangle that's your agent and then there is the green square that's your goal or vice versa and so you're basically tasked to not hit the walls here and and to go about your your way finding the target there are more elaborate things like this mu Joko environment here or the Aunt May's where you have this little ant with you know they're the spider legs so this is no longer you can just move in an either direction you have to actually control the legs and there's also this arm this robotic arm so you see there is a wide diversity of tasks and also there is a wide diversity of how the replay buffer was constructed so in some cases the replay buffer is actually constructed by a human performing in this environment so in this hand in this hand manipulation tasks you'll have demonstrations from humans you see it's not particularly many samples here it's a 5,000 samples which I guess are are is a chopped-up version of I'm not really sure how the human things are constructed but you can clearly guess that the degrees of freedom that you have in a robotic hand is much much higher than you could learn just from these 5,000 samples if you were to you know an online RL algorithm that just does random exploration will need much more than these 5,000 samples and the 5,000 samples won't be iid distributed with all the degrees of freedom it will just be here's what a human does right and so so you can think of algorithms like inverse reinforcement learning or something like this but here in inverse reinforcement learning usually you assume that the expert the expert is kind of trying to achieve the same reward as you do but this is not necessarily the case here you have a given reward structure but you are tasked to simply learn from these demonstrations you can see it's also possible that there is this is constructed by a policy and that usually means that they so either it's um it's constructed by let's say a reinforcement learning algorithm that was trained in an online fashion but maybe not as well but also I think they have behavior cloning policy that they got from human demonstration I think so that there are many ways also sometimes you have a planner which is can you imagine it's it's a it's an algorithm that wasn't machine learned so I know almost unthinkable but in these in these kind of mazes you can actually do planning algorithms that can can sort of so I know this is crazy and crazy talk the 
niche topic but there exist things like a star search where where you can construct the kind of shortest path through these mazes and things like this so yeah that's I know I know that that is that's very niche but you can construct policies like this and then you can use those as your replay buffer filling and you can already see that this also will be a massively different distribution of data then you would get with an online RL algorithm right so in conclusion they do test other they do test other algorithms on this in conclusion they say that most offline our algorithms nowadays they don't work well on these on these datasets the only datasets where they do work well is where the replay buffer was generated by some sort of like here by some sort of policy by some sort of reinforcement learning policy so what they would do is they would train an online policy and the experience generated by that online policy while it learns will make up the replay buffer and if you use that replay buffer for offline learning then they say it tends to work okay but if you have other methods of collecting the data that are very different from this offline sorry from an from a reinforcement learning collection approach then it tends not to work as well all right so if you are interested in offline RL please check out this paper all their code is available right here note that the link in the paper doesn't seem to work the true link is here I'll also put it in the description and with that I wish you good day bye\n"