
# Thinking While Moving: Deep Reinforcement Learning with Concurrent Control

## Introduction

In this video, we explore the concept of reinforcement learning (RL) in the context of robotic control. The video demonstrates two robots performing tasks: one labeled "blocking" and another labeled "concurrent." The key difference between these two robots lies in how they process information and execute actions.

The blocking robot pauses its movement after each action to register the state, compute a new action, and then continues moving. This creates noticeable gaps in its motion. On the other hand, the concurrent robot executes actions continuously without stopping, resulting in fluid, uninterrupted movement. The video explains that this difference arises from how reinforcement learning is traditionally formulated versus a new approach called "thinking while moving."

## Classic Reinforcement Learning

In classic RL, there is a dichotomy between the agent and the environment. The environment sends an observation to the agent, which then decides on an action based on its policy (a function that maps states to actions). This process assumes that time is frozen during state registration, action computation, and communication between the agent and environment.

For example, in a typical RL setup like OpenAI Gym, the environment stops until it receives the next action from the agent. Once the action is received, the environment applies it, advances the state, and returns a reward. This process repeats, with time moving forward only while actions are executed; no time passes while the state is registered or the action is computed.
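
As a rough illustration (not code from the paper), the blocking interaction loop looks roughly like the following, using the classic OpenAI Gym API; the environment name and the random stand-in policy are placeholders:

```python
import gym  # classic Gym API (pre-0.26); newer gymnasium differs slightly

env = gym.make("CartPole-v1")  # placeholder task, not the paper's robot
obs = env.reset()
done = False

while not done:
    # Time is effectively frozen here: the environment waits while the
    # agent "registers" the state and "thinks" about what to do.
    action = env.action_space.sample()  # stand-in for a learned policy pi(s)
    # Time only advances inside step(): the action is applied and the
    # environment returns the next state and reward.
    obs, reward, done, info = env.step(action)
```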

The video highlights that this traditional formulation of RL does not account for the real-world dynamics where processing observations and computing actions take time. The robot on the left (blocking) exemplifies this limitation by freezing after each action to process information before executing the next one.

## Limitations of Classic RL

One major limitation of classic RL is its inability to handle concurrent processing of states and actions. In the real world, the environment does not stop while the agent processes information; instead, it continues to evolve. This discrepancy leads to issues where the agent's decisions are based on outdated information because the world has already changed by the time the agent acts.

The video illustrates this problem with an example: if the agent bases its decision on a state observed at time T but executes the action later, the environment may have changed in the meantime. This delay can lead to suboptimal or even incorrect actions, as the agent's knowledge of the world is outdated.
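
A toy numerical example (my own, not from the paper) makes the staleness concrete: if the world keeps drifting during the sensing-and-thinking latency, an action aimed at the observed state misses the true state by the time it is executed.

```python
# Toy illustration: a target drifts at constant speed while the controller
# spends `latency` seconds capturing the state and evaluating its policy.
target_speed = 0.5   # m/s, how fast the world changes during "thinking"
latency = 0.4        # s, state capture + policy inference time

observed_target = 1.0                                     # captured at time t
true_target = observed_target + target_speed * latency    # world at t + latency

action = observed_target            # the agent aims at the stale observation
error = true_target - action
print(f"Action is based on a {latency}s-old state; "
      f"it misses the moving target by {error:.2f} m.")
```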

## New Formulation: Thinking While Moving

To address these limitations, the video introduces a new formulation of RL called "thinking while moving." In this framework, the agent processes observations and computes actions concurrently with executing them. This means that the robot can continue moving smoothly without pausing to process information after each action.

The key idea is that the robot does not freeze after an action; instead, it starts processing the next state while still executing the current action. This overlapping of tasks allows for continuous movement and more efficient interaction with the environment. The video explains that this requires a new mathematical framework that models the agent's decision-making process in a continuous time setting.
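
A minimal sketch of this idea, assuming hypothetical `robot` and `policy` interfaces (`capture_state` and `execute_async` are illustrative names, not the paper's API):

```python
def concurrent_control_loop(robot, policy, num_steps):
    """'Thinking while moving' sketch: the next action is computed while
    the previous one is still being executed, so the motion never pauses."""
    state = robot.capture_state()      # sensing takes real time
    action = policy(state)             # thinking takes real time
    for _ in range(num_steps):
        robot.execute_async(action)    # returns immediately; the arm keeps moving
        state = robot.capture_state()  # captured while the arm is still in motion
        action = policy(state)         # thinking overlaps with execution; the new
                                       # action preempts the old one next iteration
```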

## Technical Details: Continuous Framework

The video delves into the technical aspects of the new formulation. The environment is modeled with a stochastic differential equation in which the state evolves over time according to two functions, F and G. F is the drift term, representing the deterministic part of the state transition, while G scales a Wiener-process (noise) term that introduces the stochasticity a discrete transition distribution would otherwise provide.
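
In rough notation (the paper's exact formulation may differ in details), the dynamics take the form of a stochastic differential equation:

```latex
% F is the drift (deterministic part of the transition); G scales a
% Wiener-process increment dW that injects the stochasticity a discrete
% transition distribution would otherwise provide.
ds(t) = F\big(s(t), a(t)\big)\,dt + G\big(s(t), a(t)\big)\,dW(t)
```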

The reward function is also redefined in this continuous framework. Instead of a discrete reward for each action, the agent accumulates reward continuously as it interacts with the environment, so the return of a trajectory is an integral of the reward over time rather than a sum over steps. This shift from discrete to continuous time requires rethinking how value functions and Q-functions are defined.
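
Concretely, and again in approximate notation, the return of a trajectory becomes an integral, and the value and Q-functions are built on top of it (here h denotes the duration for which an action is held before the policy takes over):

```latex
% Return of a trajectory tau = (s(t), a(t)) over an episode of length H:
R(\tau) = \int_{0}^{H} r\big(s(t), a(t)\big)\,dt

% Value: expected return when following policy pi from state s(t):
V^{\pi}\big(s(t)\big) = \mathbb{E}_{\tau \sim \pi}\left[\int_{t}^{H} r\big(s(t'), a(t')\big)\,dt'\right]

% Q: hold action a for duration h, then follow pi from the resulting state:
Q^{\pi}\big(s(t), a, h\big) = \mathbb{E}\left[\int_{t}^{t+h} r\big(s(t'), a\big)\,dt' + V^{\pi}\big(s(t+h)\big)\right]
```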

A significant concept here is the Bellman backup operator, which is used to update the Q-function in value-based RL. The paper shows that, in this new framework, the Bellman operator is still a contraction mapping, so repeated applications converge to a unique fixed-point Q-function. However, the continuous, concurrent nature of the problem requires adjusting how the operator is defined and applied.
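
For reference, the contraction property itself takes the standard form below; this is the generic statement, and the paper's concurrent operator and norm may differ in the details.

```latex
% A Bellman backup operator T is a gamma-contraction in the sup norm, so
% repeated application converges to a unique fixed-point Q-function.
\lVert \mathcal{T}Q_1 - \mathcal{T}Q_2 \rVert_{\infty} \le \gamma\,\lVert Q_1 - Q_2 \rVert_{\infty}, \qquad 0 \le \gamma < 1
```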

## The Vector-to-Go Mechanism

Another important innovation discussed is the "vector-to-go" mechanism. This concept provides the agent with information about the progress of its current action when transitioning between states. For example, if the robot is moving its arm from one position to another, the vector-to-go informs it how much of the action has been completed and what remains.

This additional information allows the agent to make more informed decisions by considering the ongoing state of actions. Without this mechanism, the agent would have to infer the progress of actions through differences in states, which can be less reliable.
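
A small sketch of how such a feature might be computed and appended to the observation (the helper name and the concatenation layout are illustrative, not the paper's exact definition):

```python
import numpy as np

def vector_to_go(commanded_target, current_pose):
    """Remaining portion of the previously commanded motion at the moment
    the new state is captured (illustrative definition)."""
    return np.asarray(commanded_target) - np.asarray(current_pose)

# Example: the arm was commanded to move to [0.5, 0.2, 0.3] but has only
# reached [0.3, 0.2, 0.1] when the next state is captured.
vtg = vector_to_go([0.5, 0.2, 0.3], [0.3, 0.2, 0.1])   # approx. [0.2, 0.0, 0.2]

# The policy input is then augmented with the previous action and the
# vector-to-go, for example:
state_features = np.zeros(8)                 # placeholder state encoding
previous_action = np.array([0.5, 0.2, 0.3])  # the action still in flight
obs = np.concatenate([state_features, previous_action, vtg])
```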

## Results and Experiments

The video presents experimental results from both simulated robotics and real-world robotic grasping tasks. In the simulated grasping task, the traditional blocking setup reaches a grasp success rate of about 92%. Naively switching to non-blocking execution without any concurrent information reduces success, but providing that information (the previous action and the vector-to-go, together with a timestep penalty) largely recovers it.

Additionally, the video highlights that concurrent control substantially shortens episodes; in the real-world grasping experiments the policy duration is roughly cut in half, at the cost of somewhat lower grasp success than the blocking baseline. This trade-off between success and speed is worth considering depending on the specific application's requirements.

## Conclusion

The video concludes by emphasizing the potential of this new formulation of RL in enabling more efficient and fluid robotic movements. It invites viewers to explore the full paper and supplementary materials, which include diagrams, ablation studies, and additional results. The framework presented has significant implications for real-world applications where continuous interaction with dynamic environments is crucial.

By addressing the limitations of traditional RL and introducing novel concepts like concurrent processing and vector-to-go mechanisms, this research opens new possibilities for developing more sophisticated and adaptive robotic systems.

"WEBVTTKind: captionsLanguage: enhi there so if you look at these two robots the left one labeled blocking the right one labeled concurrent the blocking robot as you can see always has these little pauses in its movement where it does nothing and then it kind of continues with its motion while the one on the right is one continuous motion that it does so the reasoning here is that the robot has a camera and the camera takes some time to register what's going on and then also the robot has a computer inside and the computer also takes some time to decide what to do based on what the camera saw and while all of this is happening the robot on the left just freezes so it performs an action and then it freezes because it takes time to register a state and compute a new action whereas the robot on the right it takes the same amount of time to do these things right it also takes a time to register the state and the computed action but it does that as it is executing the last action so it does this in parallel and then it executes the action once it's computed a new action it executes that new action right on top of the old action and that gives this one big fluid motion so this requires a new formulation of reinforcement learning and that's what this paper does thinking while moving deep reinforcement learning with concurrent control by people from Google brain UC Berkeley and X so I they have a nice diagram here in the supplementary material to show you what is going on in their framework so in classic reinforcement learning right here in classic reinforcement learning you have this dichotomy between agent and environment right so the agent and the environment now the agent is supposed to kind of act in the environment in the following manner the environment will send an observation to the agent the observation in this case is the picture of the of the camera so it sends an observation and the agent will say what to do with the observation which is called a policy the policy PI will take an observation and and then output an action of what to do and will send back the action to that environment right so in classic RL you assume that this part here is kind of freezes time so the environment will output an observation and as and the process of registering the observation of computing the action and of sending the action back is happens in zero time of course it doesn't happen in zero time but in our reinforcement learning problems for example the open a I Jim the environment just stops until it gets the next action and then it performs the action right it performs the action in environment and by that the environment changes and time happens and then it stops again as we think of the next action right so this is we usually call this one step in the in the kind of classic formulation of RL the only point that time happens is when the action is executed no time happens when the state is registered or when the action is computed and that's what you see here on the left so in blue you have the state registration this is for example the camera the camera has some time in order to register and store the image that it has taken right maybe post process it a little bit so that's what the camera does but in our classic formulation as you can see here if this is time it happens instantaneously all at the same time and also this is the policy this is thinking what to do right this is your evaluation of your neural network if this is a neural network drawing a small neural network here this happens 
instantaneously in these formulations and only as the action is executed time happens right and then until here time freezes again and only once the action is determined time happens again in the new formulation now as you've already seen what what we have here is this this kind of continuous framework where let's say you're here it actually takes time for the camera to post process the image it takes more time for you to think about what to do and then once you decide on an action that action is going to happen right but we can we can say for example at this point you tell the camera to take a new picture of the state right but that takes time and while that's happening the old action is still ongoing right so you don't even even have to say the action is still ongoing but the world is still moving right the world is still changing while you think while you post process and while you evaluate your policy the world is thinking and only after some time right after this lag time here have you decided on a new action and then you can break that old action and kind of perform your new action and all of this is happening in time so this this is the new framework now you see the problem here the problem is that you base your decisions on the state and time T on time sorry time age here you base your decisions on the state as it was at that time right that's what you use to think right here that's what you store and think about but you perform so you perform the action at this point in time so there is a considerable difference here because the world has now changed so you see the problem the action you perform is based on an old knowledge of the world and you have basically no way of making the action dependent on the current state of the world because that would require you to capture the current state and that takes time and in that time the world has already shifted again so the agent is kind of required to think ahead about the action that it is currently performing and how the world changes according to that so this new formulation of reinforcement learning formulates this in an in a formal way it formulates it in a formal way and we'll quickly go through that yes so they go into the very basics here we'll quickly quickly go through them so they introduce these quantities like the policy PI the transition distribution the reward and the queue and value function now we'll just quickly go over these so you have the agent sorry about that you have the agent and you have the environment hello so if you think of the agent and the environment the environment has this transition function transition function the transition function it takes it says okay I'm in this state and the agent does this action and here's the probability distribution over the next state right so it says that your little spaceship is here right and the meteors are here and then you push the button if if you push the button for shoot then you will be in the same place the meteors will still be here but you'll have a little shot coming out of your spaceship that's what the environment does right so you give it a state and action and it will give you the next state it will also give you a reward right the reward in the same thing the reward here will be a second output here that tells you either let's say negative one if you're you die or zero if nothing happens or +1 if you shoot a meteor that's the reward so this you can think of as the real world so these two quantities are in the real world in the environment so that's how 
you model the environment then the agent has these quantities called the policy pi so pi what pi does is much like the transition but pi takes in a state and gives you an action right so this is now the agent deciding this is thinking the policy takes in a state and gives an action and this this can takes various forms but it's just a function for now the agent also has a Q and a v function and these are quite quite similar so the Q function what the Q function will do if if you are in a state and you have several options of what to do right you have action one action to and action three right in your in state s the Q function of s and a 1 superscript PI would tell you the following it would tell you what's my expected reward if I'm in state s that's here and perform action a so a 1 so if I now take this path and after this path I follow the policy pi write the policy PI for each of the following so it's like right now I take action a 1 I don't care about my policy but after that I follow the policy PI what is my expected reward going to be until the end of the episode that's the Q function and the value function here very similar but it only cares about the state it says if I'm in state s and I just follow the policy PI even in the first step right I just follow this policy pi what is my expected reward going to be over the course of the episode that is the Q and the value functions you can see why Q learning is popular if you have a good Q function and the Q and the value function these are the things that you actually want to learn right if you have a good Q function you can simply always plug in every action into your Q function and then simply take the maximum the action that has the maximum Q value because that will can that will give you the best reward if if your policy PI right it's kind of self referential if your policy to is to always take the maximum Q value then taking the maximum Q value with the policy given that you take the maximum Q value will be optimal all right this was very convoluted but alright so let's let's start off with modeling the environment in this continuous framework so instead of having the next state be determined by the current state in action in the continuous framework they do this via differential equation so the DS is how does the environment change this is the change in the environment that is determined by two functions F and G so f is your classic environment function it takes in a state and an action at time T right these are not functions and it will it will output how the state changes and the G here is this is a wiener process is to introduce stochasticity as I understand it because in the classic formulation the transition model gives you a probability up here a probability distribution so this Wiener process is responsible for introducing that probabilistic nature into this differential equation but ultimately it simply tells you how does the state change depending on my state current state and action that I perform so the reward function now is also pretty simple the Tao here is a trajectory and the trajectory is simply the state and action over time so if I integrate from time zero to infinity or to the end of the episode my reward function at each point in time right so I go through my episode and I get higher reward not so high and so on so the integral under this curve will be my total reward just like we sum up the reward of individual steps in the discrete case in the continuous case you can think of each infinitesimal time step 
giving you a tiny bit of reward so the entire reward is just an integral then we go on the value function for a given state at time T right so think about what this is the value function for a state means what reward can I expect starting in this particular state and then following policy PI until the end of the episode and that here is the expectation over all trajectories that come from my policy of the reward in that trajectory so I can you know if I'm here my policy now is also a distribution it can go multiple trajectories right and I want to I want to have the expected value of the reward so each one of these has a reward the expected value of the reward over all trajectories starting from state s T and again here you say that that is the integral over T now here I have a bit of a problem because here they say T equals zero going from here and here but here the T is already here so I leave this should be this should be T equals Prime and then T prime T prime T prime up here and t minus T prime or something like this in any case I think it should it should actually start from this state here and not from time zero but I might be missing something I'm not the biggest integrator in the world so you know alright then you have the Q function now think of it what the Q function is in the discrete case the Q function tells you if I'm in state s and perform action a what is my expected reward going to be that I have to introduce some different things here they say if I'm in state s and I act action a at time T until time H right now you have to say how long you're gonna perform the action for until you perform the next action right so H is your Ural act I'm here until you perform the next action so this now I actually agree with this formulation with the integral here so this is going to be the integral from time T to time T plus h that's how it long you perform the action your reward of performing that action right given the state plus the value function at the end of that so you're ear here you're an ST and you perform action a right and then this is your state at time T plus h and then you're here and from there on you could perform many many many actions right but in the original notion of the Q function the Q function tells you if I'm here and I perform this action and after that I act according to pot to a policy PI what is my what is my expected reward and but there's a classic recurrence relation in reinforced learning where you can say the Q function in st given to a is the reward that I get from performing a in state s plus the value function at state s at the stat the next state right because the value function is exactly the reward that you would get by following policy PI in that next state and the Q function means I perform a now and after that I perform PI so this is the continuous analog that's why you have this part here where you perform the action for each time and after each time you just go after go with your policy and that will be the value function so this is the continuous formulation of the of the problem right and now they can introduce these these lagging times so in their diagram up here they define these notions so you have your state s T right here then after this time you capture the new state right so after that time you capture the new state and decide on an action and then you perform it for each time is that correct until here so the the the I minus one of action is performed at this time and the I F action is performed at this time no that makes no sense 
so let's read it ah so this is when you capture the state and you need time to perform to think right this is thinking and then you perform this action at that time this is the lag time now and you perform this action you want to know you want to know if I perform this action until this time here what is what is happening so this is the new cue function takes into account this thing it tells you if I'm in state s and I think this is thinking leads me to here this is the old action right this is the old action that's still happening well I observe this state right so it means if I do this right now and after thinking I do this right so I'm at state I'm at time T and this is still happening and then after I think thinking leads me here T plus TAS I perform this new action a lot of colors I perform this new action at that point until time H what's my cue function so my cue function is going to be the integral time T where I start observing the state and start thinking until T plus TAS that's when I still perform the old action right so this is going to read the reward in the state given the old action and then at that time I switch over to the new action right so at that time until time age now I perform the new action so this entire part here this part until here is taking the place of this first part here in the cue function of this first part right so because before it was simply executing one action we didn't have this concurrency yet so executing the action and after that it's going to be the value function and now it's executing two actions first execute the old action then once you're done thinking execute the new action and then it's the value function from their own I hope this is clear it wasn't clear to me until just now as well alright so they define the Monte Carlo estimator where you can do this with just samples of the of trajectories instead of expectations and then they define the bellmen operator the bellmen backup operator now the bellman back-up operator is an important quantity in value-based reinforcement learning because the bellman backup operators basically what I talked about before it's it tells you that if your policy is them to always select the maximum the action with the maximum Q value right that's what's down here after you do this action then the then the policy you arrive at and you can give certain optimality guarantees but in essence this is clear called a contraction so if you always do that and you calculate your Q function that way it will mean that in the contraction is defined as if you have an operator if you have two things that are X 1 and X 2 that are some apart from each other then after you apply the operator this T here X 1 minus tx.2 they will be closer together which basically means that the q TQ functions of the individual states will be closer together and you'll converge to a single Q function right so given enough time and enough data you'll converge on one q function there's one fixed point q function that you'll converge to and you can show under assumption in classic RL that this is going to be the the optimal Q function the the true let's say Q function so they first prove this and then they prove it now they go back to discrete time so now they were in continuous time they go back to discrete time but now they have a discrete time formulation with this lag here and also they prove that that bellman operator is a contraction so the contraction part basically means that if you perform Q learning you're going to arrive at a solution 
that's what it means to be contraction but now obviously that solution in classic RL is going to be the the optimal Q function but here I actually don't know all right so they try this out and they introduce one last important concept here what they call a vector to go which basically means that at the point where they start thinking where is a good thing to show this at the point where they start thinking they give a they give the the last action with so at this point right here where they sorry where they capture the state they also sort of the state contains a information about what part of the action that you started here is still outstanding so maybe your action was and they illustrate this down here maybe your action was to move your robot arm from down here to up here right that was your plan to action at this point in time now if you are at step if you perform the action here and here you start capturing the next state then you would also give this particular vector here to the to that to the agent so not only will you tell it hey by the way my last action was a t minus one as you would need in the q-value you will also say and this much is outstanding this is much is whereas I I still have to do that much so basically you're saying I wanted to move my arm right here and I still have to do this part of the action now you can see while the algorithm is able to learn much better given that information because otherwise it has it would have to basically infer that vector from kind of differencing the action - the what probably happened in the meantime right so they test this out and what results is the robot videos you've seen before where they say they can recover the original the original q-learning in this continuous framework so here on the left side you have blocking actions and it says when it says yes here it is kind of the old old framework you see the grasp success at like 92% where as if you go to non-blocking actions but do none of the none of the concurrent information the grasp success suffers but you can recover the grasp success if you if you give these concurrent information like even introduce a time step penalty and you give this vector to go and the information about the previous action you can also see that the episode duration here is much lower when you go for the continuous actions then when you are in the old framework naturally because you don't need to pause right in this so this is the simulated robotics and the real world robotic grasping results you see kind of similar results in that if you do have blocking actions your grasp success is higher than if you don't but your duration of your of your policy is cut in half so maybe this is a trade-off worth considering I think this is a is pretty cool framework and I think there's going to be a lot of work still outstanding here and I invite you to check out the paper and look at their videos and their ablation studies of what's important and what not and with that bye byehi there so if you look at these two robots the left one labeled blocking the right one labeled concurrent the blocking robot as you can see always has these little pauses in its movement where it does nothing and then it kind of continues with its motion while the one on the right is one continuous motion that it does so the reasoning here is that the robot has a camera and the camera takes some time to register what's going on and then also the robot has a computer inside and the computer also takes some time to decide what to do based on 
what the camera saw and while all of this is happening the robot on the left just freezes so it performs an action and then it freezes because it takes time to register a state and compute a new action whereas the robot on the right it takes the same amount of time to do these things right it also takes a time to register the state and the computed action but it does that as it is executing the last action so it does this in parallel and then it executes the action once it's computed a new action it executes that new action right on top of the old action and that gives this one big fluid motion so this requires a new formulation of reinforcement learning and that's what this paper does thinking while moving deep reinforcement learning with concurrent control by people from Google brain UC Berkeley and X so I they have a nice diagram here in the supplementary material to show you what is going on in their framework so in classic reinforcement learning right here in classic reinforcement learning you have this dichotomy between agent and environment right so the agent and the environment now the agent is supposed to kind of act in the environment in the following manner the environment will send an observation to the agent the observation in this case is the picture of the of the camera so it sends an observation and the agent will say what to do with the observation which is called a policy the policy PI will take an observation and and then output an action of what to do and will send back the action to that environment right so in classic RL you assume that this part here is kind of freezes time so the environment will output an observation and as and the process of registering the observation of computing the action and of sending the action back is happens in zero time of course it doesn't happen in zero time but in our reinforcement learning problems for example the open a I Jim the environment just stops until it gets the next action and then it performs the action right it performs the action in environment and by that the environment changes and time happens and then it stops again as we think of the next action right so this is we usually call this one step in the in the kind of classic formulation of RL the only point that time happens is when the action is executed no time happens when the state is registered or when the action is computed and that's what you see here on the left so in blue you have the state registration this is for example the camera the camera has some time in order to register and store the image that it has taken right maybe post process it a little bit so that's what the camera does but in our classic formulation as you can see here if this is time it happens instantaneously all at the same time and also this is the policy this is thinking what to do right this is your evaluation of your neural network if this is a neural network drawing a small neural network here this happens instantaneously in these formulations and only as the action is executed time happens right and then until here time freezes again and only once the action is determined time happens again in the new formulation now as you've already seen what what we have here is this this kind of continuous framework where let's say you're here it actually takes time for the camera to post process the image it takes more time for you to think about what to do and then once you decide on an action that action is going to happen right but we can we can say for example at this point you tell the 
camera to take a new picture of the state right but that takes time and while that's happening the old action is still ongoing right so you don't even even have to say the action is still ongoing but the world is still moving right the world is still changing while you think while you post process and while you evaluate your policy the world is thinking and only after some time right after this lag time here have you decided on a new action and then you can break that old action and kind of perform your new action and all of this is happening in time so this this is the new framework now you see the problem here the problem is that you base your decisions on the state and time T on time sorry time age here you base your decisions on the state as it was at that time right that's what you use to think right here that's what you store and think about but you perform so you perform the action at this point in time so there is a considerable difference here because the world has now changed so you see the problem the action you perform is based on an old knowledge of the world and you have basically no way of making the action dependent on the current state of the world because that would require you to capture the current state and that takes time and in that time the world has already shifted again so the agent is kind of required to think ahead about the action that it is currently performing and how the world changes according to that so this new formulation of reinforcement learning formulates this in an in a formal way it formulates it in a formal way and we'll quickly go through that yes so they go into the very basics here we'll quickly quickly go through them so they introduce these quantities like the policy PI the transition distribution the reward and the queue and value function now we'll just quickly go over these so you have the agent sorry about that you have the agent and you have the environment hello so if you think of the agent and the environment the environment has this transition function transition function the transition function it takes it says okay I'm in this state and the agent does this action and here's the probability distribution over the next state right so it says that your little spaceship is here right and the meteors are here and then you push the button if if you push the button for shoot then you will be in the same place the meteors will still be here but you'll have a little shot coming out of your spaceship that's what the environment does right so you give it a state and action and it will give you the next state it will also give you a reward right the reward in the same thing the reward here will be a second output here that tells you either let's say negative one if you're you die or zero if nothing happens or +1 if you shoot a meteor that's the reward so this you can think of as the real world so these two quantities are in the real world in the environment so that's how you model the environment then the agent has these quantities called the policy pi so pi what pi does is much like the transition but pi takes in a state and gives you an action right so this is now the agent deciding this is thinking the policy takes in a state and gives an action and this this can takes various forms but it's just a function for now the agent also has a Q and a v function and these are quite quite similar so the Q function what the Q function will do if if you are in a state and you have several options of what to do right you have action one action to and 
action three right in your in state s the Q function of s and a 1 superscript PI would tell you the following it would tell you what's my expected reward if I'm in state s that's here and perform action a so a 1 so if I now take this path and after this path I follow the policy pi write the policy PI for each of the following so it's like right now I take action a 1 I don't care about my policy but after that I follow the policy PI what is my expected reward going to be until the end of the episode that's the Q function and the value function here very similar but it only cares about the state it says if I'm in state s and I just follow the policy PI even in the first step right I just follow this policy pi what is my expected reward going to be over the course of the episode that is the Q and the value functions you can see why Q learning is popular if you have a good Q function and the Q and the value function these are the things that you actually want to learn right if you have a good Q function you can simply always plug in every action into your Q function and then simply take the maximum the action that has the maximum Q value because that will can that will give you the best reward if if your policy PI right it's kind of self referential if your policy to is to always take the maximum Q value then taking the maximum Q value with the policy given that you take the maximum Q value will be optimal all right this was very convoluted but alright so let's let's start off with modeling the environment in this continuous framework so instead of having the next state be determined by the current state in action in the continuous framework they do this via differential equation so the DS is how does the environment change this is the change in the environment that is determined by two functions F and G so f is your classic environment function it takes in a state and an action at time T right these are not functions and it will it will output how the state changes and the G here is this is a wiener process is to introduce stochasticity as I understand it because in the classic formulation the transition model gives you a probability up here a probability distribution so this Wiener process is responsible for introducing that probabilistic nature into this differential equation but ultimately it simply tells you how does the state change depending on my state current state and action that I perform so the reward function now is also pretty simple the Tao here is a trajectory and the trajectory is simply the state and action over time so if I integrate from time zero to infinity or to the end of the episode my reward function at each point in time right so I go through my episode and I get higher reward not so high and so on so the integral under this curve will be my total reward just like we sum up the reward of individual steps in the discrete case in the continuous case you can think of each infinitesimal time step giving you a tiny bit of reward so the entire reward is just an integral then we go on the value function for a given state at time T right so think about what this is the value function for a state means what reward can I expect starting in this particular state and then following policy PI until the end of the episode and that here is the expectation over all trajectories that come from my policy of the reward in that trajectory so I can you know if I'm here my policy now is also a distribution it can go multiple trajectories right and I want to I want to have the expected 
value of the reward so each one of these has a reward the expected value of the reward over all trajectories starting from state s T and again here you say that that is the integral over T now here I have a bit of a problem because here they say T equals zero going from here and here but here the T is already here so I leave this should be this should be T equals Prime and then T prime T prime T prime up here and t minus T prime or something like this in any case I think it should it should actually start from this state here and not from time zero but I might be missing something I'm not the biggest integrator in the world so you know alright then you have the Q function now think of it what the Q function is in the discrete case the Q function tells you if I'm in state s and perform action a what is my expected reward going to be that I have to introduce some different things here they say if I'm in state s and I act action a at time T until time H right now you have to say how long you're gonna perform the action for until you perform the next action right so H is your Ural act I'm here until you perform the next action so this now I actually agree with this formulation with the integral here so this is going to be the integral from time T to time T plus h that's how it long you perform the action your reward of performing that action right given the state plus the value function at the end of that so you're ear here you're an ST and you perform action a right and then this is your state at time T plus h and then you're here and from there on you could perform many many many actions right but in the original notion of the Q function the Q function tells you if I'm here and I perform this action and after that I act according to pot to a policy PI what is my what is my expected reward and but there's a classic recurrence relation in reinforced learning where you can say the Q function in st given to a is the reward that I get from performing a in state s plus the value function at state s at the stat the next state right because the value function is exactly the reward that you would get by following policy PI in that next state and the Q function means I perform a now and after that I perform PI so this is the continuous analog that's why you have this part here where you perform the action for each time and after each time you just go after go with your policy and that will be the value function so this is the continuous formulation of the of the problem right and now they can introduce these these lagging times so in their diagram up here they define these notions so you have your state s T right here then after this time you capture the new state right so after that time you capture the new state and decide on an action and then you perform it for each time is that correct until here so the the the I minus one of action is performed at this time and the I F action is performed at this time no that makes no sense so let's read it ah so this is when you capture the state and you need time to perform to think right this is thinking and then you perform this action at that time this is the lag time now and you perform this action you want to know you want to know if I perform this action until this time here what is what is happening so this is the new cue function takes into account this thing it tells you if I'm in state s and I think this is thinking leads me to here this is the old action right this is the old action that's still happening well I observe this state right so it means 
if I do this right now and after thinking I do this right so I'm at state I'm at time T and this is still happening and then after I think thinking leads me here T plus TAS I perform this new action a lot of colors I perform this new action at that point until time H what's my cue function so my cue function is going to be the integral time T where I start observing the state and start thinking until T plus TAS that's when I still perform the old action right so this is going to read the reward in the state given the old action and then at that time I switch over to the new action right so at that time until time age now I perform the new action so this entire part here this part until here is taking the place of this first part here in the cue function of this first part right so because before it was simply executing one action we didn't have this concurrency yet so executing the action and after that it's going to be the value function and now it's executing two actions first execute the old action then once you're done thinking execute the new action and then it's the value function from their own I hope this is clear it wasn't clear to me until just now as well alright so they define the Monte Carlo estimator where you can do this with just samples of the of trajectories instead of expectations and then they define the bellmen operator the bellmen backup operator now the bellman back-up operator is an important quantity in value-based reinforcement learning because the bellman backup operators basically what I talked about before it's it tells you that if your policy is them to always select the maximum the action with the maximum Q value right that's what's down here after you do this action then the then the policy you arrive at and you can give certain optimality guarantees but in essence this is clear called a contraction so if you always do that and you calculate your Q function that way it will mean that in the contraction is defined as if you have an operator if you have two things that are X 1 and X 2 that are some apart from each other then after you apply the operator this T here X 1 minus tx.2 they will be closer together which basically means that the q TQ functions of the individual states will be closer together and you'll converge to a single Q function right so given enough time and enough data you'll converge on one q function there's one fixed point q function that you'll converge to and you can show under assumption in classic RL that this is going to be the the optimal Q function the the true let's say Q function so they first prove this and then they prove it now they go back to discrete time so now they were in continuous time they go back to discrete time but now they have a discrete time formulation with this lag here and also they prove that that bellman operator is a contraction so the contraction part basically means that if you perform Q learning you're going to arrive at a solution that's what it means to be contraction but now obviously that solution in classic RL is going to be the the optimal Q function but here I actually don't know all right so they try this out and they introduce one last important concept here what they call a vector to go which basically means that at the point where they start thinking where is a good thing to show this at the point where they start thinking they give a they give the the last action with so at this point right here where they sorry where they capture the state they also sort of the state contains a information 
about what part of the action that you started here is still outstanding so maybe your action was and they illustrate this down here maybe your action was to move your robot arm from down here to up here right that was your plan to action at this point in time now if you are at step if you perform the action here and here you start capturing the next state then you would also give this particular vector here to the to that to the agent so not only will you tell it hey by the way my last action was a t minus one as you would need in the q-value you will also say and this much is outstanding this is much is whereas I I still have to do that much so basically you're saying I wanted to move my arm right here and I still have to do this part of the action now you can see while the algorithm is able to learn much better given that information because otherwise it has it would have to basically infer that vector from kind of differencing the action - the what probably happened in the meantime right so they test this out and what results is the robot videos you've seen before where they say they can recover the original the original q-learning in this continuous framework so here on the left side you have blocking actions and it says when it says yes here it is kind of the old old framework you see the grasp success at like 92% where as if you go to non-blocking actions but do none of the none of the concurrent information the grasp success suffers but you can recover the grasp success if you if you give these concurrent information like even introduce a time step penalty and you give this vector to go and the information about the previous action you can also see that the episode duration here is much lower when you go for the continuous actions then when you are in the old framework naturally because you don't need to pause right in this so this is the simulated robotics and the real world robotic grasping results you see kind of similar results in that if you do have blocking actions your grasp success is higher than if you don't but your duration of your of your policy is cut in half so maybe this is a trade-off worth considering I think this is a is pretty cool framework and I think there's going to be a lot of work still outstanding here and I invite you to check out the paper and look at their videos and their ablation studies of what's important and what not and with that bye bye\n"