Scaling the Mountain with Continuous Actor-Critic Methods | PyTorch Tutorial

Creating an Agent with Deep Neural Networks for the Mountain Car Environment

To create an agent that can learn to navigate the Mountain Car environment, we need to initialize it with appropriate parameters. We start by importing the necessary modules and defining the learning rates for our agent. The actor's learning rate, alpha, is set to a very small value, while the critic's learning rate, beta, is slightly larger at 1e-4; the algorithm is quite sensitive to these learning rates. The input dimensions are specified as [2], since the environment's observation has two components.

Next, we define a generic deep neural network class whose constructor unpacks the input dimensions list (the `*input_dims` idiom) so it can be extended to environments with multi-dimensional observations. We use two hidden layers with 256 units each; these layers form the foundation of our agent's decision-making process. The Mountain Car observation consists of two values, the car's position and velocity. Additionally, we set gamma to 0.99, the discount factor that determines the importance of future rewards.
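The network class itself is not reproduced in this summary, so the sketch below is a minimal reconstruction based on the description above; names such as `GenericNetwork`, `fc1_dims`, and `n_actions` are assumptions rather than the exact identifiers used in the video.

```python
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class GenericNetwork(nn.Module):
    """Two-hidden-layer network shared (in structure) by the actor and the critic."""
    def __init__(self, lr, input_dims, fc1_dims, fc2_dims, n_actions):
        super(GenericNetwork, self).__init__()
        # *input_dims unpacks the observation shape, e.g. [2] for Mountain Car
        self.fc1 = nn.Linear(*input_dims, fc1_dims)
        self.fc2 = nn.Linear(fc1_dims, fc2_dims)
        self.fc3 = nn.Linear(fc2_dims, n_actions)
        self.optimizer = optim.Adam(self.parameters(), lr=lr)
        self.device = T.device('cuda:0' if T.cuda.is_available() else 'cpu')
        self.to(self.device)

    def forward(self, observation):
        # Gym returns numpy arrays, so cast to a float tensor on the right device
        state = T.tensor(observation, dtype=T.float).to(self.device)
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # No output activation here; the agent applies exp/tanh later as needed
        return self.fc3(x)
```

The actor will use two outputs (the mean and standard deviation of a normal distribution over actions), while the critic uses a single output for the state value.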

We then create the environment with `gym.make('MountainCarContinuous-v0')` and initialize our agent with the learning rates, input dimensions, gamma, and layer sizes specified above.
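The agent class is only described in outline here, so the following is a minimal sketch built on the `GenericNetwork` above, under these assumptions: the actor's two outputs parameterize a normal distribution, the sampled action is squashed with tanh into [-1, 1], and learning uses a one-step temporal-difference error. The exact alpha used in the video is not recoverable from this summary, so the value below is a placeholder.

```python
import gym
import torch as T

class Agent:
    def __init__(self, alpha, beta, input_dims, gamma=0.99,
                 layer1_size=256, layer2_size=256, n_outputs=1):
        self.gamma = gamma
        self.n_outputs = n_outputs
        self.log_probs = None
        # Actor outputs (mu, sigma); critic outputs a single state value
        self.actor = GenericNetwork(alpha, input_dims, layer1_size, layer2_size, 2)
        self.critic = GenericNetwork(beta, input_dims, layer1_size, layer2_size, 1)

    def choose_action(self, observation):
        mu, sigma = self.actor.forward(observation)
        sigma = T.exp(sigma)  # keep the standard deviation strictly positive
        action_probs = T.distributions.Normal(mu, sigma)
        probs = action_probs.sample(sample_shape=T.Size([self.n_outputs]))
        self.log_probs = action_probs.log_prob(probs).to(self.actor.device)
        action = T.tanh(probs)  # bound the action to [-1, 1]
        return action.item()

    def learn(self, state, reward, new_state, done):
        self.actor.optimizer.zero_grad()
        self.critic.optimizer.zero_grad()
        critic_value_ = self.critic.forward(new_state)
        critic_value = self.critic.forward(state)
        reward = T.tensor(reward, dtype=T.float).to(self.actor.device)
        # Temporal-difference error; (1 - done) drops the bootstrap at episode end
        delta = reward + self.gamma * critic_value_ * (1 - int(done)) - critic_value
        actor_loss = -self.log_probs * delta
        critic_loss = delta ** 2
        (actor_loss + critic_loss).backward()  # one backward pass for both losses
        self.actor.optimizer.step()
        self.critic.optimizer.step()

env = gym.make('MountainCarContinuous-v0')
agent = Agent(alpha=5e-6, beta=1e-4, input_dims=[2], gamma=0.99,
              layer1_size=256, layer2_size=256)  # alpha here is a placeholder value
```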

To train our agent, we iterate over a number of episodes (around 100) and perform the following steps:

1. Reset the environment to its initial state and set the episode score to zero.

2. Set `done` to False, indicating that the episode has not yet ended.

3. While the episode is not done, select an action from the agent's actor (policy) network given the current observation.

4. Step the environment with that action to obtain the reward, the new state, and the done flag.

5. Pass the current observation, reward, new state, and done flag to the agent's `learn` function, which updates both the actor and the critic.

6. Set the old observation to the new state so that the next action is chosen from the updated state, and add the reward to the episode score.

By following these steps, we can train our agent to learn a policy that maps observations to actions. The goal is to maximize the cumulative reward over many episodes.

At the end of each episode, we append the score to the `score_history` list and print a message reporting the episode number and score.
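Putting these steps together, a minimal sketch of the training loop (continuing from the environment and agent created above, and using the older gym step API from the video, which returns four values) might look like this:

```python
import numpy as np

score_history = []
num_episodes = 100

for i in range(num_episodes):
    done = False
    score = 0
    observation = env.reset()
    while not done:
        # This environment expects the action as an array with a single element
        action = np.array(agent.choose_action(observation)).reshape((1,))
        observation_, reward, done, info = env.step(action)
        agent.learn(observation, reward, observation_, done)
        observation = observation_
        score += reward
    score_history.append(score)
    print('episode %d score %.2f' % (i, score))
```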

Finally, we use our `plot_learning` function to visualize the scores in `score_history`. This function plots the learning curve as a running average over a window of recent episodes, allowing us to see how our agent's performance improves or deteriorates over time.
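The `plot_learning` helper is not shown in this summary; a minimal sketch, assuming it plots a running average of the scores over a window of recent episodes and saves the figure, could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_learning(scores, filename, window=20):
    """Plot a running average of episode scores and save the figure to disk."""
    running_avg = np.empty(len(scores))
    for t in range(len(scores)):
        running_avg[t] = np.mean(scores[max(0, t - window):t + 1])
    plt.plot(running_avg)
    plt.title('Running average of previous %d scores' % window)
    plt.xlabel('Episode')
    plt.ylabel('Score')
    plt.savefig(filename)

# Usage after training:
# plot_learning(score_history, 'mountaincar_continuous.png', window=20)
```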

Running the Code in the Terminal

To run our code, we execute the script from the terminal, or from an IDE such as PyCharm or Visual Studio Code.

When we first run the code, we encounter a couple of errors: a simple syntax typo (a missing equals sign) and a "local variable referenced before assignment" error caused by using the wrong variable name inside the action-selection method. After fixing these, the script runs successfully.
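As an aside, the "referenced before assignment" error appears whenever a name that Python treats as local (because it is assigned somewhere later in the function) is read before that assignment runs. The snippet below is a generic illustration of this failure mode, not the exact code from the video:

```python
import torch as T

def choose_action_broken(dist):
    # Bug: `action` is read here, but because it is assigned further down,
    # Python treats it as a local variable and raises UnboundLocalError
    # ("local variable 'action' referenced before assignment") when called.
    sample = action.sample()
    action = T.tanh(sample)
    return action

def choose_action_fixed(dist):
    sample = dist.sample()   # sample from the distribution that was passed in
    action = T.tanh(sample)
    return action
```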

Upon executing the code, we observe that our agent settles on a score near zero, meaning it never learns to climb the mountain. Because it never reaches the goal, it never discovers the +100 reward for doing so; the best it can then do is minimize the negative reward it accumulates, which it achieves by sitting near the bottom of the valley and exerting minimal force. Running for much longer than 100 episodes also exposes instability, because the critic's estimate of the value function is quite noisy.
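This behavior follows directly from the environment's reward structure. The sketch below paraphrases the `MountainCarContinuous-v0` reward as I understand it (the exact coefficient is from memory and should be checked against the gym source): each step is penalized in proportion to the squared action, and a large bonus is granted only on reaching the goal.

```python
def mountain_car_reward(action, goal_reached):
    """Paraphrase of the MountainCarContinuous-v0 per-step reward (approximate)."""
    reward = -0.1 * float(action[0]) ** 2  # small penalty for exerting force
    if goal_reached:
        reward += 100.0                    # large bonus only when the car reaches the flag
    return reward
```

An agent that never reaches the flag therefore maximizes its return simply by keeping the action magnitude as small as possible, which is exactly the near-zero score we observe.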

The limitations of plain actor-critic methods become apparent in this scenario: the agent struggles to learn from experience because its value estimate is unstable. However, this example also highlights the importance of carefully designing rewards and of using refinements such as advantage actor-critic (A2C/A3C), proximal policy optimization (PPO), or deep deterministic policy gradients (DDPG) to improve policy learning.

In conclusion, creating an agent that can navigate the continuous Mountain Car environment is a challenging task that requires careful consideration of reward design, policy learning, and stability. By understanding the limitations of basic actor-critic methods and exploring the more advanced approaches mentioned above, we can develop more effective agents for complex reinforcement learning problems.

"WEBVTTKind: captionsLanguage: enwhat's up everybody in this tutorial you were gonna code an actor critic agent into continuous action space that's right we're gonna go after the continuous mountain car problem using PI torch you don't need any prior knowledge you don't need any experience with PI torch reinforcement learning you just need to be able to follow along now full disclosure this agent doesn't actually learn to beat the environment for reasons that I'll make clear as we get later in the video nevertheless it's really important to study the continuous actor critic problem because it kind of serves as the basis for more advanced algorithms like PPO and deep deterministic policy gradients DD PG so in this video we're gonna go ahead and show all the strengths and limitations of the algorithm just in the interest of transparency and full disclosure let's get to it so as always we start with our imports now as I often do we are going to split this into two separate classes the first class is going to be a generic deep neural network with two hidden layers with a variable number of outputs based on what we need and the second class will be the aging class that handles the functionality for learning choosing actions as well as instantiating the two networks now in actor critic methods we are going to use two deep neural networks one for the actor which tells the agent what action to choose in other words it approximates the policy and the other network is going to be the critic that tells the agent the value of that policy these two play off of each other the critic kind of tells the actor what is good and what is not and they both sample transitions from the environment to minimize the error and their loss functions which causes the agent to converge on a semi optimal policy as I said earlier we're not going to beat the environment but we're gonna get a an interesting solution nonetheless so let's start with the generic Network first now all pi torch classes derive from NN module that's the base class for pi torch and it's what allows us to access the agents parameters as well as some other interesting functionality of course our initialize function is going to take a learning rate input dims the dimensions for the first fully connected layer dimensions for the second fully connected layer as well as the number of actions and we want to call the super constructor I forgot the self so now we want to keep track of all of our parameters next we want to actually construct our network so as I said it's going to be a pretty simple deep neural network with two fully connected layers since the mountain car problem is only two dimensional we really don't have to worry about convolutions although if you wanted to do a convolutional neural network to parse screen images for say space invaders or pong or breakout then this would be the place that you would do it now if you're not familiar with it this verbage here this idiom star self input dims this will unpack a list or a tuple so this gives us some functionality where if we wanted to extend this to include convolutions we could or even not necessarily convolutions but environments that have multi-dimensional observation vectors of course mountain car only has a single dimension but nonetheless let's make it extensible just for giggles and of course the second layer is going to be linear as well and it's going to connect the dimensions for the first fully connected layer to the dimensions of the second fully connected layer and the final output 
is another linear function and that takes FC two dims as inputs and outputs and actions so this will be the basis for our deep neural network for both the actor and the critic the difference between them is going to be the actor will have two outputs and the critic will have one we'll get to that momentarily the next bit we need is the optimizer and that is going to be the atom optimizer and what do we want to optimize is going to optimize the parameters of this network and that's where we get the self dot parameters from if you you may notice that we don't we've never defined this and that's where this derivation from the n n dot module comes in so driving this class from the base class n n dot Manto gives us access to self dot parameters and so we're going to optimize the parameters themselves with the learning rate equal to whatever it is that we specify next thing we need to worry about is the device we want to run this on so if you have a GPU it's a really good idea to use it if you don't don't worry it's not a huge problem so what we're going to do is say cout to 0 for the 0th GPU of course if you have more than one it'll be 0 1 all the way up to n minus 1 GPUs that you have if T Gouda dot is available so what this will do is make sure you have enough RAM available on the GPU to actually support adding another otherwise we'll say CPU if you have more than one G GPU then it would be say CUDA one or what other ever other GPU number you want to use so if you want to you know run multiple models on different GPUs you are more than capable of doing that with Pi torch and finally you want to send the entire network to the device this is a critical step to make sure you're actually computing on whatever device you're specifying and that is it for the definition of our network next we need to define the forward pass and this is a little bit interesting because the actor and critic do totally different forward passes because their computing totally different things and we'll get to the details on that momentarily but for now let's go ahead and define the four feed-forward functions so the first thing you want to do is make sure that your state is actually a tensor in the case of the open AI gym it's going to come in as a numpy array and pi torch is really particular about the types of data that you pass it so you want to make sure that you cast it to a a PI torch tensor and specify the D type of tensor dot float and you also want to cast it to the device this makes sure that it is a CUDA float tensor and not just a float tensor those are different things and it's to make sure that all of the data types of the inputs match up to the datatypes of the network that we've defined up here so next thing we need to do is go ahead and feed for that observation the state through the first layer of the network and perform a value activation so that will perform the first feed-forward and next we want to do the next layer of course and of course that will just take X as input because we've gone from the state sorry from from state to X here and so now X is the input for the second layer and the final layer is just going to be the the third fully connected layer with no activation and the reason is that we're going to handle activation later when we choose actions and when we yeah when we choose actions and the reason is that the activation the range of values that are permissible for the state state value function and the actions are totally different so you don't want to actually activate it here you 
just want to feed it forward and handle the activation later so that is it for the generic Network class it is very straightforward as you can see it's pretty pretty simple so the next thing we need to do is code up our agent class and this just arise from the base object and we need its own initializer and this will take two different learning rates because the two networks the actor and the critic network have their own learning rate so you'll have a learning rate for the actor called alpha learning rate for the critic called beta the input dimensions the gamma which if you're not familiar with it gamma is the discount factor for future rewards so the agent takes in well agent performs a learning step it calculates the value of the present state as well as the next state and you want to discount that next state by some amount and that number ranges between zero and one where a zero makes the agent totally nearsighted only considers the current reward it's receiving from the current state transition and one makes it totally farsighted where it places equal weight on both transitions for this state and the next one and typically its use something and the point nine and above range so we're going to default in actions to two because we already know we're dealing with the mountain car problem and now keep in mind don't get confused here this is the continuous mountain car problem I just want to specify but we do need two actions quote/unquote two actions because of the way that we will calculate the agents policy which I'll get to momentarily so layer one size layer to size and I have to specify defaults for those let's default - I don't know 64 or something like that and we need something calls call the number of outputs which I will get to in a minute of course so let's go ahead and save our parameters gamma we need something called the log probs and the log probs is going to be used in the calculation of the cost function for the actor Network this is just the log of the probability of selecting an action so just to be clear the actor is the deep neural network estimates of the probability of choosing an action given you're in some state or the policy so you're dealing with a probability and when you do the gradient of that quantity you end up with a log term and so you need something that gives you the log probabilities which is this quantity which we just initialize to none we'll calculate it momentarily next we need to define our actual actor and critic networks so the actor is just a generic network with a learning rate of alpha input dims layer 1 size and of course layer 2 size and inactions will just be and actions next we need a critic network and it will be pretty similar of course there are differences a difference is being that we need to specify a learning rate of beta we're going to use the same input dimensions and layer 1 and layer 2 size that's a typo so you could in principle specify different sizes for your actor and critic networks in other words the number of nodes in the hidden to hidden units sorry hidden layers but I haven't experimented with that and it just seems like an unnecessary hyper parameter to tune let's just simplify things and make it nice and straightforward and make them the same so next point of contention is the number of actions so of course for a continuous problem you really only have one action but how do you choose an action you do it through a probability distribution in this case a normal distribution and so that normal distribution is 
characterized by a mean and a standard deviation two quantities and so we have two actions for the actor in other words two outputs not really two actions but two outputs for the actor Network and the critic Network is tasked with estimating the value function for that particular state so it only has a single output because you're only computing a a number a real valued number so we instantiate the two networks and then we are done with the constructor for the agent next we need a way of choosing actions and that of course just takes the observation as input and so the next thing we have to worry about is feeding forward the observation through the network and recall for the actor Network we have two outputs so we're gonna get a mu and a Sigma out and we get that from passing forward the observation through the actor network now the recall that we have simply passed back the value without activating it and so that Sigma could very well be a negative number and of course a negative standard deviation is a totally meaningless quantity right as standard deviation is real valued strictly so you have to make sure that the standard deviation is real valued and in particular non zero right it's not a Dirac Delta function it is a normal distribution so we're just going to exponentiate it with the e to the power of Sigma so then I'll give you some real some positive quantity next up you want to calculate the probabilities of taking the actions by using the tensor sorry the torch distributions in this case a normal distribution modeled by mu and Sigma and so what we're doing here is calculating that normal distribution characterized by mu and Sigma and of course mu and Sigma are the outputs of the neural network for the actor class and that is what the agent is going to seek to learn is going to seek to learn mu and Sigma such that we can maximize the agents reward over time next we want the actual probabilities and sorry an actual action and so we're gonna sample that normal distribution but we need a shape right so you have to know how many things to sample in the normal distribution and that's given by sample shape and that's a tensor size cell 10 outputs so here n outputs is just 1 and this is what will give us the single sample of a normal distribution characterized by mu and Sigma that is a continuous quantity we haven't found it yet but it is a single continuous quantity from a normal distribution next we need to actually calculate the log probability of that and this comes into play when we're doing the actual learning for our network so it's action probs dot log problem of probs and we're gonna send this to the actress device so this gives us the log of that probability and that's used in calculating the loss function for our deep neural network and finally the this quantity is just from the from a normal distribution and it could have really any bounds right it could be from plus -5 minus 5 to the minus 3 or whatever which are of course meaningless quantities for the mountain car problem so we want to bound those actions between plus and minus 1 which if you look at the documentation of the mountain continues mountain car problem is the actual limit of the actions so to do that you wanted to use the tan hyperbolic or tan CH of the probabilities and so this will take whatever number this comes out as and bound it to plus or minus 1 so you end up with a continuous number between plus and minus 1 and then what you want to do is return the action because you've selected it but you don't want 
to return a tensor so this is an actual tensor and you don't want to return the tensor because the opening item doesn't take tensors as input for the step function so you have to actually get the value of the tensor by calling dot item function so now we have figured out how to choose an action and let me change that to choose action not choose actions and finally we have to come to the learning functionality so the learning is a temporal different style which means that it learns at every time step using the old state and new state action and rewards this is very similar to cue learning and will even have a very similar kind of loss function to cue learning but that's kind of really where the similarities end this is in contrast to policy gradient methods which is of which actor critic is a subset but in policy gradient methods it's Monte Carlo style where you accumulate memories over the course of the episode and replay those memories and use that to train your network so in this case we're training the network every time step off of a single observation now in other variants of actor critic this is accommodated by facilitated by a replay memory similar to cue learning where you keep track of a whole set of transitions and then randomly sample them to train your network we're not doing that here we're just going to use one sample that's the current transition but just keep in mind that more advanced algorithms that actually do a better job at solving environments will keep track of a replay memory that it samples over time the samples randomly sorry so the learning function takes a state reward new state and done flag as input so of course we need the current state to get its value the reward the new state to get its value and the done flag because the agent doesn't get a reward after entering the terminal state right so once you're in the terminal state you get no further rewards you're done right you collect your prizes and you go home so you want to have a way of accommodating that and that is through the done flag that we pass in from the open AI gym with all with all sorry I can't talk here with all pi torch learning functions you want to zero out the gradients at the top of your learning function and the reason is that you don't want the gradients from a previous sample to affect the calculation of the gradients from the current sample and it also helps for the performance i've forgotten this and it chugs along after a few iterations so we have to calculate the value of the current state I'll say the next state by passing forward the new state through the critic Network and we also need the value of the current state by passing that through the critic Network keeping with the peculiarities of PI torch we need to cast the reward to be a torch tensor so we'll do that now without pressing the caps lock key and we want to cast a two tensor sorry T dot float and make sure that it's on the device it doesn't matter if you have actor or critic here it's just a reference to a particular device so next you want to calculate the Delta which is the temporal difference loss you can check out Sutton and Bartow for the explanation of this but it's basically the reward plus self dot gamma times the value of the next state critic value underscore times 1 minus the done flag and so what this does the 1 minus done is if the episode is over done is true so you have 1 minus into true 1 minus 1 is 0 so then it totally discards this term and just has the reward minus the value of the current state - 
critic value and of course if you're not done then this is just a 1 - into false so 1 minus 0 is 1 then it's the full expression so that's how we take into account the fact that we'd receive no rewards after the end of the episode so now let's calculate the actual losses so the actor loss is just negative self dot log probs times Delta for a better explanation of this you can check out my previous video on actor critic in discrete action spaces it's the same loss function and the critic loss is just this Delta squared and so this Delta will tell you the difference between the next state and current state and use that to guide the evolution of the probabilities in policy space over time with the actor loss and the critic loss just seeks to minimize that quantity so key thing here is that you have the first power of Delta so if it's positive it'll move the blog problems one way if it's negative and move log problems the other way and here it is just a delta squared so it is positive so it is strictly positive although it could be zero and really what you're doing there is trying to minimize this Delta between critic value underscore and critic value so next we want to sum the two and back propagate that loss you cannot say actor loss top backward and critic loss top backward you can only have a single back propagation step per learning function so you just have to sum them next actor the optimizer steps so once you have calculated grades naught stuff you want to go ahead and run your optimizer and if you don't do this it won't learn I have mistakenly done that just typo'd it out and wondered why I didn't learn that is pretty stupid okay so this is it for the continuous actor critic problem so we have gone ahead and coated up the two relevant classes we need and they're good so we're going to go ahead and code up the main function and test this out let's do that now so of course again we start with our imports we're gonna need numpy because we're gonna have to do some reshaping of our action we'll need Jim of course from actor critic continuous you us we want to import part agent will need my plot learning function I should just make a video on that one day would be pretty quick easy to do and if you want to save the output of the renders to an mp4 file you do from Jim import wrappers so let's go ahead and get to it we want to initiate initialize our agent with an alpha of zero point one two three four five is that right I'm looking at my cheat sheet now one two three I'm just gonna copy and paste it and these learning rates are quite small and I think I'm missing a zero and these are the rates are quite small and it's quite sensitive to perturbations in the inputs of the learning rates so the beta gets a slightly larger learning rate of one by ten to the minus four input dims is just a list of two and the it's the this is why we need the tuple or list unpacking in the definition of our deep neural network and of course the mountain car environment has aligned two parameters which is the mountain the cars position and velocity gamma 0.99 a layer one size of 256 and a layer two size of 256 so now we make our environment Mountain car continuous be zero we're gonna keep track of a score history so we know that it's learning a number of episodes something like a hundred so let's iterate over that and at the top of each set done to false score 2:0 reset your environment and then you're ready to play the actual episode so well not done you want to select an action from the environment choose action 
that takes the observation now a funny thing here is that the action has to be a numpy array so let's go ahead and handle that here numpy array and we'll reshape it to a single element just a peculiar T of this particular environment you wouldn't do this general necessarily so let's go ahead and take that action and after you've taken the action go ahead and learn passing in the old state reward new state and done flag and to make sure that you select an action based on the updated state set old state equal to new state keep track of your score and at the end of the episode go ahead and append that to the score history for plotting and do a print statement to let yourself know that it's running episode I sorry episode and for an eye and then the score percent to F percent score when it's all said and done go ahead and use my handy dandy plot learning function to plot the output score history file name and a window of say 20 most whatever that's 220 whatever window 20 games doesn't really matter so now let's go ahead and head to the terminal and see how many typos as I made one second alright so now we're in the terminal let's go ahead and see how many typos I made uh-huh I forgot an equal sign easy fix let's run it again and it's doing something local variable action referenced before assignment okay let us take a look here I had the code editor so that error is on line 49 Eagles I should be action underscore probs okay all right now back to the terminal see what other typos I have all right it's running so you can see now that it is running and the score is going down down down so I showed the plot learning function at the beginning of the video and I'm gonna go ahead and spin it in here now you see something very very curious it actually settles on a reward near zero and the reason is that if the agent never actually makes it over the hump to get to the goal then it never actually figures out it can get a reward of a hundred and so the best it can possibly do then is try to minimize the negative reward it gets and it does that by basically kind of sitting at the bottom exerting minimum actions rocking back and forth it's a pretty silly solution but it is a viable solution and it's kind of goes to show what happens when you don't really carefully design your rewards now this is an example of where after critic methods go wrong and the reason this happens is aside from the fact that it needs to get over that hump in the first place and it doesn't always do that but it happens that we get instability if you let it run longer than 100 episodes because the estimate of the value function is quite unstable so there are a number of ways of fixing that advantage actor critic asynchronous advantage after critic proximal policy optimization deep deterministic policy gradients you know a whole host of of algorithms that actually serve to fix after critic methods to make them significantly better and indeed they do work quite well when you apply some more finesse to them but this very simple version does relatively well considering what it is and actually gets a decent score in a an intelligent solution to a actually difficult problem so I hope this has been helpful if it was make sure to share this video it really helps me to get found leave a like a comment question you know whatever I answer all my my comments so leave them down below and I look forward to seeing you in the next videowhat's up everybody in this tutorial you were gonna code an actor critic agent into continuous action space that's 