Reinforcement Learning in the OpenAI Gym (Tutorial) - Double Q Learning

Epsilon Greedy Strategy for Dealing with the Explore-Exploit Dilemma

The epsilon greedy strategy is a method for dealing with the explore-exploit dilemma, in which an agent must decide whether to explore its environment or exploit its best known action. The dilemma arises because an agent that is still learning the optimal policy does not yet have enough experience to know which action is truly best. It therefore has to balance exploration and exploitation in order to make progress toward its goal.

Under the epsilon greedy strategy, the agent spends some fraction of its time exploring new actions and the rest exploiting the best known action. In practice, the exploration parameter epsilon is gradually decreased over time, so the agent starts with a high level of exploration and reduces it as it gains experience.

The goal is to strike a balance between the two. By allocating some fraction of its time to exploration, the agent can discover actions that lead to better outcomes and avoid getting stuck in suboptimal solutions; by spending the rest of its time exploiting the best known action, it still accumulates reward and makes use of what it has already learned.
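
As a concrete illustration, here is a minimal sketch of epsilon-greedy action selection written in Python with NumPy. The names (choose_action, Q1, Q2) are placeholders rather than the exact identifiers from the video's script, and the action-value tables are assumed to be dictionaries keyed by (state, action).

```python
import numpy as np

def choose_action(Q1, Q2, state, actions, epsilon):
    """Epsilon-greedy selection over the sum of two action-value tables."""
    if np.random.random() < epsilon:
        # explore: pick a random action
        return int(np.random.choice(actions))
    # exploit: pick the action with the highest combined estimate Q1 + Q2
    values = np.array([Q1[(state, a)] + Q2[(state, a)] for a in actions])
    return actions[int(np.argmax(values))]
```

With a single-estimate Q-learning agent the same idea applies to one table; the sum Q1 + Q2 appears here because double Q-learning, discussed below, keeps two estimates.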

Dealing with Epsilon

Epsilon is a parameter that represents the probability of choosing an action randomly. When epsilon is 1, the agent will always choose an action at random, and when epsilon is 0, the agent will never choose an action at random. In practice, epsilon is usually set to a value between 0 and 1.

The value chosen for epsilon affects the performance of the algorithm. A higher epsilon means the agent explores more, which helps it discover better actions but slows convergence because it frequently acts at random. A lower epsilon means the agent exploits more, which accumulates reward sooner but risks getting stuck in suboptimal solutions because new actions are rarely tried.
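
One common way to move from exploration toward exploitation, and roughly what the video describes doing at the end of each episode, is to shrink epsilon by a small amount until it reaches a floor. The constants below are illustrative, not the exact values used in the video:

```python
EPS_MIN = 0.0                 # illustrative floor; decay toward fully greedy play
EPS_DECAY = 2.0 / 100_000     # illustrative per-episode step size

def decay_epsilon(epsilon):
    """Reduce epsilon slightly after each episode, never going below the floor."""
    return max(EPS_MIN, epsilon - EPS_DECAY)

# typical usage at the end of each episode:
# epsilon = decay_epsilon(epsilon)
```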

Plotting Running Averages

To visualize the performance of the algorithm, a running average of the rewards can be used. The running average is computed by averaging the total reward per episode over a trailing window (for example, the previous 100 episodes). Plotting this average over the course of training shows how well the agent is doing and whether it is converging toward an optimal solution.

In the cart pole problem, the agent receives a reward of +1 for every time step the pole stays upright, so the episode return equals the number of steps the pole was balanced (capped at 200 in the classic CartPole environment). The running average of these returns makes the learning curve easy to read: a curve that rises and levels off near 200 indicates the agent is converging.
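
A minimal sketch of this kind of running-average plot; the 100-episode window matches what the video describes, while the function name and plotting details are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_running_average(total_rewards, window=100):
    """Plot the mean episode reward over the previous `window` episodes."""
    n = len(total_rewards)
    running_avg = np.empty(n)
    for t in range(n):
        running_avg[t] = np.mean(total_rewards[max(0, t - window + 1):t + 1])
    plt.plot(running_avg)
    plt.xlabel('Episode')
    plt.ylabel('Average reward (previous %d episodes)' % window)
    plt.show()
```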

Epsilon Greedy Implementation

In this example, we use the epsilon greedy strategy with a gamma (discount factor) of 1.0. At each step the agent chooses a random action with probability epsilon and the best known action with probability (1 - epsilon); it learns from the resulting transitions and gradually decreases epsilon over time. A gamma of 1.0 is reasonable here because CartPole's rewards and transitions are deterministic, so there is little reason to discount future rewards.

We can also see how hyperparameters affect the algorithm: rerunning the same experiment with gamma set to 0.9 produced noticeably worse results, with the agent only briefly reaching the 200-step maximum before its performance degraded. This highlights the importance of tuning hyperparameters when using reinforcement learning algorithms.
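
To see why gamma has such an effect, it helps to recall how it enters the one-step TD target that Q-learning-style updates bootstrap from. A small illustrative example (the numbers are made up for demonstration):

```python
def td_target(reward, next_state_value, gamma):
    """One-step TD target: immediate reward plus discounted estimate of future value."""
    return reward + gamma * next_state_value

# Same future-value estimate, different gamma -> noticeably different targets:
print(td_target(1.0, 150.0, gamma=1.0))  # 151.0: future reward counts in full
print(td_target(1.0, 150.0, gamma=0.9))  # 136.0: future reward is discounted
```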

Double Q-Learning

Double Q-learning is an extension of the original Q-learning algorithm that maintains two independent estimates of the action values, Q1 and Q2, instead of one. On each step, one of the two estimates is chosen (with probability 0.5) to be updated: the estimate being updated selects the maximizing action in the next state, while the other estimate supplies the value of that action. Because the action is selected and evaluated with different sets of samples, the maximization bias of ordinary Q-learning is removed.
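
The update for a single transition can be sketched as follows. This follows the standard double Q-learning rule rather than reproducing the video's script line by line, and the function name and dictionary layout are assumptions:

```python
import numpy as np

def double_q_update(Q1, Q2, state, action, reward, next_state,
                    actions, alpha, gamma):
    """Apply one double Q-learning update for a single transition.

    With probability 0.5, Q1 is updated: Q1 selects the maximizing action in
    next_state, while Q2 supplies that action's value (and vice versa the
    other half of the time). Decoupling action selection from action
    evaluation is what removes the maximization bias of ordinary Q-learning.
    For a terminal next_state the bootstrap term should be zero (not shown).
    """
    if np.random.random() < 0.5:
        best = max(actions, key=lambda a: Q1[(next_state, a)])
        target = reward + gamma * Q2[(next_state, best)]
        Q1[(state, action)] += alpha * (target - Q1[(state, action)])
    else:
        best = max(actions, key=lambda a: Q2[(next_state, a)])
        target = reward + gamma * Q1[(next_state, best)]
        Q2[(state, action)] += alpha * (target - Q2[(state, action)])
```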

Double Q-learning removes maximization bias, but at the expense of convergence speed, so it typically needs to run for more episodes (the example in the video uses 100,000). In that experiment, a gamma of 1.0 let the agent converge to consistently balancing the pole for the full 200 steps, while a gamma of 0.9 reached the 200-step mark only briefly before its performance declined.

Conclusion

Reinforcement learning is a powerful technique for training agents to make decisions based on rewards and penalties. The epsilon greedy strategy is a common way to handle the explore-exploit dilemma, in which an agent must balance exploration against exploitation. Hyperparameters such as gamma can be tuned to change the performance of the algorithm considerably.

In this article, we have discussed the epsilon greedy strategy for reinforcement learning, including its implementation and the importance of tuning hyperparameters. We have also looked at the effect of different gamma values on performance and introduced double Q-learning as an extension of the original Q-learning algorithm that removes maximization bias.

"WEBVTTKind: captionsLanguage: enwelcome back everybody to machine learning with Phil I am your host dr. Phil in yesterday's video we took a look as sarsa in the opening I Jim getting the cart Pole to balance itself as promised today we are looking at the algorithm of double Q learning also in the cart Pole opening IgM environment so we touched on Q learning many many months ago and the basic idea is that Q learning is a model free bootstrapped off policy learning algorithm what that means is model free it does not need to know it does not need the complete state transition dynamics of the environment to function it learns the game by playing it bootstrapped in that it doesn't need very many very much help getting started it generates estimates using its initial estimates which are totally arbitrary except for the terminal States all policy meaning that it is using a separate policy other than it is using a behavioral policy and the target policy to both learn about the environment and generate behavior respectively now when you deal with problems that when you deal with algorithms that take a maximizing approach to choosing actions you always get something called maximization bias so say you have some set of states with many different actions such that the action value function for that state and all actions is 0 then the agents estimate the Q capital Q of SN a can actually be will actually have some uncertainty to it and that uncertainty is actually a spread in the values right and that spread causes it to have some amount of positive bias and the max of the true values is 0 but the max of the capital Q the agents estimate is positive hence you have a positive bias and that can often be a problem in in reinforcement learning algorithms so this happens because you're using the same set of samples to max to determine the maximizing action as well as the value of that action and one way to solve this problem is to use two separate q functions to determine the max action and the value and you set up a relationship between them and then you alternate between them as you play the game so you're using one of them to determine the max action one of the determine its value and you alternate between them so that you eliminate the bias over time that's double Q learning in a nutshell the algorithm is the following so you initialize your alpha your epsilon where alpha is your learning rate epsilon is what you use for epsilon greedy you would initialize the Q 1 and Q 2 functions for all states and actions in your state and action space of course that's arbitrary except at the terminal states we must have a value of zero and you loop over your set of episodes and you initialize your state and for each episode right each step within the episode choose an action from using your state using an epsilon greedy strategy in the sum of Q 1 and Q 2 so you have the two separate key functions so if you're using single Q learning you would take the max action over just one Q but since you're dealing with two you have to account for that somehow right you could do a max you could do a sum you could do an average in this case we're gonna take the sum of the two Q functions take that action get your reward observe the new state and then with a 0.5 probability either update q1 or q2 according to this update rule here and of course at the end of the step go ahead and set your estate to the new state and keep looping until the game is done so clear as mud I hope so by the way if you want more reinforcement learning 
content make sure to subscribe hit the bell icon so you get notified let's get to it so next up we have our code and here we are inside of our code editor again I'm using Visual Studio code to take a look at our double Q learning script I'm not gonna be typing into the terminal I think that's probably a little bit annoying I'm just gonna review the code as we go along if you have seen my video on the source and algorithm there's gonna be a fair amount of overlap because we're solving the same set of problems over again the only real difference is in that video we use source to calculate the action value function and in this case we're using double Q learning again we have a max action function what this does is tells us the max action for a given state to construct that you make a numpy array out of a list that is for a given state in both actions and as we said in the video we're going to take the sum of the q1 and q2 for given state for both actions you want to take the arc max of that and recall in numpy the Arg max function if there is a tie returns the first element so if the left and right actions both have identical action value functions then returned the left action consistently that may or may not be a problem is just something to be aware of mm-hmm and once again we have to discretize the spaces recall that the carpool problem which is just a cart sliding along a track with a pole that is that must be maintained vertically right and the cart pole example we have a continuous space the X and the theta can be any number within a given range and likewise for the velocities to deal with that we have a couple options we could simply use neural networks to approximate those functions but in this case we're going to use a little trick to discretize the space so we're going to divide it up into ten equal chunks and any number that falls within a particular chunk will be assigned an integer so you'll go from a continuous to a discrete representation of your for vector the observation along with that comes a gets state observing state along that comes a get state function that you pass in the observation and it just uses those digitized spaces excuse me just uses those linear spaces to use the numpy digitize function to get the integer representation of the respective elements of your observation I've also added a function to plot the running average here I do this because in this R sub video we ended up with a little bit of a mess with 50,000 data points this will plot a running average over the prior 100 games next up we have to initialize our hyper parameters our learning rate of 0.1 this just controls the step size in the update equation the gamma is of course the discount factor the agent uses in its estimates of the future rewards so I don't believe this should actually be 0.9 IU I left it here because it's not super critical as far as I'm concerned it should really be 1.0 and the reason is that the purpose of discounting is to account for uncertainties in future rewards if you have some sequence of rewards with a probability of receiving them then it makes no sense to give each of those rewards equal weight because you don't know if you're going to get them in the cart poll example the rewards are certain as far as I'm aware of the state transition probabilities are one you know that if you move right you're going you end up moving right you know deterministically where the pole and the cart are gonna move so it shouldn't be discounted as far as I'm concerned epsilon is just the 
epsilon factor for our for our epsilon greedy algorithm and that's pretty much it for hyper parameters of the model next up we had to construct our state space so what this means oh baby's unhappy the state space is of course the the representation of the digitized space so we're gonna have for the cart position you're gonna have ten buckets the velocities ten buckets and likewise for the thetas thetas position and theta velocity so you're gonna have ten to the four possible states so 10,000 states and those are gonna be numbered all the way from 0 0 0 to 9999 that's all we're doing here is we're constructing the set of states next up we have to initialize our Q functions recall that the initialization is arbitrary except for the terminal state which must have a value of 0 the reason for this is that the terminal state by definition has a future value of 0 because you stopped playing the game right makes sense you could initialize this randomly you could initialize it with minus 1 plus 1 doesn't really matter so long as the terminal state is 0 for simplicity I'm initializing everything at 0 I'm gonna play a hundred thousand games the reason is that this algorithm eliminates bias but at the expense of convergence speed so you have to let it run a little bit longer an array for keeping track of the total rewards and we're gonna loop over a hundred thousand games printing out every 5,000 games to let us know it's still running I always want to reset you're done flying your rewards and reset the episode at the top and you're gonna loop over the episode getting your state calculating a random number for your epsilon greedy strategy you're gonna set the action to be the max action of Q 1 and Q 2 if the random number is less than 1 minus Epsilon otherwise you're gonna randomly sample your action space in any event you take that action get your new state reward and done flag and go ahead and tally up your reward and convert that observation to a state s Prime then go ahead and calculate a separate random number the purpose of this random number is to determine which q function we're going to update you know we're gonna be using one to calculate we're alternating between them because we have to eliminate the maximization bias right one is for finding the max action one is for finding the value of that action we alternate between episodes by way of this random number in both cases you want to collect the you want to calculate the max action either q1 or q2 and use the update rule I showed you in the slides to update the estimates for q1 and q2 as you go at the end of the episode I'm sorry at the end of the step excuse me you want to reset the old observation to the new one so that way you can get the state up here and at the end of the episode you want to go ahead and decrease Epsilon if you're not familiar with this epsilon greedy is just a strategy for dealing with the Explorer exploit dilemma so an agent always has some estimate of the future rewards based on its model the environment or its experience playing the game if it's model free or a model of the problem right it can either explore or exploit its best known action so one way of dealing with the dilemma of how much time should you spend exploring versus how much time should you spend exploiting is to use something called epsilon greedy meaning that some percentage of time you explore or some percentage of the time you exploit and the way that you get it to settle on a greedy strategy is to gradually decrease that exploration parameter 
epsilon over time and that's what we're doing here and of course you want to keep track of the total rewards for that episode and recall in the carpol example the agent gets a reward of positive 1 every time the pole stays vertical and so every move that it doesn't flop over it gets one point and at the end you're gonna go ahead and plot your running averages so I'm gonna go ahead and run that and that'll take a minute while it's running I want to ask you guys a question so what type of material do you want to see from what I'm seeing in the data the the reinforcement learning stuff is immensely popular by other content not so much so I'm going to keep focusing on this type of stuff but are you happy seeing the Sutton and Bardot type into reading material or do you want to see more deep learning type material right there's a whole host of dozens of deep reinforcement learning algorithms we can cover but I'm actually quite content to cover this stuff because I believe that if you can't master the basics then the deep learning stuff isn't going to make sense anyway right because you have the complexity of deep learning on top of the complexity of the reinforcement learning material on top of it so if there's anything that particularly you guys want to see make sure to leave a comment below and hey if you haven't subscribed and you happen to like reinforcement learning and machine learning material please consider doing so if you liked the video make sure to leave a thumbs up if you thought it sucked go ahead and leave a thumbs down and tell me why I'm happy to answer the comments hence your objections and if you guys have suggestions for improvement I'm all ears and here we are it is finally finished with all hundred thousand episodes and you can see here the running average over the course of those games as you would expect the agent begins to learn fairly quickly balancing the cart pull more and more and more by about sixty thousand games it starts to hit the consistently hit the 200 move threshold where it is able to balance the cart pull all 200 moves of the game now recall this was with a gamma of 1.0 I'm going to go ahead and rerun this for the gamma of 0.9 and see how it does so burn this image into your brain and I'm gonna go ahead and check it out with a gamma of 0.9 and see if we can do any better pen we are back with the second run using a gamma of 0.9 and you can see something quite interesting here so it actually only kind of ever reaches the 200 mark just for a handful of games and then kind of stutters along actually decreasing in performance as it goes along so something funny is going on here and to be frank I off the top of my head and not entirely certain why so I invite you all to speculate however the what's also interesting is that I this is the second time of recording this I recorded it earlier and didn't scroll down the code so you ended up staring at the same chunk of stuff had to redo it and in that case I had a gamma is 0.9 as well and it seemed to work just fine so I suspect there's some significant variation here to do with the random number generator and it could just all be due to that right this is a complex space and it wanders around different portions this could happen potentially because it doesn't visit all areas of the parameter space enough times to get a reasonable estimate of the samples and there may be some type of bias on where it visits later on in the course of the episodes although that sounds kind of unlikely to me but either way that is 
double q-learning you can see how the hyper parameters actually affect the model it seems to have a fairly large effect as you might expect and the next video we're going to be taking a look at double sarsa so if you are not subscribed I ask you to please consider doing so hit the notification icon so you can see when I release that video I look forward to seeing you all in the next videowelcome back everybody to machine learning with Phil I am your host dr. Phil in yesterday's video we took a look as sarsa in the opening I Jim getting the cart Pole to balance itself as promised today we are looking at the algorithm of double Q learning also in the cart Pole opening IgM environment so we touched on Q learning many many months ago and the basic idea is that Q learning is a model free bootstrapped off policy learning algorithm what that means is model free it does not need to know it does not need the complete state transition dynamics of the environment to function it learns the game by playing it bootstrapped in that it doesn't need very many very much help getting started it generates estimates using its initial estimates which are totally arbitrary except for the terminal States all policy meaning that it is using a separate policy other than it is using a behavioral policy and the target policy to both learn about the environment and generate behavior respectively now when you deal with problems that when you deal with algorithms that take a maximizing approach to choosing actions you always get something called maximization bias so say you have some set of states with many different actions such that the action value function for that state and all actions is 0 then the agents estimate the Q capital Q of SN a can actually be will actually have some uncertainty to it and that uncertainty is actually a spread in the values right and that spread causes it to have some amount of positive bias and the max of the true values is 0 but the max of the capital Q the agents estimate is positive hence you have a positive bias and that can often be a problem in in reinforcement learning algorithms so this happens because you're using the same set of samples to max to determine the maximizing action as well as the value of that action and one way to solve this problem is to use two separate q functions to determine the max action and the value and you set up a relationship between them and then you alternate between them as you play the game so you're using one of them to determine the max action one of the determine its value and you alternate between them so that you eliminate the bias over time that's double Q learning in a nutshell the algorithm is the following so you initialize your alpha your epsilon where alpha is your learning rate epsilon is what you use for epsilon greedy you would initialize the Q 1 and Q 2 functions for all states and actions in your state and action space of course that's arbitrary except at the terminal states we must have a value of zero and you loop over your set of episodes and you initialize your state and for each episode right each step within the episode choose an action from using your state using an epsilon greedy strategy in the sum of Q 1 and Q 2 so you have the two separate key functions so if you're using single Q learning you would take the max action over just one Q but since you're dealing with two you have to account for that somehow right you could do a max you could do a sum you could do an average in this case we're gonna take the sum of the two Q 
functions take that action get your reward observe the new state and then with a 0.5 probability either update q1 or q2 according to this update rule here and of course at the end of the step go ahead and set your estate to the new state and keep looping until the game is done so clear as mud I hope so by the way if you want more reinforcement learning content make sure to subscribe hit the bell icon so you get notified let's get to it so next up we have our code and here we are inside of our code editor again I'm using Visual Studio code to take a look at our double Q learning script I'm not gonna be typing into the terminal I think that's probably a little bit annoying I'm just gonna review the code as we go along if you have seen my video on the source and algorithm there's gonna be a fair amount of overlap because we're solving the same set of problems over again the only real difference is in that video we use source to calculate the action value function and in this case we're using double Q learning again we have a max action function what this does is tells us the max action for a given state to construct that you make a numpy array out of a list that is for a given state in both actions and as we said in the video we're going to take the sum of the q1 and q2 for given state for both actions you want to take the arc max of that and recall in numpy the Arg max function if there is a tie returns the first element so if the left and right actions both have identical action value functions then returned the left action consistently that may or may not be a problem is just something to be aware of mm-hmm and once again we have to discretize the spaces recall that the carpool problem which is just a cart sliding along a track with a pole that is that must be maintained vertically right and the cart pole example we have a continuous space the X and the theta can be any number within a given range and likewise for the velocities to deal with that we have a couple options we could simply use neural networks to approximate those functions but in this case we're going to use a little trick to discretize the space so we're going to divide it up into ten equal chunks and any number that falls within a particular chunk will be assigned an integer so you'll go from a continuous to a discrete representation of your for vector the observation along with that comes a gets state observing state along that comes a get state function that you pass in the observation and it just uses those digitized spaces excuse me just uses those linear spaces to use the numpy digitize function to get the integer representation of the respective elements of your observation I've also added a function to plot the running average here I do this because in this R sub video we ended up with a little bit of a mess with 50,000 data points this will plot a running average over the prior 100 games next up we have to initialize our hyper parameters our learning rate of 0.1 this just controls the step size in the update equation the gamma is of course the discount factor the agent uses in its estimates of the future rewards so I don't believe this should actually be 0.9 IU I left it here because it's not super critical as far as I'm concerned it should really be 1.0 and the reason is that the purpose of discounting is to account for uncertainties in future rewards if you have some sequence of rewards with a probability of receiving them then it makes no sense to give each of those rewards equal weight because you don't know if 
you're going to get them in the cart poll example the rewards are certain as far as I'm aware of the state transition probabilities are one you know that if you move right you're going you end up moving right you know deterministically where the pole and the cart are gonna move so it shouldn't be discounted as far as I'm concerned epsilon is just the epsilon factor for our for our epsilon greedy algorithm and that's pretty much it for hyper parameters of the model next up we had to construct our state space so what this means oh baby's unhappy the state space is of course the the representation of the digitized space so we're gonna have for the cart position you're gonna have ten buckets the velocities ten buckets and likewise for the thetas thetas position and theta velocity so you're gonna have ten to the four possible states so 10,000 states and those are gonna be numbered all the way from 0 0 0 to 9999 that's all we're doing here is we're constructing the set of states next up we have to initialize our Q functions recall that the initialization is arbitrary except for the terminal state which must have a value of 0 the reason for this is that the terminal state by definition has a future value of 0 because you stopped playing the game right makes sense you could initialize this randomly you could initialize it with minus 1 plus 1 doesn't really matter so long as the terminal state is 0 for simplicity I'm initializing everything at 0 I'm gonna play a hundred thousand games the reason is that this algorithm eliminates bias but at the expense of convergence speed so you have to let it run a little bit longer an array for keeping track of the total rewards and we're gonna loop over a hundred thousand games printing out every 5,000 games to let us know it's still running I always want to reset you're done flying your rewards and reset the episode at the top and you're gonna loop over the episode getting your state calculating a random number for your epsilon greedy strategy you're gonna set the action to be the max action of Q 1 and Q 2 if the random number is less than 1 minus Epsilon otherwise you're gonna randomly sample your action space in any event you take that action get your new state reward and done flag and go ahead and tally up your reward and convert that observation to a state s Prime then go ahead and calculate a separate random number the purpose of this random number is to determine which q function we're going to update you know we're gonna be using one to calculate we're alternating between them because we have to eliminate the maximization bias right one is for finding the max action one is for finding the value of that action we alternate between episodes by way of this random number in both cases you want to collect the you want to calculate the max action either q1 or q2 and use the update rule I showed you in the slides to update the estimates for q1 and q2 as you go at the end of the episode I'm sorry at the end of the step excuse me you want to reset the old observation to the new one so that way you can get the state up here and at the end of the episode you want to go ahead and decrease Epsilon if you're not familiar with this epsilon greedy is just a strategy for dealing with the Explorer exploit dilemma so an agent always has some estimate of the future rewards based on its model the environment or its experience playing the game if it's model free or a model of the problem right it can either explore or exploit its best known action so one way of dealing with 
the dilemma of how much time should you spend exploring versus how much time should you spend exploiting is to use something called epsilon greedy meaning that some percentage of time you explore or some percentage of the time you exploit and the way that you get it to settle on a greedy strategy is to gradually decrease that exploration parameter epsilon over time and that's what we're doing here and of course you want to keep track of the total rewards for that episode and recall in the carpol example the agent gets a reward of positive 1 every time the pole stays vertical and so every move that it doesn't flop over it gets one point and at the end you're gonna go ahead and plot your running averages so I'm gonna go ahead and run that and that'll take a minute while it's running I want to ask you guys a question so what type of material do you want to see from what I'm seeing in the data the the reinforcement learning stuff is immensely popular by other content not so much so I'm going to keep focusing on this type of stuff but are you happy seeing the Sutton and Bardot type into reading material or do you want to see more deep learning type material right there's a whole host of dozens of deep reinforcement learning algorithms we can cover but I'm actually quite content to cover this stuff because I believe that if you can't master the basics then the deep learning stuff isn't going to make sense anyway right because you have the complexity of deep learning on top of the complexity of the reinforcement learning material on top of it so if there's anything that particularly you guys want to see make sure to leave a comment below and hey if you haven't subscribed and you happen to like reinforcement learning and machine learning material please consider doing so if you liked the video make sure to leave a thumbs up if you thought it sucked go ahead and leave a thumbs down and tell me why I'm happy to answer the comments hence your objections and if you guys have suggestions for improvement I'm all ears and here we are it is finally finished with all hundred thousand episodes and you can see here the running average over the course of those games as you would expect the agent begins to learn fairly quickly balancing the cart pull more and more and more by about sixty thousand games it starts to hit the consistently hit the 200 move threshold where it is able to balance the cart pull all 200 moves of the game now recall this was with a gamma of 1.0 I'm going to go ahead and rerun this for the gamma of 0.9 and see how it does so burn this image into your brain and I'm gonna go ahead and check it out with a gamma of 0.9 and see if we can do any better pen we are back with the second run using a gamma of 0.9 and you can see something quite interesting here so it actually only kind of ever reaches the 200 mark just for a handful of games and then kind of stutters along actually decreasing in performance as it goes along so something funny is going on here and to be frank I off the top of my head and not entirely certain why so I invite you all to speculate however the what's also interesting is that I this is the second time of recording this I recorded it earlier and didn't scroll down the code so you ended up staring at the same chunk of stuff had to redo it and in that case I had a gamma is 0.9 as well and it seemed to work just fine so I suspect there's some significant variation here to do with the random number generator and it could just all be due to that right this is a complex space 
and it wanders around different portions this could happen potentially because it doesn't visit all areas of the parameter space enough times to get a reasonable estimate of the samples and there may be some type of bias on where it visits later on in the course of the episodes although that sounds kind of unlikely to me but either way that is double q-learning you can see how the hyper parameters actually affect the model it seems to have a fairly large effect as you might expect and the next video we're going to be taking a look at double sarsa so if you are not subscribed I ask you to please consider doing so hit the notification icon so you can see when I release that video I look forward to seeing you all in the next video\n"