Using TensorFlow 2 to Implement Proximal Policy Optimization (PPO)
We don't have a .mean() method in TensorFlow 2, so I will say actor_loss equals tf.math.reduce_mean of the actor loss. Then our returns will be the advantage plus the values for that batch, and our critic loss will be... why is this complaining? Oh, because I called it weighted_probs; let's call it weighted_clipped_probs, because that's more accurate. There we go. Our critic loss will be the mean squared error between the critic value and those returns. Then the linter is unhappy because we haven't used critic_loss yet, which I can understand, so let's delete that line to get rid of the error.
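To make that concrete, here is a minimal sketch of the loss computation that sits inside the gradient tape. Assume prob_ratio is the exponential of new_probs minus old_probs, that advantage and values here stand for the already-indexed batch slices, and that critic_value is the squeezed critic output; the function wrapper and its name are just for illustration.

```python
import tensorflow as tf
import tensorflow.keras as keras

def ppo_losses(prob_ratio, advantage, values, critic_value, policy_clip=0.2):
    # Clipped surrogate objective for the actor.
    weighted_probs = advantage * prob_ratio
    clipped_probs = tf.clip_by_value(prob_ratio,
                                     1 - policy_clip,
                                     1 + policy_clip)
    weighted_clipped_probs = clipped_probs * advantage

    # TF tensors have no .mean() method, so reduce_mean collapses to a scalar.
    actor_loss = -tf.math.minimum(weighted_probs, weighted_clipped_probs)
    actor_loss = tf.math.reduce_mean(actor_loss)

    # The critic regresses toward the returns: advantage plus value estimates.
    returns = advantage + values
    critic_loss = keras.losses.MSE(critic_value, returns)

    return actor_loss, critic_loss
```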
Outside of the gradient tape we want to say actor_params equals self.actor.trainable_variables, and likewise for the critic: critic_params equals self.critic.trainable_variables. Then actor_grads equals tape.gradient of actor_loss with respect to actor_params, and critic_grads equals tape.gradient of critic_loss with respect to critic_params, so we're getting the gradients of each loss with respect to its own network's parameters.
We say self.actor.optimizer.apply_gradients on zip of actor_grads and actor_params, and likewise for the critic, and we do still want to clear our memory at the end. There are no obvious errors here, so let's scroll up and check: it's unhappy about missing whitespace. Did I reformat that in my cheat sheet? I did. This is something I don't like about the PEP 8 style guidelines: the code has to look like that just so the linter doesn't complain, which is a little obnoxious to me. Then this has to be indented, and that goes there; okay, that's happy. Then it's unhappy about whitespace again; all right, now it is no longer unhappy, so we can write-quit out of that and check out our main file.
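As a sketch, the whole gradient step after the tape block can be read as a small helper like the one below, assuming the tape was created with persistent=True and that both networks were compiled with their own Adam optimizer; the helper name is hypothetical.

```python
import tensorflow as tf

def apply_ppo_gradients(tape, actor, critic, actor_loss, critic_loss):
    # persistent=True on the tape is what lets us call tape.gradient twice.
    actor_params = actor.trainable_variables
    critic_params = critic.trainable_variables

    actor_grads = tape.gradient(actor_loss, actor_params)
    critic_grads = tape.gradient(critic_loss, critic_params)

    # apply_gradients expects (gradient, variable) pairs, hence the zip.
    actor.optimizer.apply_gradients(zip(actor_grads, actor_params))
    critic.optimizer.apply_gradients(zip(critic_grads, critic_params))
```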
We have to change the import so it reads from agent import Agent, and of course I have to fix my formatting there; I'm just checking the main file over here to make sure I didn't change anything else. There's a missing whitespace there and an extra whitespace there. I kept the number of games the same, and I did change the name of the remember function, so let's change that call to store_transition. Then it's unhappy about one of the lines; what did I do here? Okay, oops, fixed.
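For reference, the main loop ends up looking roughly like the sketch below. The constructor keyword arguments, the CartPole-v0 environment, and the learn-every-N-steps schedule are assumptions carried over from the PyTorch version, so treat the exact signatures as illustrative rather than definitive.

```python
import gym
from agent import Agent  # the TF2 agent module from this video

if __name__ == '__main__':
    env = gym.make('CartPole-v0')
    N = 20  # learn every N environment steps, as in the PyTorch version
    agent = Agent(n_actions=env.action_space.n, batch_size=5,
                  alpha=0.0003, n_epochs=4,
                  input_dims=env.observation_space.shape)

    n_steps = 0
    for i in range(300):
        observation = env.reset()
        done = False
        score = 0
        while not done:
            action, prob, val = agent.choose_action(observation)
            observation_, reward, done, info = env.step(action)
            n_steps += 1
            score += reward
            # remember() was renamed to store_transition(); same arguments.
            agent.store_transition(observation, action, prob, val,
                                   reward, done)
            if n_steps % N == 0:
                agent.learn()
            observation = observation_
        print('episode', i, 'score %.1f' % score)
```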
Then we can scroll down: it's unhappy about the formatting here, a blank line and an extra line; okay, now it is happy. So let's exit out of here, cross our fingers, and run it to see how it does. It's unhappy about something: it mentions prob_ratio, and the error is an InvalidArgumentError, "requires broadcastable shapes at location unknown", for the operation Subtract, which points to line 87 in learn, the new_probs minus old_probs line. I'm going to take a look at my cheat sheet, because I'm fairly certain I didn't mess that up there, and then I'll take a look at the code.
So we go back to agent.py, around line 87. New_probs is dist.log_prob of the actions, okay, we have that, and old_probs comes from our memory. So if the issue is with dimensionality, it stands to reason that old_probs is probably what's messed up. Scrolling back up, old_probs comes from choose_action, where I calculate the log prob as log_prob.numpy() sub zero. Oh, I know what I did: of course, I don't want to return probs, I want to return log_prob. Let's try it again; that's some real-time debugging for you.
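Here is what the corrected choose_action looks like as a sketch, with the return value switched from the raw probabilities to the log prob of the sampled action; tensorflow_probability imported as tfp is assumed, and minor details may differ from the repo.

```python
import tensorflow as tf
import tensorflow_probability as tfp

def choose_action(self, observation):
    # Add the batch dimension; no dtype or device handling needed in TF2.
    state = tf.convert_to_tensor([observation])

    probs = self.actor(state)  # softmax output of the actor network
    dist = tfp.distributions.Categorical(probs=probs)
    action = dist.sample()
    log_prob = dist.log_prob(action)
    value = self.critic(state)

    # Dereference the tensors; [0] strips away the batch dimension.
    action = action.numpy()[0]
    value = value.numpy()[0]
    log_prob = log_prob.numpy()[0]  # the fix: return this, not the raw probs

    return action, log_prob, value
```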
And it starts playing right off the bat; that was my only mistake, which is a new record. I'm going to let that run for a little while, and once it becomes obvious it's learning I'll return to show you the results. Of course, when it finished I realized I didn't do a mkdir plots, so there's no plot; however, you can see just by looking at the overall trend in the scores that the agent was indeed learning. As the question was once asked: is our children learning? Am I right? We see scores of 200, 116, 138, 173, so certainly better than you would get by chance, and the scores do trend up over time.
You can see that in the beginning there were long stretches where it was getting relatively low scores, although it did seem to figure things out at least a little bit pretty quickly: scores of around 50 or less at the very beginning, progressing into scores in the mid-100s and all the way up to 200. That seems to me like pretty solid evidence that our TensorFlow 2 PPO agent works. If you'd like the full details on how all of this works, check out the PyTorch tutorial; I didn't want to get too verbose here and repeat myself, so watch that video if you need a deeper explanation. Otherwise, this was the TensorFlow 2 implementation of PPO. Questions, comments? Leave them down below, and I'll see you in the next video.
"WEBVTTKind: captionsLanguage: enin today's video we are going to code a proximal policy optimization agent in tensorflow 2. we're not going to need any long-winded explanations in fact we're going to use my pi torch code as a basis for this agent so the first thing we want to do is a get clone of my youtube repo and that will give us access to the pi torch code so let's cd into the youtube code repository do a list go into reinforcement learning we want to go to policy gradient and ppo and then you can see see here we have a torch directory so let's make dur tf2 and then cd you know what we can actually do is uh copy torch forward slash star.pi into tf2 and then cdtf2 and do a list so you can see we have a main ppo torch and utils now if we take a look at the ppo torch file you can see we have the memory class we have the networks so an actor network and a critic network and we also have our agent class now i want to kind of rearrange the structure here because i no longer like doing it this way and i didn't really like it at the time i just did it because it's youtube but what we want to do is we want to copy ppo torch into i believe i called it memory.pi we want to copy ppo torch into networks dot pi and then we want to just move it into agent dot pi then if you do another list we can see we have agent main memory networks and utils much more clear and self-explanatory so we will start with the memory file because that is quite easy so the only thing we're going to need here is numpy we can get rid of all the other dependencies and then we will deal with these over indentations at a moment the first thing i want to do however is delete all of the superfluous code so we're going to get rid of all the networks and the agent so get rid of that and then all we're left with is the code for the ppo memory so then this is over indented so my version of my installation of vim follows most of the pep 8 style guide i silence the warnings for quite a few things because they're simply too annoying so let's do one indentation and see what it says it likes that so then let's just do that okay and so there are no other warnings so we can write quit out of that then we can take a look at the networks now here we won't need os or numpy in fact we won't need any of the torch stuff obviously because we're using tensorflow 2 and we can def we can delete not defeat we can delete our ppo memory and then we are left with the actor network critic network and the agent so let's get rid of the agent and then we will focus on rewriting our networks using tensorflow 2 instead of pi torch so let's scroll up the first thing i want to do is delete these save checkpoint functions because i've moved all of that into the agent class and the reason why we'll be apparent once we get there it's just a little bit cleaner to do it that way delete that okay so now we have a whole bunch of imports we have to take care of let's say tensorflow has tf uh import tensorflow.keras has keras and from tensorflow keras layers import dense so we're not going to be doing i should specify that we're going to be doing the discrete action case and we're not going to be doing any stuff with like say the atari library we're just going to do the carpool environment because it executes very very quickly and we can demonstrate that the agent works very very quickly so in in tensorflow 2 your class is derived from keras.model we don't need input dimms we won't need an alpha which is a learning rate because we're going to compile our model in the 
agent class and we also won't need that checkpoint directory because we'll be doing the model checkpointing within the agent class so then we can just keep all of those defaults get rid of this checkpoint file we will rewrite the network before we delete the code there so we will say self.fc1 equals dense and we don't have to specify any input dimensions because tensorflow 2 infers them but we do want to specify an activation function that's very handy you don't have to call a separate function to handle that then we can do a second fully connected layer sc2dims activation equals value once again and then we have our final layer which we will call i don't know i guess pi equals dense that takes n actions outputs and our activation function is a value excuse me sorry soft max it's staring me right in the face so then delete this we also don't need our optimizer we also do not need our optimizer device or the self.2 command other thing i want to do is uh rename this function as call that will allow us to use the name of our object to perform the forward propagation so then for our feed forward let's delete this and say x equals self.fc1 state dot fc2 x excel.fc3 and then just return that x that isn't the best form wow there's a whole lot of white space here good grief let's do that okay so um we're just going to leave it as the name x because it's just basically a dummy variable you know it's the uh actually this is an issue we have to call that pi right actually in my cheat sheet i did call it fc3 so let's just be consistent with that so next up we do something similar for the critic network so derived from keras dot model we once again do not need input dimms or a learning rate alpha or a checkpoint directory we will keep the default parameters and we don't need the file joining operation so we'll define our layers very s uh similar to what we have in the actor network and our output is a single value with no activation function because we want the actual value of the state according to the deep neural network then we can get rid of all of this stuff because we won't need it rename this call and say x sorry that should be state and then return x so let's see if i have any other apparently i do not use tensorflow okay so we'll get rid of that import and then that is good to go so we have our memory we have our networks now we have to deal with our agent okay so i didn't delete the other stuff in the agent file let's start with that so we'll get rid of the ppo memory the actor network critic network and then we're left with our agent class so we don't need any of the torch stuff we don't need the os stuff we will need numpy we'll need tensorflow as tf keras we'll need our atom optimizer uh one other note you will need the tensorflow probability package for this i'll do a pip list so you can see what it looks like but that is necessary for this agent we also need our memory ppo memory and our networks our actor and critic network um so then for our agent we have a bunch of default parameters we have our policy clip batch size number of epochs we will need a checkpoint directory i think this line is too long isn't it so let's fix that and then come down to a new line and say checkpoint directory equals models so then we have our gamma policy clip number of epochs why is it unhappy oh it's under indented of course okay save our gamma policy clip number of epochs g a e lambda that is for our generalized advantage estimation checkpointer equals checkpoint der we have our actor network we don't 
need to pass in our input dimms or our alpha likewise for our critic network we only need to call the constructor and then we can handle uh compiling them so we'll say self.actor.compile optimizer opt mizer equals atom learning rate equals alpha and then we're going to compile our critic and obviously it's a hybrid parameter the learning rate of both neural networks to play with i've used a single value for both networks and it works just fine in the cart poll it stands to reason for more complex environments you're going to need you know different learning rates perhaps so our memory stays the same and then i'm going to change the name of this function i don't like the name remember anymore i'm going to call it store transition it does the same thing so then we have our functions to save and load models so these are going to be slightly different what we want to do is say self.actor.save checkpoint directory plus actor and plus critic and then for loading our models it's unhappy oh i have an equal sign i didn't press the shift key of course so then we say self.actor equals keras models load model checkpoint directory plus actor okay so then there are no mistakes there um so now we come down to our choose action function and this will be a little bit different so obviously it is not a pytorch tensor it is a tensorflow convert to tensor we do want to add the batch dimension to our observation we don't have to specify a d type and we certainly don't have to send it to a device because this is tensorflow 2. then actually let's delete all of this so i don't get confused i will say our probabilities equals cell.actor state our distribution is tfp distributions dot categorical based on the probability supplied by our network the action is going to be a sampling of that distribution and we will need the log prop as in the log of the probability of taking that action and we want the value of that action according to our critic network and then we want to convert all those we want to get the values out of those more specifically so we're going to dereference them with numpy so we say action equals action dot numpy this is like saying dot item in pi torch and log prob equals log prob dot numpy and we have to take the zeroth element because it returns an array so then we return our action log prob and value and then we come down to our learn function let me press escape first to make sure i didn't make any obvious mistakes there okay it looks good so our learn function is going to be a little bit different so the iteration over the number of epochs sampling the memory converting all this stuff here is going to stay the same a couple things we don't want to do is we don't want to convert the advantage in values to tensors reason being is that tensorflow 2 can deal quite happily with numpy arrays in the loss function and also the tensorflow 2 tensors don't support indexing in the same way that numpy arrays do so rather than rewrite this code to accommodate that fact i'm going to leave the advantage and values arrays as numpy arrays because they work just fine in tensorflow too so delete that and then we come down here for batch and batches so this is where our actual learning comes in so the first thing i want to do is account for the fact that we are in fact using tensorflow 2 and i will do so by using the gradient tape context manager so we say with tf gradient tape dent equals true as tape now what that persistent equals true does is let me just indent all of that is allows us to back propagate 
twice and we want to do that because we're going to stick all of our calculations for the gradients and the loss excuse me all of our calculations for the loss function within this context manager because in tensorflow 2 only things in this context manager get counted towards the loss function so the first thing i'll do is handle the conversion to tensors so convert to tensor state array batch and we don't need the data type or the dot 2 method similarly for the other parameters here convert to tensor we don't need any of that we don't need any of this and here we have fat tf convert to tensor so then uh i'm gonna come down and actually get rid of the total loss uh the back propagation and the stepping of the optimizers because we don't do any of that stuff in pi torch excuse me tensorflow 2 and i'm going to delete those useless lines at the end of the file and then get rid of my white space just to get rid of some warnings okay so now we come back up we have our states old props and actions so i'm going to rename this call it probs because we're not returning a distribution we're returning probabilities equals self.actor states then we're going to call our distribution t distributions categorical problems and get our new probabilities so down here it has new problems equals disk.log prop actions where if you recall we said dist equals self.actor.self.actor of states so in the pi torch implementation the forward propagation returned an actual distribution that we could use to take the log of the probability of the actions that were actually taken during this batch we don't have that luxury here so what we're doing is getting the probabilities defined by our softmax and then creating a new distribution this is just kind of the easiest way to do it you don't want to stick the distribution inside of the call function for the actor network because then when you go to save the model it barks at you that it can't save a distribution or a tensorflow object other than a tensor i don't remember the exact details of the message but the bottom line was it had to be a tensor to save and so you have to call the distribution here instead of in the call function it's a long-winded way of saying that's why you got to do that so we have dist and then we have new probs equals dist.log prob actions so i'll put it there and delete the line there we don't want to do it twice there's no point in that then we have critic value equals self.critic states and we're going to need to squeeze that because we have a batch dimension so we'll say tf.squeeze and we want to specify the first dimension so then our prob ratio is i'm going to delete this line and use this we're going to say tf.math.exponential of the difference between new probs and old so then we have i'm tired i'm probably misspeaking sorry so then we have weighted problems equals advantage sub batch minus or excuse me times prob ratio we'll keep that the same then i'm going to call this weighted probabilities um do i want to how do i want to do that it's unhappy undefined name t because we have to call this tf dot clip by value instead of clamp and then it's prob ratio one minus that one minus our policy clip and then one plus policy clip so to clip it between the range 0.8 and 1.2 because our balls eclipse defaults to 0.2 and then i'm going to move this line down and say weighted probes equals clipped probs times the advantage of the batch and i'm going to change this to clipped props the reason being is the pep8 style guidelines will get very upset at me 