How to Code RL Agents Like DeepMind

The world of deep reinforcement learning is vast and complex, with many frameworks and architectures used to train agents to play video games or navigate challenging environments. In this article, we will explore one framework that has drawn significant attention recently: Acme, DeepMind's own library for building reinforcement learning agents.

The author starts by explaining how to assemble a simple agent from Acme's building blocks, with the networks defined in Sonnet, DeepMind's open-source neural-network library that sits on top of TensorFlow. To get started, you import the necessary libraries, wrap the environment so it matches the interface Acme expects, build an environment spec, and compute the number of action dimensions by taking the product of the action spec's shape and casting it to an integer. The policy network is then a Sonnet sequential module: a layer-norm MLP with layers of size 64, 64, and the number of action dimensions, finished with a tanh activation, essentially the architecture of the original DDPG paper.
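As a rough illustration, here is a minimal sketch of that setup for the swing-up pendulum task used in the video. The layer sizes follow the author's choices; the wrapper and module names reflect the TensorFlow side of Acme and Sonnet as I recall them, so treat the exact identifiers as assumptions to verify against your installed version.

```python
import gym
import numpy as np
import sonnet as snt
import tensorflow as tf

from acme import specs, wrappers
from acme.tf import networks

# Wrap the Gym environment so it speaks the dm_env interface Acme expects,
# and cast everything to single precision for better GPU throughput.
env = wrappers.GymWrapper(gym.make('Pendulum-v0'))
env = wrappers.SinglePrecisionWrapper(env)

# The environment spec describes the shapes and dtypes of observations,
# actions, rewards, and discounts.
env_spec = specs.make_environment_spec(env)

# Number of action dimensions: product of the action shape, cast to int.
num_dims = np.prod(env_spec.actions.shape, dtype=int)

# Policy network: a layer-norm MLP of sizes (64, 64, num_dims) with a final
# tanh, roughly what the original DDPG paper implements.
policy_network = snt.Sequential([
    networks.LayerNormMLP([64, 64, num_dims]),
    tf.nn.tanh,
])
```

A plain tanh squashes the output to [-1, 1]; if the environment's action bounds differ, a helper like `networks.TanhToSpec(env_spec.actions)` can be used instead to rescale to the spec, assuming it is available in your Acme version.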

The next part of the article covers the critic network, another Sonnet sequential module used to estimate the value function. A critic multiplexer first concatenates the observations and actions, as in the original paper, and the result is fed through another layer-norm MLP with layers of size 64, 64, and 1. The final layer has a single output unit, since the critic produces one scalar: the estimated value of the state-action pair. The critic is a separate component from the policy network, and the two play different roles in the actor-critic setup.
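Continuing the same hypothetical sketch (and reusing the imports above), the critic can be written as another Sonnet sequential module whose first element is Acme's critic multiplexer. The layer sizes again follow the video; the module names are my best recollection of the TF API and should be checked.

```python
import sonnet as snt
from acme.tf import networks

# Critic network: the multiplexer concatenates observations and actions,
# then a layer-norm MLP of sizes (64, 64, 1) maps the result to a single
# scalar, the estimated value of the state-action pair.
critic_network = snt.Sequential([
    networks.CriticMultiplexer(),
    networks.LayerNormMLP([64, 64, 1]),
])
```

Sonnet's Sequential passes any extra call arguments to its first module, which is why the multiplexer can sit at the head of the chain and receive the observation and action together.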

The article then moves on to the agent itself, a DDPG (Deep Deterministic Policy Gradient) agent. Constructing it requires a few more pieces: the environment spec, the policy network, the critic network, and an observation network. The environment spec describes the environment's observation and action spaces, the policy network generates actions from the current state, and the critic network estimates the value of each state-action pair. The agent and the environment are then tied together in an environment loop that runs the episodes.
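Putting the pieces together might look like the sketch below. The keyword names (observation_network, checkpoint) and the batch-concat observation network are how I remember the TF DDPG agent being driven in the video and in Acme's examples, so take them as assumptions rather than a definitive recipe.

```python
import acme
from acme.agents.tf import ddpg
from acme.tf import utils as tf2_utils

# Assemble the DDPG agent from the spec and the two networks defined above.
# The observation network simply batch-concatenates the observation;
# checkpointing is switched off for this quick demo.
agent = ddpg.DDPG(
    environment_spec=env_spec,
    policy_network=policy_network,
    critic_network=critic_network,
    observation_network=tf2_utils.batch_concat,
    checkpoint=False,
)

# The environment loop ties agent and environment together and runs episodes;
# under the hood the agent handles Reverb, replay sampling, and learning.
loop = acme.EnvironmentLoop(env, agent)
loop.run(num_episodes=10)
```

Running this for a handful of episodes is enough to confirm that the Reverb replay server spins up and the learner starts logging episode returns to the terminal.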

The author notes a few details worth knowing when building an agent this way. You do not have to handle compiling the networks or similar plumbing yourself; Sonnet takes care of that under the hood. Checkpointing of the agent's learning is controlled by a flag on the agent's constructor; the author switches it off for this quick demo, but it is useful for longer training runs.

One of the most appealing aspects of the framework is how quickly you can get an agent running: the author has a simple agent training in a handful of lines, without having to worry about the replay server, the data pipeline, or the training loop. Their longer-term goal is to use Acme to implement far more complex algorithms such as Never Give Up.

However, the author also notes that there are some limitations to this framework. For example, they mention that the hyperparameters are not tuned in the provided code, which may lead to suboptimal performance. They also note that running agents in parallel is not trivial, and would require additional work to implement.
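To give a sense of what that tuning involves, the TF DDPG agent exposes its main hyperparameters as constructor arguments. The names below (discount, batch_size, sigma, n_step) are from memory and may differ between Acme releases, so check the agent's signature before relying on them.

```python
# Hypothetical tuning pass: override a few defaults on the agent constructor.
agent = ddpg.DDPG(
    environment_spec=env_spec,
    policy_network=policy_network,
    critic_network=critic_network,
    discount=0.99,    # return discount factor
    batch_size=128,   # samples drawn from the Reverb table per learner step
    sigma=0.2,        # std-dev of the Gaussian exploration noise
    n_step=5,         # n-step transitions written to replay
    checkpoint=False,
)
```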

In conclusion, Acme is a powerful tool for building reinforcement learning agents. While it is designed to be easy to use, there are still complexities and limitations to be aware of. The author hopes the article provides a useful introduction to the framework, and invites readers to follow along as they explore more complex algorithms like Never Give Up.

Finally, the author mentions that DeepMind keeps some of its secrets under wraps, including the details of how agents are run in parallel at scale. They also note that tuning hyperparameters can be crucial to getting good performance out of the framework. Overall, the article offers a solid first look at Acme and its capabilities.

One thing that stands out is the author's enthusiasm for the topic. They seem genuinely excited about exploring deep reinforcement learning and eager to share what they learn. The tone is informal and conversational, more like a discussion between friends than a formal tutorial.

The author also mentions that they will be creating videos in the coming weeks and months as they delve deeper into topics related to this framework. They invite readers to subscribe to their channel and share any questions or comments they may have. Overall, the article feels like a starting point for someone looking to explore deep reinforcement learning, rather than a comprehensive guide.

The discussion also underscores how complex deep reinforcement learning becomes at scale, as DeepMind's own work demonstrates. Distributed agents such as R2D2 run hundreds of actors in parallel, while the open-source code supplies single-threaded versions, so running agents in parallel is not trivial and deserves careful thought before attempting projects of that size.

"WEBVTTKind: captionsLanguage: enin today's video we are going to use deepmind's very own deep reinforcement learning framework acme to code up a couple of simple reinforcement learning agents now as you might imagine acme is an incredibly complex framework as such we are not going to have a whole lot of time to go over all of the intricacies of it today instead in a couple future videos i will look at how to create custom reinforcement learning agents using this framework in particular we're going to implement soft androcritic and td3 as these have not been implemented by the deep mind team now if you don't know anything about those two algorithms never fear i have a couple of videos linked over here in the top left so the first thing we are going to want to do is copy the code from the github so we're going to copy the git link and then head to the terminal and do a git clone so we'll do that first we will get clone and that will clone into the acme directory and then we're going to want to create our very own virtual environment for this reason being there are a number of dependencies for this which will probably mess up your base python installation if you should do this just willy-nilly without a virtual environment so we'll do making our own virtual environment we will then activate that of course we have the cd into acme and then do source spin activate and then it will work then we have to upgrade our pip and then we can begin the process of installing the deep mind acme framework now keep in mind that being google this is implemented in tensorflow but they also have dependencies and code for the jaxx library which honestly i don't know really anything about it so we're just going to stick with the tensorflow stuff so we'll say pip install dmacm tensorflow and then we'll do pip install dm acme envs this will give us some of the base environments we will have to install jim in a moment i'll get to that in a few minutes now acme is built on top of many other packages written by deepmind one of which is something called reverb now reverb is a framework for serving machine learning data and the basic idea is that you have a server that serves data to any number of clients and we're going to be adding data tables to this when we code our own custom agents for the code we're running today it's already implemented for us but the basic idea is you create data tables within this server and then sample that data using whatever strategy you want like say prioritize experience replay or n-step transitions anything like that using any number of servers it's highly scalable it's an incredibly powerful technology and everything we're going to be doing is built atop that and so then we also need something called sonnet now sonnet sits atop of tensorflow as a sort of additional framework for creating deep neural networks i haven't played around a whole lot with it i've used it for these projects but it's not something i've done a deep dive on and i don't really know what all is going on with it so next we're going to want to install tensorflow and i'm going to specify a particular version reason being the 2.8.0 release candidate has some conflicts with some of the other codes so i'm going to specify 2.7 because i know that works now of course if you're watching this in the future maybe the 2.8 full release has no issues maybe they've ironed everything out but for now let's stick with what works next we will need tensorflow probability courses handles probability distributions for policies next we 
need to install something called checks i don't know what that is honestly i didn't have the energy to look it up next we have trfl truffle that is a tensorflow reinforcement learning agent package and then we want to do an upgrade for jim atari and it's really a downgrade because we're running 0.21 we want to downgrade to 0.19 because as of 0.21 or perhaps 0.20 they quit including the roms with the gym environment so when we try to run pong or any other atari game we're going to get an error so let's downgrade to 0.19 so we should be good to go so now we can do a list go to the examples we can go to atari and then we can see we have a whole bunch of files here we have run dqn run impala and run r2d2 we're just going to run a dqn here to make sure it works and i'm expecting this error here so now this is easy to deal with what it's telling us is that we don't have a particular shared object file so of course i am running linux kubuntu 20 if you're curious and so i haven't set up the ld library path environment variable in this particular virtual environment so we can check that by saying echo ld library library library path and i believe i need the dollar sign here because that's a variable and it tells us it points to uh cuda 10-10.1 so then we can locate that particular file to see where it lives and why it's unhappy so we'll just do live python 3.7 and we can see that it lives in this particular directory so what we're going to do is export ld library path equals that directory colon dollar ld library path and so let me move my face out of the way here the colon ld library path will with the dollar sign we'll append the current variable for the ld library path and we can verify that by echoing it to the terminal and we see right here that it now includes both directories as we want it to do now if you're running windows i don't know the fix for this i don't use windows except for some light gaming every now and again you should probably upgrade to a real operating system preferably linux although probably mac is going to be up more people's alley but nonetheless let's try running this again okay so now it is starting the replay the reverb replay server so i'm saving some checkpoints and i can see it's spooling up my cpu usage now so that tells me it is running okay so we do get some output to the terminal uh you can't see it but it's telling us uh table priority priority tables access without grbc it's not doing any calls over the internet because it's just my local machine it has some code that indicates it is learning by outputting to the terminal and then it also does some output from each individual episode telling you the return the length and all that good stuff so let's stop that code it takes a second to shut down the replay server and then we're going to code up our own dqm agent as well as our own ddpg agent to see really how easy it is to get started with this so we're going to get out of this directory and we're going to go back here just to the base acme directory just for giggles so we're going to them into dqn dot pi and we're going to start with our imports we're going to say from absol import app we're going to use app.run to run our program let me zoom in a little bit for you guys and then we're going to import funk tools you'll see why we need that in a little bit now all of our environments are going to be wrapped if you've taken my course on deep q learning you probably understand that it just means that we're going to take the output from the gym environments and 
then um apply operations like say converting the screen images from integers to floats by dividing by 255 we're going to stack frames and repeat actions that kind of stuff so acme.agents.tf import dqn so we're going to import the actual agent and from tensorflow we're going to import the networks now as i said um sorry i can't talk and type at the same time as i said we're going to be using all of the building blocks they give us and so the structure of their program is a little bit different than what i typically code on this channel because they are professionals and they know best uh their structure is that they have an agent that has both an actor and a learner so the actor is basically the policy and the learner basically collects uh the data from the reverb data table and uses that to update the deep neural networks and those networks are defined in that package networks excuse me that sub package networks and so we are going to be assembling all these building blocks into a deep q learning agent that we're going to test so the first thing we want is a function to really wrap all of our environments and this is written all of our code is written using uh i guess this type of syntax of specifying the typing came about in like 3.7 or maybe 3.6 something like that which i really like it is one of my biggest complaints about python is that it's not a strongly typed language now of course this doesn't make it strongly typed but it goes a long way towards rectifying that issue so here we're going to specify our level no frame skip dash v4 and they have something called an oar wrapper what that does is it takes the time step from the environment and wraps it into an observation action reward tuple they use it for some other agents but not for the deep q learning agent it's a little bit i find a little bit inconvenient obviously they do it because they find it convenient but for someone from the outside looking in it's a little bit difficult to wrap your mind about around maybe i'm just a i don't know other thing to note is this arrow and then the return type that tells the python interpreter what to expect for a return type now as far as the typing what i'm talking about specifically as a brief aside is that in python you can pass in a variable to a function and change its data type from say a float to a string or from a custom object to a boolean and python never says hey did you really mean to do that and it can result in some subtle bugs down the line if you write your code in such a way that it's not strongly dependent on a particular data type so sometimes strings will evaluate in certain ways versus floats and it can make it inconvenient to track down bugs so i'm glad they at least uh inserted some features to at least improve the type checking capabilities so we want to make our environment uh level full action space equals true i believe what that means is that in the um the ale imports the ale environments themselves you have a total of 18 actions for all the atari games whereas in some of the open ai implementations they reduce it down to just the stuff that actually does something so you don't have like 16 no ops and then two actions that actually do anything so then we'll say max episode length equals 108 000 if evaluation else 50 000 and then we're going to make a wrapper list that just tells the module what we want to do so we'll say wrappers.jim atari wrapper that is going to handle all the good stuff like stacking frames converting to grayscale resizing and then 
funktools.partial this will apply the atari wrapper and to float equals true that means we're going to convert the observations to a floating point number instead of integers and this zero discount on life loss i believe what that means is it sets the reward to zero the total discount uh to zero at the end of the episode um so that the loss of life actually affects the agent that's my understanding again i haven't mastered everything within this framework so take that in mind if you are a little bit more familiar with it and i say something incorrect please correct me in the comments i don't claim to be an expert on something that deep mind wrote i'm just using it uh to learn how the experts do stuff so then we're going to wrap all of our wrappers into our environment and return it and then we have our main function and that's not a typo you do need that underscore there uh and we're gonna say env equals make environment we're gonna need an env spec uh sorry what that does is um it gives us the specifications for environment like the shape of the observations and action space make environment spec i'm going to pass in the environment now here we're going to kind of cheat a little bit and use their built-in networks so they have one specifically for a dqn in atari evspec actions.num value so we have to specify the number of actions now what this does is that it creates a uh it'll create a convolutional network just like what we see in the deep the deep q learning paper where it's got a bunch of convolutions and then we'll actually implement the dueling architecture from the paper on dueling architectures uh for you so you don't even have to worry about that and then our agent we're gonna need to pass in a an env spec and a network and then we need an environment loop to play the game environment loop env agent and then we're going to run we're just going to do 10 episodes uh other thing i want to note here is that the default parameters don't give you good learning so you do have to do hyper parameter tuning on this that's one way in which you know they don't completely and totally hold your hand and so then we want to run our app now let's it's unhappy expected two lines i upgraded my vim installation with some pep 8 guidelines and it's unhappy about something else undefined name bools that's because it is bool there we go and now it should be okay so let's uh sorry python dqn dot pi and see what happens module acme rappers has no attribute jim atari rapper that's because it is jim atari adapter that's right there we go that is a typo let's try that okay so it's loading our reverb server that's a good sign and then boom it is learning so you see see how incredibly easy it is to get this up and running but we didn't really do a whole lot so perhaps it's not all that satisfying for you in particular one thing we can do to have a little bit more involvement in the process is to actually define our own networks and so to demonstrate how to do that i'm going to show you how to do a ddpg agent which of course has the policy and critic network the actor critic network so let's terminate that okay and then we're going to do deep deterministic policy gradients so we'll say from absol import app we'll need our wrappers again and again they have a ddpg agent which takes the networks handles the interface to reverb handles the functionality to actually add memories to that server the data table on the server has functionality for sampling a tensorflow data set uh as well as functionality for handling 
the learning so there's a lot that goes on under the hood in these agents and again i'll come back later to do soft after critic and twin delay deep deterministic policy gradients but we're kind of running long already about 20 minutes and so i just want to give you guys a brief overview of how all this works so we say macmi.tf import networks we're going to need some utilities we'll need acme jim numpy sonnet they always abbreviate that as snt and we will need tensorflow next we have our main function we're going to make our environment i have a few different possibilities here i'm gonna use the pendulum v one that is just a swing up pendulum it's a very simple environment but this is um not in a format that that uh deepmind's framework is gonna expect so we have to do a gym wrapper to handle the interface with the dm underscore environment the deepmind environments and one common thing they do is use the single precision wrapper to convert everything to single precision to get max throughput on your gpu then we make our environment spec we're going to need the number of dimms in our action space so we'll say mpprod any spec actions shape and cast it as an int batch concat uh then we have our policy network sonnet sequential this is where sonic comes in so we're going to pass it a a set of layers so we'll say layer norm mlp multi-layer perceptron we'll do 64 by 64 by end dimms and what this does in simple terms is it makes three layers of shape 64 64 and n dimms of a layer norm linear layer basically so what essentially they implement in the original ddpg paper and the final activation will be a tan hyperbolic very very simple very very elegant we don't have to handle compiling or anything like that that gets handled by sonnet under the hood critic network another sequential network will have what's called a critic multiplexer and what this does is that it concatenates the observations and actions as we do in the original paper then we have another layer norm mlp 64 64 and one of course one because this is the critic network it's outputs a single value next we have our agent it's a ddpg agent we have to pass in a little bit more this time i call it env spec up here i did envy spec and then we have our policy network critic network our observation network i'm not entirely sure what this is again i haven't dug into everything yet i'm just kind of excited about this and showing it to you guys as i go along and then whether or not we want to checkpoint our agents learning we'll set that to false because we don't really care at this point then we need our env loop again acme environment environment loop pass in our environment and our agent and run for just say 10 episodes and then app.run main all right so yeah it expects more spaces there and that was my only error so let's go ahead and run this and see if i made any typos ddpg pendulum one not found that's funny i must have different versions in my different uh directories for this obviously i've done this before so and the other one it didn't find pendulum v0 so let's switch that out to v0 and try again and there we go it is running already so you can see how incredibly easy that is they do have one warning here uh this looks like uh they put gradient tape dot gradient inside of a loop or something i didn't do that let me scroll down so you can see that that's something in their own code i'll probably take a look at that when i write the td 3 agent because it'll be relevant to the project but yeah otherwise you can see how incredibly 
simple it is to run this and it did 10 iterations very very quickly so that is how you code agents like deepmind now obviously some things that stick out to me are that one there is a lot of complexity there because deep reinforcement learning at the scale they do it is incredibly complex another thing that sticks out is that there as far as i've seen and i could be wrong but as far as i've seen there's no simple way of running agents in parallel so with something like r2d2 which is a recurrent distributed dqn agent they run in the paper 256 actors in parallel and the agent they supply here is a single threaded version there's no multi-threaded uh framework to stick on top of it now of course you're free to do stuff like uh python multi-threaded stuff or maybe even mpi type stuff on top of it but it would be helpful to see how they do that but of course they want to keep their secret sauce to themselves they can't you know give everything to the world because they need a competitive advantage right they need to keep something some cards close to their chest which i totally and completely understand other thing is that their hyper parameters are not tuned so i played around a little bit with these two agents and couldn't get very good learning um and so that probably means i needed to tune hyper parameters but for the purpose of this video i didn't get too excited about that so that's something to keep in mind is that even if you do use this it's not totally off the shelf you do have to have to do a little bit of tuning but you know they've done you know 90 of the work already so you know what more do you really want right okay i hope that was helpful for you guys this is an enormously interesting framework for me i've been digging into it trying to implement something called never give up which is a highly complex algorithm and so i found this enormously useful it's something i wanted to share with you guys something i'm going to dig deeper into in the coming weeks and months and i hope you will join me for those videos go ahead and give a subscribe if you haven't already drop any questions down below and i'll see you in the next videoin today's video we are going to use deepmind's very own deep reinforcement learning framework acme to code up a couple of simple reinforcement learning agents now as you might imagine acme is an incredibly complex framework as such we are not going to have a whole lot of time to go over all of the intricacies of it today instead in a couple future videos i will look at how to create custom reinforcement learning agents using this framework in particular we're going to implement soft androcritic and td3 as these have not been implemented by the deep mind team now if you don't know anything about those two algorithms never fear i have a couple of videos linked over here in the top left so the first thing we are going to want to do is copy the code from the github so we're going to copy the git link and then head to the terminal and do a git clone so we'll do that first we will get clone and that will clone into the acme directory and then we're going to want to create our very own virtual environment for this reason being there are a number of dependencies for this which will probably mess up your base python installation if you should do this just willy-nilly without a virtual environment so we'll do making our own virtual environment we will then activate that of course we have the cd into acme and then do source spin activate and then it will work then we 