How to Manage GPU Resource Utilization in TensorFlow and Keras

I don't believe in paying more for less, so I don't have a Mac, although I suppose I could build a Hackintosh, which might be kind of fun. I suspect this will work on Apple products as well, because they're POSIX based, built on a BSD base, so the environment variable is probably the same; if not, you can find it with a Google search. The first thing you want to do is make sure you import os, because we're going to be tinkering with the environment variables. Sorry about that: my instance of OBS Studio bugged out right after I was extolling the virtues of Linux. Of course it would do that at the most inopportune time. The idea here is that we're going to use some environment variables to set the parameters for our GPU.

So, I'll give you a couple of bonus ones if you have a multi-GPU system, because they're quite helpful. First off, make sure you have imported os; that's always important when you're dealing with environment variables. You want to set os.environ, and the first parameter I'm going to set is CUDA_DEVICE_ORDER, equal to PCI_BUS_ID. What this does is order the CUDA devices by PCI bus ID. When we were looking at the nvidia-smi output, it listed my GPUs as 0 and 1, and this makes CUDA designate them as 0 and 1 in the same way. This is really helpful when you're running multi-GPU setups. If you wanted to run, let's say, four sets of parameters in the Breakout environment, you'd have two sets on each GPU, and you would pass in a command-line parameter that sets the device as well as the parameters you want to test in your environment.
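As a rough sketch of the command-line idea above, the snippet below sets CUDA_DEVICE_ORDER and parses a GPU index together with one hyperparameter. The flag names --gpu and --lr and their defaults are hypothetical, chosen only for illustration; they are not taken from the original scripts.

```python
import argparse
import os

# Enumerate GPUs in PCI bus order so the indices match the nvidia-smi listing.
# This should be set before TensorFlow, Keras, or PyTorch initializes CUDA.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"


def parse_args(argv=None):
    """Parse hypothetical per-run options: a GPU index and a learning rate."""
    parser = argparse.ArgumentParser(description="Launch one training run")
    parser.add_argument("--gpu", type=int, default=0,
                        help="GPU index as listed by nvidia-smi")
    parser.add_argument("--lr", type=float, default=0.001,
                        help="learning rate to test in this run")
    return parser.parse_args(argv)


# An explicit argument list is used here for illustration;
# a real script would call parse_args() with no arguments.
args = parse_args(["--gpu", "1", "--lr", "0.0005"])
print(f"run on GPU {args.gpu} with learning rate {args.lr}")
```

Each terminal tab could then start the same script with a different --gpu and --lr combination.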

The next one you want is CUDA_VISIBLE_DEVICES, and in this case that equals 0. Note that we're passing it as a string; all of these values must be strings. This sends this particular instance of whatever we run here, TensorFlow, Keras, or even PyTorch, to whichever GPU you designate, so 0 or 1 in this case; if your GPUs have other designations by PCI bus ID, use those. The variable we're most interested in for the purposes of this tutorial is TF_FORCE_GPU_ALLOW_GROWTH, and that gets set to true in lowercase: not with the first letter capitalized, as you would write a Boolean in Python, just lowercase true.
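Put together, the three settings described so far might look like the sketch below. It assumes the variables are set at the very top of the script, before TensorFlow or Keras is imported, so that they are in place when the GPU is first initialized.

```python
import os

# All values are strings, as environment variables always are.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"    # number GPUs as nvidia-smi does
os.environ["CUDA_VISIBLE_DEVICES"] = "0"          # expose only GPU 0 to this process
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"  # lowercase, not Python's True

# Import TensorFlow or Keras only after this point, for example:
# import tensorflow as tf
```

Commenting out the last assignment gives the default behavior, which is what the before-and-after comparison that follows relies on.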

So, let's save that, go back to the terminal, and run this to see the actual output. Actually, let's do this: let's comment this out to see what it uses by default, then come back, uncomment it, and see how much we save. So let's head to the terminal, clear all of this, and run the Keras DDQN main script with Python. This is from my most recent video, on a double deep Q network, which you should check out. Okay, it is running now.

Let's see how much RAM we are using. In this case it uses 10,612 megabytes, almost the entire 10,989 megabytes, and you can see that it goes on GPU ID 0, while GPU 1 is running all the stuff associated with the operating system. Let's stop that. We're just going to use nano, because all we need to do is uncomment one line. So let's uncomment that, press Ctrl-X, save, run it one more time, and see what we get.
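The readings above come from nvidia-smi. For anyone who would rather log the numbers than read them off the screen, here is a minimal sketch that parses nvidia-smi's CSV query output. The helper names are made up for illustration, and the sample text is hand-written from the figures quoted in this tutorial; the total for GPU 1 is an assumption.

```python
import subprocess

# Ask nvidia-smi for plain CSV: one line per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_report(text):
    """Turn 'index, used, total' lines into {index: (used_mb, total_mb)}."""
    usage = {}
    for line in text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def query_gpu_memory():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_report(result.stdout)


# Hand-written sample shaped like the query output.
# GPU 0 figures are from the default run; GPU 1's total is assumed.
sample = "0, 10612, 10989\n1, 1787, 10989\n"
for gpu, (used, total) in parse_memory_report(sample).items():
    print(f"GPU {gpu}: {used} of {total} MB in use")
```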

Okay, let's run this bad boy again, and we can see it is using 368 megabytes to run. I believe the PyTorch implementation used more than that; I don't remember, but either way it's not using much RAM. In this case, if you wanted to test with the Lunar Lander environment, you could run close to 30 different variations, if you wanted to keep track of 30 tabs in your terminal. So it's quite useful for hyperparameter tuning and testing.
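The "close to 30" estimate is simple integer division of the card's memory by the per-run footprint. The sketch below reproduces it with the figures quoted in this tutorial; the function name is hypothetical. Because memory growth lets a process expand later, the result is an upper bound and not a guarantee.

```python
def max_concurrent_runs(total_mb, per_run_mb, reserved_mb=0):
    """Upper bound on how many runs of a given footprint fit in GPU memory."""
    if per_run_mb <= 0:
        raise ValueError("per_run_mb must be positive")
    return max(0, (total_mb - reserved_mb) // per_run_mb)


TOTAL_MB = 10989  # total memory reported for GPU 0

# Keras DDQN with memory growth enabled (368 MB per run).
print(max_concurrent_runs(TOTAL_MB, 368))

# TensorFlow DQN Breakout with allow_growth (4,744 MB per run).
print(max_concurrent_runs(TOTAL_MB, 4744))
```

The second figure matches the observation elsewhere in the video that two Breakout agents fit on the card but a third does not.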

And you can verify that it does run on GPU 0, so it does everything we would expect it to do, which is quite handy. That works with Keras. Of course, the variable TF_FORCE_GPU_ALLOW_GROWTH won't work with PyTorch, but CUDA_VISIBLE_DEVICES and CUDA_DEVICE_ORDER will work with PyTorch as well. That allows you to move a run from one GPU to another, and of course PyTorch has a more sane method of allocating VRAM from the very beginning.
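Because CUDA_VISIBLE_DEVICES and CUDA_DEVICE_ORDER are read by CUDA itself, one framework-agnostic way to move a run between GPUs is to start it as a child process with a modified environment. The sketch below assumes that approach; the helper names and the train.py script are hypothetical.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes a single GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch(script, gpu_index, extra_args=()):
    """Start a script as a child process pinned to one GPU (not called here)."""
    command = [sys.executable, script, *extra_args]
    return subprocess.Popen(command, env=env_for_gpu(gpu_index))


# Moving a hypothetical script from GPU 0 to GPU 1 only changes the index:
#   launch("train.py", 1, ["--lr", "0.0005"])
print(env_for_gpu(1, base_env={})["CUDA_VISIBLE_DEVICES"])
```

Inside the child process the selected card appears as the only device, so the training code itself does not need to change.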

I hope this has been helpful. If you found it helpful, then please share this, like, comment, and subscribe, and I look forward to seeing you all in the next video.

"WEBVTTKind: captionsLanguage: enwhat is up everybody in today's tutorial you are learn how to do GPU resource management in tensorflow and Karros it's gonna be relatively light tutorial all you need is a working installation of tensorflow and caris and you're good to go let's get started so what precisely do I mean with resource allocation so let's go ahead and take a look at the terminal to see what I mean exactly so let's clear this out so you can see what I mean now if you're in a POSIX type system you know like Linux let's say which you should be by the way if you type in video - SMI you get a list of all of the running processes on your GPU now you can see that there are two GPUs here and on one of them it is not utilizing any of the RAM and on the second one I'm utilizing 1787 megabytes and this is just for regular processes associated running with ex-work which is basically the the shell the the graphical user interface for 4k ubuntu so that's all dedicated to operating system type stuff now if we come over here and we type python main torch lunar lander dot pi what we're gonna see is now then i mistyped python typical what we're gonna see is that it'll spool up an instance of python of torch and it will start executing the lunar lander environment so let's come back here and type the same thing and see what we get we can see that it is utilizing 812 megabytes on GPU 0 so it's running on GPU 0 and only taking up about 812 megabytes okay that's pretty lean that's a pretty good operation I like that so let's go ahead and stop it and clear this out so we so we don't get confused now let's do the same thing with the tensorflow code for break out let's say python main tensorflow dqn break out that pi this is from my one of my videos where we're gonna play a break out and it will go ahead and start running on the GPU let's see it's loading and should be running momentarily yeah it's running now let's go ahead and type the same thing and see what we get what we 
see here is that is using ten thousand eight hundred and ninety-four of megabytes now I don't have a functional implementation of this code in PI 2 or CH but I can assure you it doesn't need that much RAM in particular you can see that it is utilizing almost all of the vram now the reason it does this is buried in the tensor flow documentation you may be wondering is this simply a consequence at the fact that we have a huge memory associated with the replay memory for the breakout environment no that's not the case the the case is that tensorflow automatically allocates all of the vram on your GPU now this may not be a problem except for the fact that then you can't really run more models on the GPU so if you want to run more than one thing on your GPU then running out of memory is most definitely a problem or if you want to run let's say even in particular you want to run a PI torch implementation of something and a tensorflow implementation of something or care us you can't do it because tensorflow hogs all of the vram now I'm gonna show you how to take this from hogging all of the vram and going down into something more reasonable and the reason it does this is because and it's buried within the documentation is that tensorflow wants to prevent memory fragmentation you remember the old days of Windows on regular hard drives where Windows would do this funny thing where it would store fragments of files in different places in your hard drive and so you would have seek times associated with bouncing around on the hard drive trying to find the relevant data for whatever program you're running with solid state disks that doesn't really matter because you know they're much much faster than hard disk so you don't really do any disk defragment ting on a solid-state disk cause you shouldn't even bother and of course Linux doesn't do that so you should be running the superior operating system to Windows Linux of course but I digress so tensorflow does it to prevent 
memory fragmentation but there is a way around this I want to show it to you presently so let's go ahead and head back to our code editor and see a couple different ways of fixing this so there are two different ways of fixing this and both in the documentation we're gonna focus on the way that works best with vanilla tensorflow first and then I'll show you the way that works most most bestest with Karis so the magic happens here when we go ahead and call the session so we have an object called a TF dot config proto and it's basically an object that tells you how to initialize the variables of basically the environment you know how you want to run your simulations and so next thing you want to do is say config GPU underscore options not allow growth equals true and what this will do is will only allocate as much memory as it thinks is necessary to begin with and it will expand that memory as it becomes necessary and then when you invoke your session you want to say config equals config so let's go ahead and save that and we'll head back to the terminal and run the Dre breakout program again so just a quick reminder when we last ran the breakout program at use ten thousand eight hundred and ninety four megabytes so I already stopped it let's go ahead and run it again and see what we get so it's loading up everything giving me a bunch of errors which I started to get after updating to tensorflow 114 I don't know what that's about all the code still runs just fine so I tend to ignore it if anybody knows the option to suppress those warnings please let me know in the comments down below because they annoy the crap out of me so let's go ahead and see how much RAM we are utilizing now so perfect it went from ten thousand eight hundred and ninety-four down to four thousand seven hundred and forty four megabytes which is a far more manageable number and so this would allow you to run multiple different agents on one GPU let's go ahead and run it again see if it grew a 
little bit no it did not so let's come back here and actually let's test this let's go ahead and try it not ddq n we want to say python main TF DQ n breakout and let's see if we can actually run two different models at the same time previously this would give you a CUDA out of memory error here it's okay loading the dynamic libraries all signs look good and there it goes it is playing a second game so why is this useful so this is really useful if you want to perform hyper parameter tuning in real time so what you would do is you would refer back to my previous video on how to automate testing of agents and reinforcement learning using the command line and you would incorporate some extra parameters to enable this particular feature of allowing the GPU memory growth and then you would run to different agents three different agents in fact let's go over here and see how much vram are utilizing now let's see and it's double precisely of what you would expect 94 81 pretty good so we could we couldn't fit in a third model unfortunately we only do two at one time but hey you can't have everything but the point is that you can take two different learning rates or even two different model architectures and run them in parallel and get output to see which one you like better which one looks more promising and then take that one and then perhaps test it against another set of parameters and a round-robin type style so that is quite useful in tensorflow what about Kara so in care us we don't have the config so you know can't care us obscures all that from us so how would we do that in care us let's head back to the code editor and find out so I'll go ahead and upload this to my github you may have to hunt around for it I can actually maybe I'll even put it in the readme at some point I don't really want this to be obscured because I think this is an important feature of tensorflow and it's it's kind of buried in the documentation you really have to go looking for it but I'll 
go ahead and comment this out and upload this to the github and if I can remember to do it I will add it to the readme which I need to kind of go through the readme and redo it basically because it's really outdated but I'll update the readme if at all possible so this is for the tensorflow implementation of a deep queue network and this was tested in the breakout environment now how would you do it in Kharis so there is a different way of doing this and it's a little bit more platform specific I haven't tested it on my Windows implementation I don't have access to a Mac I don't believe in paying more or less so I don't have a Mac although I suppose I could build a hackintosh that might be kind of fun but I suspect that this will work on Apple products as well because they're POSIX based on their previous dBase so the environment variable is probably the same if not you could find it on a with a Google search but the first thing you want to do is you want to make sure you import OS because we're going to be tinkering with the environment variables so sorry about that so my instance of OBS studio bugged out right after i was extremley the virtues of linux of course it would do that at the most inopportune time but the idea here is that we're gonna make use of some environment variables to actually set the parameters for our GPU so I'll give you a couple bonus ones if you have a multi GPU system because they're quite helpful so first off make sure that you have imported OS that's always important when you're dealing with environment variables so you want to set OS dot and Biron and the first parameter I'm going to set is CUDA device device order equals equals PCI + ID now what this will do is set the CUDA device or order variable to the PCI bus ID so when we were looking at the Nvidia - yes in my output how it had 0 & 1 it means that it will designate my GPUs as 0 & 1 this is really helpful when you are running multi-gpu setups if you in particular one to run let's 
say for sets of parameters you would have you know if the Breakout environment you'd have two sets on each GPU and then you would pass in a command parameter that would set the actual device as well as the parameters you wanted to test in your environment and then the next command you want is CUDA visible devices and that equals in this case is zero nasca passing is a string all inputs this must be a string and what this will do is it'll send this particular instance of whatever we run here tensorflow Karos or even pi torch this works in pi torch as well it'll send all of it to whichever GP you designate so zero or one in this case and if you have other designations than by your PCI bus ID then you'll use other designations the variable were most interested in for the purposes of this tutorial is the TF force GPU allow growth and that gets set to true in lowercase not capital case all right not not first letter capitalised as you would use in a boolean variable with Python just lowercase true so let's go ahead and save that now we'll go back to the terminal and start running this to see the actual output of it actually lets you know let's do this let's let's go ahead and comment this out to see what it uses by default then we'll come back and set its uncomment it and see how much we save so let's head to the terminal and do that so let's come here and clear all of this and we will run Python main-care us ddq n this is from my most recent video which you should check out on a double-deep q network ok so it is running now let's see how much ram we are using so in this case uses ten thousand six hundred and twelve megabytes almost the entire ten thousand nine hundred eighty nine megabytes and you can see that goes on GPU ID zero here and then GPU one is running all the stuff associated with the operating system so let's go ahead and stop that and then we're gonna go we're just going to use nano because all we need to do is uncomment one line so let's go ahead comment 
that control-x save and then run it one more time and let's see what we get okay let's run this bad boy again and we can see it is using 368 megabytes to run I believe the the PI torch implementation was more than that I don't remember but either way it's not using much RAM and in this case if you wanted to test with the lunar lander environment you could do I don't know close to 30 different variations if you wanted to keep track of 30 tabs on your terminal so quite useful for hyper parameter tuning and testing and you can verify it does run on GPU 0 so it does everything we would expect it to do so that is quite handy that works with Karos the of course the variable TF force GPU allow growth won't work with torch but the CUDA visible devices include CUDA device order will work with PI torch as well that allows you to transfer it from 1 GB to another and of course PI torch has a more sane method of allocating vram from the very beginning I hope this has been helpful if you found it helpful then please share this like comment subscribe and I look forward to seeing you all in the next videowhat is up everybody in today's tutorial you are learn how to do GPU resource management in tensorflow and Karros it's gonna be relatively light tutorial all you need is a working installation of tensorflow and caris and you're good to go let's get started so what precisely do I mean with resource allocation so let's go ahead and take a look at the terminal to see what I mean exactly so let's clear this out so you can see what I mean now if you're in a POSIX type system you know like Linux let's say which you should be by the way if you type in video - SMI you get a list of all of the running processes on your GPU now you can see that there are two GPUs here and on one of them it is not utilizing any of the RAM and on the second one I'm utilizing 1787 megabytes and this is just for regular processes associated running with ex-work which is basically the the shell the the graphical 
user interface for 4k ubuntu so that's all dedicated to operating system type stuff now if we come over here and we type python main torch lunar lander dot pi what we're gonna see is now then i mistyped python typical what we're gonna see is that it'll spool up an instance of python of torch and it will start executing the lunar lander environment so let's come back here and type the same thing and see what we get we can see that it is utilizing 812 megabytes on GPU 0 so it's running on GPU 0 and only taking up about 812 megabytes okay that's pretty lean that's a pretty good operation I like that so let's go ahead and stop it and clear this out so we so we don't get confused now let's do the same thing with the tensorflow code for break out let's say python main tensorflow dqn break out that pi this is from my one of my videos where we're gonna play a break out and it will go ahead and start running on the GPU let's see it's loading and should be running momentarily yeah it's running now let's go ahead and type the same thing and see what we get what we see here is that is using ten thousand eight hundred and ninety-four of megabytes now I don't have a functional implementation of this code in PI 2 or CH but I can assure you it doesn't need that much RAM in particular you can see that it is utilizing almost all of the vram now the reason it does this is buried in the tensor flow documentation you may be wondering is this simply a consequence at the fact that we have a huge memory associated with the replay memory for the breakout environment no that's not the case the the case is that tensorflow automatically allocates all of the vram on your GPU now this may not be a problem except for the fact that then you can't really run more models on the GPU so if you want to run more than one thing on your GPU then running out of memory is most definitely a problem or if you want to run let's say even in particular you want to run a PI torch implementation of something and 
a tensorflow implementation of something or care us you can't do it because tensorflow hogs all of the vram now I'm gonna show you how to take this from hogging all of the vram and going down into something more reasonable and the reason it does this is because and it's buried within the documentation is that tensorflow wants to prevent memory fragmentation you remember the old days of Windows on regular hard drives where Windows would do this funny thing where it would store fragments of files in different places in your hard drive and so you would have seek times associated with bouncing around on the hard drive trying to find the relevant data for whatever program you're running with solid state disks that doesn't really matter because you know they're much much faster than hard disk so you don't really do any disk defragment ting on a solid-state disk cause you shouldn't even bother and of course Linux doesn't do that so you should be running the superior operating system to Windows Linux of course but I digress so tensorflow does it to prevent memory fragmentation but there is a way around this I want to show it to you presently so let's go ahead and head back to our code editor and see a couple different ways of fixing this so there are two different ways of fixing this and both in the documentation we're gonna focus on the way that works best with vanilla tensorflow first and then I'll show you the way that works most most bestest with Karis so the magic happens here when we go ahead and call the session so we have an object called a TF dot config proto and it's basically an object that tells you how to initialize the variables of basically the environment you know how you want to run your simulations and so next thing you want to do is say config GPU underscore options not allow growth equals true and what this will do is will only allocate as much memory as it thinks is necessary to begin with and it will expand that memory as it becomes necessary and then 
when you invoke your session you want to say config equals config so let's go ahead and save that and we'll head back to the terminal and run the Dre breakout program again so just a quick reminder when we last ran the breakout program at use ten thousand eight hundred and ninety four megabytes so I already stopped it let's go ahead and run it again and see what we get so it's loading up everything giving me a bunch of errors which I started to get after updating to tensorflow 114 I don't know what that's about all the code still runs just fine so I tend to ignore it if anybody knows the option to suppress those warnings please let me know in the comments down below because they annoy the crap out of me so let's go ahead and see how much RAM we are utilizing now so perfect it went from ten thousand eight hundred and ninety-four down to four thousand seven hundred and forty four megabytes which is a far more manageable number and so this would allow you to run multiple different agents on one GPU let's go ahead and run it again see if it grew a little bit no it did not so let's come back here and actually let's test this let's go ahead and try it not ddq n we want to say python main TF DQ n breakout and let's see if we can actually run two different models at the same time previously this would give you a CUDA out of memory error here it's okay loading the dynamic libraries all signs look good and there it goes it is playing a second game so why is this useful so this is really useful if you want to perform hyper parameter tuning in real time so what you would do is you would refer back to my previous video on how to automate testing of agents and reinforcement learning using the command line and you would incorporate some extra parameters to enable this particular feature of allowing the GPU memory growth and then you would run to different agents three different agents in fact let's go over here and see how much vram are utilizing now let's see and it's double 
precisely of what you would expect 94 81 pretty good so we could we couldn't fit in a third model unfortunately we only do two at one time but hey you can't have everything but the point is that you can take two different learning rates or even two different model architectures and run them in parallel and get output to see which one you like better which one looks more promising and then take that one and then perhaps test it against another set of parameters and a round-robin type style so that is quite useful in tensorflow what about Kara so in care us we don't have the config so you know can't care us obscures all that from us so how would we do that in care us let's head back to the code editor and find out so I'll go ahead and upload this to my github you may have to hunt around for it I can actually maybe I'll even put it in the readme at some point I don't really want this to be obscured because I think this is an important feature of tensorflow and it's it's kind of buried in the documentation you really have to go looking for it but I'll go ahead and comment this out and upload this to the github and if I can remember to do it I will add it to the readme which I need to kind of go through the readme and redo it basically because it's really outdated but I'll update the readme if at all possible so this is for the tensorflow implementation of a deep queue network and this was tested in the breakout environment now how would you do it in Kharis so there is a different way of doing this and it's a little bit more platform specific I haven't tested it on my Windows implementation I don't have access to a Mac I don't believe in paying more or less so I don't have a Mac although I suppose I could build a hackintosh that might be kind of fun but I suspect that this will work on Apple products as well because they're POSIX based on their previous dBase so the environment variable is probably the same if not you could find it on a with a Google search but the 
first thing you want to do is you want to make sure you import OS because we're going to be tinkering with the environment variables so sorry about that so my instance of OBS studio bugged out right after i was extremley the virtues of linux of course it would do that at the most inopportune time but the idea here is that we're gonna make use of some environment variables to actually set the parameters for our GPU so I'll give you a couple bonus ones if you have a multi GPU system because they're quite helpful so first off make sure that you have imported OS that's always important when you're dealing with environment variables so you want to set OS dot and Biron and the first parameter I'm going to set is CUDA device device order equals equals PCI + ID now what this will do is set the CUDA device or order variable to the PCI bus ID so when we were looking at the Nvidia - yes in my output how it had 0 & 1 it means that it will designate my GPUs as 0 & 1 this is really helpful when you are running multi-gpu setups if you in particular one to run let's say for sets of parameters you would have you know if the Breakout environment you'd have two sets on each GPU and then you would pass in a command parameter that would set the actual device as well as the parameters you wanted to test in your environment and then the next command you want is CUDA visible devices and that equals in this case is zero nasca passing is a string all inputs this must be a string and what this will do is it'll send this particular instance of whatever we run here tensorflow Karos or even pi torch this works in pi torch as well it'll send all of it to whichever GP you designate so zero or one in this case and if you have other designations than by your PCI bus ID then you'll use other designations the variable were most interested in for the purposes of this tutorial is the TF force GPU allow growth and that gets set to true in lowercase not capital case all right not not first letter 
capitalised as you would use in a boolean variable with Python just lowercase true so let's go ahead and save that now we'll go back to the terminal and start running this to see the actual output of it actually lets you know let's do this let's let's go ahead and comment this out to see what it uses by default then we'll come back and set its uncomment it and see how much we save so let's head to the terminal and do that so let's come here and clear all of this and we will run Python main-care us ddq n this is from my most recent video which you should check out on a double-deep q network ok so it is running now let's see how much ram we are using so in this case uses ten thousand six hundred and twelve megabytes almost the entire ten thousand nine hundred eighty nine megabytes and you can see that goes on GPU ID zero here and then GPU one is running all the stuff associated with the operating system so let's go ahead and stop that and then we're gonna go we're just going to use nano because all we need to do is uncomment one line so let's go ahead comment that control-x save and then run it one more time and let's see what we get okay let's run this bad boy again and we can see it is using 368 megabytes to run I believe the the PI torch implementation was more than that I don't remember but either way it's not using much RAM and in this case if you wanted to test with the lunar lander environment you could do I don't know close to 30 different variations if you wanted to keep track of 30 tabs on your terminal so quite useful for hyper parameter tuning and testing and you can verify it does run on GPU 0 so it does everything we would expect it to do so that is quite handy that works with Karos the of course the variable TF force GPU allow growth won't work with torch but the CUDA visible devices include CUDA device order will work with PI torch as well that allows you to transfer it from 1 GB to another and of course PI torch has a more sane method of allocating 
vram from the very beginning I hope this has been helpful if you found it helpful then please share this like comment subscribe and I look forward to seeing you all in the next video\n"
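The vanilla TensorFlow approach from the transcript (a ConfigProto with allow_growth, passed to the session) can be sketched roughly as follows. This is a minimal sketch, not the original training code: it assumes the session-style API reached through tf.compat.v1, and the function name is hypothetical.

```python
def make_growth_session():
    """Create a TensorFlow session whose GPU memory grows on demand."""
    # Imported inside the function so that importing this module
    # does not itself initialize TensorFlow.
    import tensorflow as tf

    # Start with a small allocation and expand it as needed,
    # instead of reserving nearly all VRAM up front.
    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = True

    # Pass the configuration when creating the session.
    return tf.compat.v1.Session(config=config)
```

An agent that builds its own session would call make_growth_session() where it previously created a plain session.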