Running ROCm Under VMware 8 on the AMD MI210 - ROCm You Like A Hurricane

Getting Started with ROCm on VMware

To begin, edit the virtual machine's settings: go to VM Options, then Advanced, then Edit Configuration. This is where it is a good idea to set pciPassthru.use64bitMMIO to TRUE and pciPassthru.64bitMMIOSizeGB to 128. Don't use Edge for this; Chrome is the safer choice, because the Add Parameter dialog in the ESXi web interface is buggy. VMware doesn't really expect you to manage things from ESXi directly; it wants you to use vSphere. With the parameters in place, we can add our PCIe device. This is a good place to start, because it gets the virtual machine up and running with ROCm.
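The transcript names the two advanced parameters set at this step, pciPassthru.use64bitMMIO and pciPassthru.64bitMMIOSizeGB, with a size of 128. As a minimal sketch, the Python below renders them in the key = "value" form used by .vmx configuration files. The function names and the idea of generating the lines from a script are illustrative assumptions, not part of the video's workflow; the same keys and values can simply be typed into the Add Parameter dialog.

```python
def mmio_passthrough_params(size_gb: int = 128) -> dict[str, str]:
    """Return the advanced VM parameters for 64-bit MMIO passthrough.

    The keys and the 128 GB default are the values named in the video;
    this helper itself is a hypothetical convenience.
    """
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    return {
        "pciPassthru.use64bitMMIO": "TRUE",
        "pciPassthru.64bitMMIOSizeGB": str(size_gb),
    }


def render_vmx_lines(params: dict[str, str]) -> list[str]:
    """Format parameters as .vmx-style key = "value" lines."""
    return [f'{key} = "{value}"' for key, value in params.items()]


if __name__ == "__main__":
    for line in render_vmx_lines(mmio_passthrough_params()):
        print(line)
```

Keeping the size as an argument makes it easy to try a different MMIO window if a VM with more passed-through devices refuses to power on.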

While installing ROCm in the VM, you may hit an error. It usually just means a kernel update arrived during the install: reboot, rerun the installer, and it's fine. When I first set this up I was on ROCm 5.3. For this video I'm on a fresh install of Ubuntu 22.04 LTS, so I went through the whole setup again to confirm it is all still current.
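On Ubuntu, a pending reboot after a kernel update is normally signalled by the file /var/run/reboot-required. A small sketch of checking for it before rerunning the installer follows. The function name is hypothetical, and the path is a parameter so the check can be exercised without touching the real system.

```python
from pathlib import Path


def reboot_pending(flag: Path = Path("/var/run/reboot-required")) -> bool:
    """Return True when Ubuntu has flagged that a reboot is needed."""
    return flag.exists()


if __name__ == "__main__":
    if reboot_pending():
        print("Reboot first, then rerun the ROCm installer.")
    else:
        print("No reboot flagged; the installer error may have another cause.")
```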

Another issue I ran into was not the simple kernel-update problem. It matches an issue on GitHub: the DKMS build of the kernel module fails because dma_resv->seq is missing. This originally happened back on kernel 5.9 and is happening again on 5.19, because the kernel's internals keep changing. Once we're past the DKMS kerfuffle, we can run the ROCm Validation Suite: install it with apt, then run it and look at the output.
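Because the DKMS failure is tied to specific kernel versions, it can help to check the running kernel before installing. The sketch below parses a release string such as the one returned by platform.release() and flags kernels at or above 5.19, the version named in the video. The threshold check is an illustrative heuristic, not an official compatibility rule, and newer ROCm releases may build cleanly on these kernels.

```python
import platform
import re


def kernel_version(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string like '5.19.0-32-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def may_hit_dkms_issue(release: str, threshold: tuple[int, int] = (5, 19)) -> bool:
    """Heuristic (assumption): kernels at or above threshold may break the build."""
    return kernel_version(release) >= threshold


if __name__ == "__main__":
    release = platform.release()
    print(release, "-> check the DKMS build:", may_hit_dkms_issue(release))
```

Comparing integer tuples rather than strings matters here: as text, "5.9" sorts after "5.19", which is exactly the pair of versions involved.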

In our case, because we've only passed through one GPU, the output shows a single device: 42 W of current draw, a 300 W power cap, and a temperature of 39 degrees Celsius. Not bad. The host itself carries far more: each node of the Supermicro BigTwin has three Instinct MI210 cards, which is a ridiculous amount of graphics horsepower.
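To put the reading in context, the draw can be expressed as a fraction of the power cap. This is plain arithmetic on the two numbers quoted in the video (42 W and 300 W); the function name is illustrative.

```python
def power_cap_fraction(draw_watts: float, cap_watts: float) -> float:
    """Return current power draw as a fraction of the power cap."""
    if cap_watts <= 0:
        raise ValueError("cap_watts must be positive")
    return draw_watts / cap_watts


if __name__ == "__main__":
    fraction = power_cap_fraction(42, 300)
    print(f"{fraction:.0%} of the power cap")  # prints "14% of the power cap"
```

A card sitting at roughly 14% of its cap is consistent with a GPU that is visible to the guest but not yet running a workload.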

If you want to go further and install Docker or another containerization or platform-management layer, there is pretty good support for that at this point on the ROCm base. There are still some rough edges, but AMD has come a long way with ROCm 5 and 5.4, and that hard work has not gone unnoticed. It took a while to get support for Ubuntu 22.04, and there were hard-won lessons in spinning up the stack.

One benefit of this setup is that system administrators can allocate virtual machines and assign part of a GPU, or an entire GPU, to a specific research group or team. That team is reasonably sandboxed inside its virtual machines and is unlikely to disturb other users of the same resources. Compute GPUs are expensive, so instead of giving each person dedicated hardware, the load can be shared among many people running different jobs.
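The allocation idea can be sketched as a simple bookkeeping structure. The example below models one node with three GPUs, as in the system from the video, and never hands the same whole GPU to two virtual machines. All names (the GPU labels, VM names, and the GpuPool class) are hypothetical; real assignments are made in vSphere or ESXi, not in Python, and partial-GPU sharing is not modelled here.

```python
from typing import Optional


class GpuPool:
    """Track which VM owns each whole GPU on a host (illustrative only)."""

    def __init__(self, gpu_ids: list[str]) -> None:
        self._owner: dict[str, Optional[str]] = {gpu: None for gpu in gpu_ids}

    def assign(self, vm_name: str) -> str:
        """Give the first free GPU to vm_name and return its id."""
        for gpu, owner in self._owner.items():
            if owner is None:
                self._owner[gpu] = vm_name
                return gpu
        raise RuntimeError("no free GPU on this host")

    def release(self, vm_name: str) -> None:
        """Free every GPU held by vm_name."""
        for gpu, owner in self._owner.items():
            if owner == vm_name:
                self._owner[gpu] = None

    def free_gpus(self) -> list[str]:
        """List GPUs that are not assigned to any VM."""
        return [gpu for gpu, owner in self._owner.items() if owner is None]


if __name__ == "__main__":
    pool = GpuPool(["gpu0", "gpu1", "gpu2"])  # three cards per node
    print(pool.assign("research-group-a"))
    print(pool.free_gpus())
```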

This approach uses resources more efficiently and avoids the overhead and complexity of handing out dedicated hardware, and it works with existing VMware infrastructure. Many people also want to be able to move their virtual machines around. That gets tricky with PCIe passthrough: live migrating a running virtual machine with a passed-through PCIe device does not work unless the device builder builds in state management, so that a job can move from one piece of hardware to another.
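The migration constraint reduces to a simple rule, sketched below under the assumption that a VM is described by a power state and a list of passed-through devices. The VirtualMachine type and its field names are hypothetical and do not correspond to any VMware API; the rule itself is the one stated in the video, that a running VM with a passed-through PCIe device cannot be live migrated unless the device supports state management.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    name: str
    powered_on: bool
    passthrough_devices: list[str] = field(default_factory=list)
    device_state_migratable: bool = False  # assumed vendor capability flag


def migration_plan(vm: VirtualMachine) -> str:
    """Return 'live', or 'cold' when the VM must be powered off to move."""
    if not vm.powered_on:
        return "cold"
    if vm.passthrough_devices and not vm.device_state_migratable:
        return "cold"
    return "live"


if __name__ == "__main__":
    rocm_vm = VirtualMachine("rocm-vm", powered_on=True,
                             passthrough_devices=["MI210"])
    print(rocm_vm.name, "->", migration_plan(rocm_vm))
```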

Overall, getting ROCm running on VMware takes some work: a firmware update, an ESXi driver, a few advanced VM parameters, and possibly a DKMS fix. The payoff is that expensive compute GPUs can be shared among sandboxed virtual machines on your existing VMware infrastructure.

Transcript

AMD has been making great strides with ROCm and their MI and CDNA lines. Today I'm going to show you how to update the firmware on your MI200s so that you can run the ROCm stack under VMware 8, with the new VMware extensions for containers, orchestration, and infrastructure as code. Infrastructure as code is the future, and you need to get your feet wet with that, just a little bit. But also, this is really cool: we can run Stable Diffusion and everything else from the AMD side of things. Yeah, turns out we've got a forum thread on that. That's going to be a different video. Let's dive in.

The first thing you've got to know is that I'm rocking the Supermicro BigTwin 2U two-node system. This thing has six MI210s in it. This is a fabulously expensive system and I've got to send it back soon, but I'm trying to get some quality time with ROCm and container orchestration. See, it's not enough that you have a GPU that can run ROCm. It's not enough that you have systems that are running Ubuntu, are multi-user, and you're able to do stuff. You need a little bit more separation. You need containerization. How are you going to manage this? VMware is a popular choice for that.

This is not a benchmark of VMware or anything like that. This is just to show you the functionality that's available with VMware 8, because there have been a ton of positive improvements in VMware under the hood, and this is just one of them. We can take our six MI210s in this platform and slice them up among any number of virtual machines: the VRAM, the processes, the whole nine yards.

In order to do that, we've got to start by updating the firmware, and you can't really easily update the firmware from within VMware. The easiest way that I've found to do this is to use an Ubuntu Server installer live USB. To begin this process, you basically just create an Ubuntu live USB installer, like you're going to install Ubuntu over your
VMware machine, although we're not going to do that. We're just going to boot off USB. We can do everything via remote management, the IPMI. Once you've got your USB made (I assume you know how to do that), you just boot and you're good to go.

Now, to do this video, with my capture machine and the thing that's running the audio (because who wants to be in that loud server closet?), I've got a setup where I can SSH into it. Because this is an installer, it's a little weird: you can't SSH into the installer out of the box. Well, you can set the password and then SSH into it, and that works fine.

It's also a little weird on the AMD download page: it looks like the download button is missing. It's not; it just takes you to a PDF, and the PDF walks you through making permanent changes to your Ubuntu system. But this is a live installer USB. It's going to be gone as soon as I reboot the machine. That's okay; that's all we need to update this firmware.

This is a really easy process. We're going to follow the guide and just copy and paste in here. The formatting is a little weird because of this PDF. They really could have done a slightly better job here, because with the formatting you end up with some extra line breaks you don't need. So be careful when you're pasting content in here, or just retype the commands, knowing that you probably don't actually have line breaks. Maybe they expect you to retype it and not copy-paste it. No copy-paste? What is this, 1997? Come on.

Anyway, all right, we're good to go. We're able to run the AMD firmware update tool, and I just did -u. The manual talks about update IFW, but I typoed that; you just do -u. It should scan and find the GPUs. Now, if it says your GPU is in use, you can just rmmod amdgpu, because again, this is the installer, and even though it loaded the kernel module, it's not actually using it. The installation guide talks about blacklisting the amdgpu driver and rebooting so that the amdgpu module is not loaded. This is an installer; you don't have to do that, because
we're just doing this to update the firmware for VMware, so this part of the guide doesn't really 100% apply to us. From here we can run the tool and agree to the EULA, and it's going to say, oh, I found your GPUs. You just do -u and it does the update. It'll take a little bit to run, and at that point we want to power down the system in order to proceed.

Now, because this is the 2U two-node system, I've only updated three out of my six GPUs. In a real-world scenario these two nodes could be part of a VMware cluster, assuming my witness is running somewhere else. I could be running it on a Synology. Yes, not recommended, but it works; I've done it. You'll want to finish all this on this node, and then you can migrate your virtual machines from the other node to this one, which will be back to running VMware in just a moment, and then we'll be able to update the other node. That's how that normally works: you migrate everything off the host and then boot from USB to update the firmware. Not a big deal. That's resiliency and good, solid engineering and architecture from VMware, so this kind of update process is not really much of a problem.

For this video I'm going to walk you through a fresh installation of VMware. You want to log into the VMware portal and download your ESXi installer as well as vSphere. vSphere is what you use for managing; ESXi is what runs on the host. Generally you don't want to do much of anything from the ESXi web interface. I know it's there, and I know you can get the free version of ESXi, but look: if you're living in the VMware world, you really should just commit to the $200 a year or whatever it is for the personal home training license, because that unlocks basically everything that VMware has. Yeah, it's still a subscription, and yeah, $200 a year is still kind of a lot of money, but you'll be able to play with all this on your own personal
hardware, non-production hardware, learning kind of stuff. That's kind of why it's there. But for this video I'm just doing the 60-day trial. Don't copy-paste my serial numbers if I forgot to delete those.

You make, again, another bootable USB installer, and you do the installation. You can do it over IPMI or you can do it from the data center. The installation for VMware is pretty straightforward, and if you've ever installed VMware before, well, not a lot is different with VMware 8. If this is your first look at VMware 8: streamlined a bit, but not a lot different. Logging into ESXi, visually it's got a facelift, but you know ESXi. Most of the magic is actually in vSphere, but there are a couple of things you can do from ESXi before you install vSphere, if you want.

The first is to enable SSH. We're going to be installing a VIB here, which is like a VMware driver package, and this is VMware 8 native, so AMD has had support basically from day one for VMware 8, which is pretty awesome. Then we need to log in with SSH and install it. I had to install it by telling it to disable certificate checking. There's also a zip file: sometimes you specify a zip that contains a VIB versus specifying the raw VIB. I specified the raw VIB and it was fine, but I uploaded everything that was in the archive. There wasn't really anything in the firmware update PDF that had exact step-by-step administration instructions. If you've been through the VMware training and certification, you probably already know how to install a network driver; you've probably already been through that for various kinds of things that you might encounter. But that's pretty much all there is to it. It says it doesn't require a reboot.

So the next thing we'll do is make sure our datastore is configured, upload an Ubuntu ISO, and see if we can configure a virtual machine. Now, in case you're not familiar with what we're trying to do with VMware here, this is shared
pass-through graphics compatibility. It is actually even listed on the VMware Compatibility Guide, so VMware does sort of sanction this behavior. MI210 support for this is a little bit bleeding edge, okay, sure, but you'll be able to run this under the Ubuntu operating system. This is different than the VDI infrastructure, and different than other types of GPU-sharing technology. This is really meant for ROCm and the compute side of things. So if all else fails, be sure to check the VMware compatibility list, but you can use this as a rough guide to set it up, start experimenting, and see what the options are.

Now, if you've never done GPUs or virtual functions, or you're messing around with this on a lab machine, or you picked up one of these on eBay, or even the MI25 as far as that goes (there's still plenty you can learn on ancient, ancient GPUs), you might get a lot of frustrating errors from VMware. It's like, oh, it won't start, there's problems. Most of the time it's because you forgot to enable Above 4G Decoding in the BIOS. Sometimes there are some advanced parameters that you need to set on your virtual machine. If you log in via SSH and search for vmware.log, that's usually the easiest place to check. If you've got vSphere installed, you can also get the VMware logs that way. ESXi is often not perfectly reliable as far as this goes; they really want you to use vSphere to do management. I like to do everything from SSH; that's just me. It's usually pretty easy to find that vmware.log; it's usually in your primary datastore.

If your error message is something about the device not powering on, or whatever, that's nothing to do with AMD and nothing to do with VMware. It's something you need to configure so that the memory-mapped I/O doesn't require any translation when it goes from physical to virtual to physical. The other thing is that it's a good idea to set the PCI
passthrough parameters: pciPassthru.use64bitMMIO to TRUE and pciPassthru.64bitMMIOSizeGB to 128. Where you do that is in the advanced properties of your virtual machine. You want to edit the settings and go to VM Options, then Advanced, then Edit Configuration. Don't use Edge for this; you should probably use Chrome, because when you hit Add Parameter there's all kinds of bugs here. They don't even really expect you to use ESXi; they want you to use vSphere. And just like that, we're able to add our PCIe device. Cool, good stuff.

Now let's see if we can get it to show up in Ubuntu in the virtual machine and run ROCm as normal. When you're doing your setup, don't worry if you get an error. It just means that there was a kernel update for your install. Something pops up and says, oh, there are some system updates you need to apply. Not a problem: just reboot and rerun the installer. It's fine.

Now, as I was preparing this video (I've actually been working on this video for a while), when I first set it up I got ROCm 5.3, but I'm using Ubuntu 22.04 LTS. Because this is a fresh install, I'm going through everything again for this video to make sure it's all still current, and I ran into another issue that wasn't the kernel problem where you just update the kernel and reboot. It was an issue on GitHub where it's like, dma_resv->seq is missing. This is trying to build for the kernel. This originally happened back on 5.9, but it's happening again on 5.19, because they're changing the way the kernel works internally.

Once we're past all the DKMS kerfuffle, we're able to run the ROCm Validation Suite. You can just apt install the ROCm Validation Suite and then run it to see what the output is. In our case, because we've only passed through the one GPU, we can see here: 42 watts running currently, 300 watt power cap, currently 39 degrees C. Not bad. If you want to go from here and install something
like Docker or some other containerization or platform management thing, there is actually pretty good support for that at this point with the ROCm base. There's still some rough edges, but AMD's come a long way with ROCm 5 and on 5.4. That hard work has not gone unnoticed. It sort of took a while to get support for Ubuntu 22.04, and there were a lot of hard-won lessons learned spinning up the stack and getting everything going. AMD's got to build momentum and build that gargantuan inertia for these kinds of things, because this level of functionality and this infrastructure will eventually be able to move mountains.

Building this means that system administrators and the people who manage the systems can allocate virtual machines and assign parts of the GPU or the entire GPU to one research group or team, and then that team is reasonably sandboxed with their virtual machines, so they're not necessarily going to disturb other people who are using the same resources. You don't have the same overhead and complexity of giving somebody dedicated hardware resources, dedicated GPU resources in this case. Compute GPUs are expensive, but you can share the load among a whole bunch of different people running a whole bunch of different jobs without really interfering with one another, which is nice. And you get to use it with your existing VMware infrastructure, which is also nice. A lot of people look for that.

A lot of people look to be able to move the virtual machines between containers and all that kind of thing, and it's a little tricky to do when we're talking about passing through PCIe devices. That makes things a little more complicated, obviously, because live migrating a running virtual machine when you're passing through a PCIe device is not going to work unless the device builder builds in a whole bunch of stuff to do state management, so that you could
move a job running on a piece of hardware from one piece of hardware to another. It works a little differently than networking and everything else.

So, I don't know, this has been a quick look at ROCm on the VMware platform: getting that installed, set up, and running on our Supermicro BigTwin 2U two-node system, rocking 32-core AMD EPYC Milan CPUs, 256 gigs of memory per node, and three Instinct MI210 graphics cards. That is a ridiculous amount of graphics horsepower. If you're looking for something a little more pedestrian, we do already have the Stable Diffusion guide on the Level1 forum. gigabuster put that together; you should check that out. This sort of dovetails with that, but this isn't that video. There's another video coming about getting all of that running, and oh boy, those are fast. Shockingly fast. It's all about the software stack, though. I'm Wendell, this is Level1, and I'm signing out. You can find me in the Level1 forums.