BEAST & The GPU Cluster - Computerphile

**Managing and Utilizing GPU Resources**

A key decision when submitting a job is how many GPUs it needs. The job script states the number of GPUs and can optionally request a specific amount of memory and a number of CPU cores. A request that asks for too much, either more GPUs than the user's quota allows or more than are available, is not rejected outright: the job simply sits in the queue until the quota is raised or more GPUs become available.
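As a rough illustration of what such a request looks like, the sketch below builds the text of a Slurm batch script. The function name `build_sbatch_script`, its parameters, and the `train.py` command are hypothetical. The `#SBATCH` options shown (`--gres`, `--mem`, `--cpus-per-task`, `--constraint`) are standard Slurm options, but the values a given cluster accepts depend on its configuration.

```python
def build_sbatch_script(command, gpus=1, mem_gb=None, cpus=None, constraint=None):
    """Return the text of a Slurm batch script requesting the given resources.

    Hypothetical helper for illustration; only the GPU count is mandatory,
    mirroring the optional memory and CPU-core requests described above.
    """
    if gpus < 1:
        raise ValueError("request at least one GPU")
    lines = ["#!/bin/bash", f"#SBATCH --gres=gpu:{gpus}"]
    if mem_gb is not None:
        lines.append(f"#SBATCH --mem={mem_gb}G")
    if cpus is not None:
        lines.append(f"#SBATCH --cpus-per-task={cpus}")
    if constraint is not None:
        # Restrict the job to nodes carrying an admin-defined feature tag.
        lines.append(f"#SBATCH --constraint={constraint}")
    lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_sbatch_script("python train.py", gpus=2, mem_gb=32, cpus=8))
```

The printed text could be saved to a file and handed to `sbatch`. Whether the job starts immediately or waits in the queue is then up to the scheduler.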

**Distributed Computing and Job Submission**

The process begins when a user logs into the login node, Xavier, via SSH and submits a job with the `sbatch` command. The scheduler, Slurm, places the job in a queue and, once the requested resources are free, dispatches it to a suitable compute node such as Beast or Gambit. All of this happens under the hood, so the user needs no knowledge of the underlying infrastructure.
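The queue-then-dispatch behavior can be sketched with a toy model. This is not how Slurm is implemented: the class `TinyScheduler`, the quota value, and the first-fit placement rule are assumptions made for illustration. A job that exceeds the user's quota or cannot fit on any node stays in the queue rather than being rejected.

```python
from collections import deque


class TinyScheduler:
    """Toy GPU scheduler: first-fit placement with a per-user GPU quota."""

    def __init__(self, free_gpus, quota):
        self.free = dict(free_gpus)  # node name -> free GPU count
        self.quota = quota           # max GPUs one user may hold at once
        self.in_use = {}             # user -> GPUs currently held
        self.queue = deque()         # pending (user, gpus) jobs

    def submit(self, user, gpus):
        """Queue a job, then start whatever can run; return started jobs."""
        self.queue.append((user, gpus))
        return self.dispatch()

    def dispatch(self):
        started = []
        still_waiting = deque()
        while self.queue:
            user, gpus = self.queue.popleft()
            node = self._find_node(user, gpus)
            if node is None:
                # Not rejected: the job keeps waiting for quota or hardware.
                still_waiting.append((user, gpus))
                continue
            self.free[node] -= gpus
            self.in_use[user] = self.in_use.get(user, 0) + gpus
            started.append((user, gpus, node))
        self.queue = still_waiting
        return started

    def _find_node(self, user, gpus):
        if self.in_use.get(user, 0) + gpus > self.quota:
            return None
        for node, free in self.free.items():
            if free >= gpus:
                return node
        return None
```

A real scheduler also handles job completion, priorities, memory and CPU requests, and node constraints, all of which this sketch leaves out.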

**Networking and Storage Considerations**

Traffic travels over two separate networks. Users reach the login node over the standard university 1-gigabit-per-second network, while compute nodes access storage over a dedicated 10-gigabit Ethernet network, drawn in a different color in the diagram. A single 10-gigabit link is not quite enough for the file server, so it has two links to the switch. These are combined into a link aggregation group (LAG) using the Link Aggregation Control Protocol (LACP), which decides, in communication with the switch, how traffic is balanced across the two links. Each GPU machine has a single link to the switch, so traffic to a given node always leaves the switch on that one link.
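One common way aggregated links share traffic is to hash each flow onto one link, so that packets of the same flow stay in order. The sketch below assumes a simple hash of source and destination address. The real distribution policy depends on the bonding driver and switch configuration, and the function names and addresses here are hypothetical.

```python
import zlib
from collections import Counter


def pick_link(src_ip, dst_ip, n_links=2):
    """Map a flow to a link index; the same flow always gets the same link."""
    key = f"{src_ip}->{dst_ip}".encode()
    return zlib.crc32(key) % n_links


def flows_per_link(flows, n_links=2):
    """Count how many (src, dst) flows land on each link."""
    return Counter(pick_link(src, dst, n_links) for src, dst in flows)


if __name__ == "__main__":
    # Hypothetical addresses: one file server talking to eight GPU nodes.
    flows = [("10.0.0.1", f"10.0.0.{n}") for n in range(10, 18)]
    print(flows_per_link(flows))
```

Because a single flow is pinned to one link, one GPU node never sees more than one link's worth of bandwidth from the file server. The aggregate helps only when several nodes read at the same time.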

**Challenges and Solutions**

The biggest problem so far has been the network drivers for the 10-gigabit Ethernet cards, which had to be upgraded on all machines. On some second-generation machines, power management for the PCI bus also had to be disabled. Configuration management software keeps a fairly static configuration across all machines, which makes them easier to troubleshoot, upgrade, and administer. A few out-of-band kernel patches and driver installations were needed to find the most reliable configuration, which was then rolled out to the remaining machines.
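The value of a static configuration is that any machine that differs from the baseline stands out. The sketch below shows the idea as a simple comparison. It is not the configuration management tool used on the cluster, which the source does not name, and the setting names and version strings are invented for illustration.

```python
def find_drift(baseline, nodes):
    """Return, per node, the settings that differ from the baseline.

    The result maps node name -> {setting: (expected, actual)}; nodes that
    match the baseline exactly are omitted.
    """
    drift = {}
    for name, settings in nodes.items():
        diffs = {
            key: (expected, settings.get(key))
            for key, expected in baseline.items()
            if settings.get(key) != expected
        }
        if diffs:
            drift[name] = diffs
    return drift


if __name__ == "__main__":
    # Hypothetical settings; real tools track packages, files, and services.
    baseline = {"nic_driver": "5.1", "pci_power_mgmt": "off"}
    nodes = {
        "beast": {"nic_driver": "5.1", "pci_power_mgmt": "off"},
        "gambit": {"nic_driver": "4.9", "pci_power_mgmt": "off"},
    }
    print(find_drift(baseline, nodes))
```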

**Operating System and Architecture**

The GPU machines run CentOS 7 with the latest kernel versions. The file server instead runs FreeBSD, which may be an unusual choice, because the team did not yet trust ZFS on Linux. FreeBSD was felt to be slightly more mature and a better choice for the system that underpins the entire operation.

**Closing Clips**

The transcript ends with short fragments from other videos that are unrelated to the cluster. One of them notes that the term "endianness" (as in little-endian and big-endian systems) traces back to the eggs of Lilliput in Jonathan Swift's "Gulliver's Travels."

"WEBVTTKind: captionsLanguage: enthere's normally an alarm turned on here but things have been in and out so much left so down here are all the cvl gpu well it's quite noisy in here should we go somewhere else yes joe tell me what it is you've been doing on this particular project so i'm a technician here my official job title is it systems engineer but i do a lot of different things within the school um everything from sort of health and safety all the way through to designing gpu computing clusters apparently yeah so i'm actually a researcher here i shouldn't be looking after computers probably but i enjoy it and so yeah i like to look after the software installed on these machines and play around a little bit with the networking side and also some of the hardware but i think uh joe is the real hardware guy i believe you guys need guys to talk to about beast and its siblings would that be right yeah we seem beast in the password cracking video now this is well it's an ssh terminal but this is beast things have moved on since then what's going on now right now we have three generations of gpu compute machines in in the school beast and rogue are the first generation of those machines is the naming regime based on anything specific definitely x-men but i know nothing about it i ask people what they want to name them when we have votes or something like that so yeah the last few have been voted on haven't they but but yeah before that was a bit more autocratic and we just chose the shortest names that we could find in the in the x-men character list on wikipedia yeah there are well over 100 characters that we could choose from at the moment as far as i remember but we've got quite a few spare we're in the computer vision lab and there's probably i don't know 13 to 20 people who are actively doing research here in deep learning so uh there's definitely a need for some kind of computing power here and in many research groups they might just you know install a couple 
of gpus in in everybody's workstation the problem with that is that many people don't need a gpu all the time and when they need more than one or two gpus what do you do you it makes sense to share these resources in sort of a cluster we originally started off designing this infrastructure in a way that was quite home brew if you like we wanted to prove to the research group that this kind of hardware could produce good results um you know impactful results if you like and um so we built the first two generations of machines wasn't it the first generation of beast and rug though based off kind of a workstation platform x99 single processor with four gpus maximum we then uh moved on to a dual processor platform intel uh xeon based there are also four gpu versions also four some of them got three some four depending on the configuration make this couple of them were limited by their cases weren't they yeah so that was deadpool which is mike's machine and we changed the case for that just so we could fit an extra gpu we started off with as joe said earlier you know just beast and rogue they became very popular they were utilized almost continuously over time obviously deep learning has become much more popular hearing about in the news almost every week more more of the research is in the lab more of the phd students are moving towards deep learning from traditional computer vision methods um so the requirement for our hardware has increased significantly um we started off with just these two machines and uh as we've grown it's become almost impossible to manage and to decide which machine to use uh we've entered more of a cluster state now so we have a login node which you log in to submit your jobs from there the scheduler decides which machine your job should be run on depending on what it requires in terms of number of gpus perhaps memory uh even you can have a constraint such as does it have 10 gigabit networking for example so yeah it works in a very similar way 
to a traditional hpc system we tried a couple of other methods of um scheduling gpu like being one of them yeah kind of informal methods they didn't really work out so we had to we had you know there was some kind of uh awkwardness let's say around around those discussions so we had to come up with a system that was fairer um and yeah so aaron had a look into some of the some of the options and uh you settled on slime didn't you no real uh reason for for using slurm it seems to have very nice support in terms of gpu constraints so you can assign a gpu to a user nobody else can touch it for the time that the job is allocated to the to the gpu i think similar things are possible in in older systems such as pbs which was used on the old university hpc that we have here um but slurm seems to be the the way that this sort of technology is moving slurm is an open source scheduling system it's named after the drink and future armor i can't remember what it stands for i think they might have done kind of a background thing and named it afterwards slurm basically decides when your job can run so there's a queue when you submit a job it goes into this queue as as the resources become available which hopefully is almost immediately because we have limitations on how many or how much resources each user can use the problem of course with centralizing like the the login is that you also need to somehow centralize the storage so at the moment we still actually have the ssds installed in each machine there's we call them dp1 db2 db3 etc and there's anywhere between one and three of those in each gpu node but when you're training a convolutional neural network you need to be loading your data quickly and you need to load it many times basically typically we would decide which machine we wanted to use and log into that because we know that our data is stored there problems with this is that you end up duplicating your cop your data across multiple machines sometimes the machine 
that you want to use is just busy so you have to use another one so you copy the data across so you're wasting space and it's just annoying to deal with so if you can centralize this storage for training your networks uh you don't have to worry about you know which machine should i use i'm just going to log into one machine and it will figure out where it should run and you know the data is all stored in the same place now obviously when we had the previous machines they all had their own previous individual ssds and you know you have the standard sata three speeds of six gigabits per second to each one of those uh if you're trying to centralize that you need something which can handle roughly the same performance as all of these combined you're never going to get there but you can try and get somewhere close to it um so we're using a pretty decent file system called zfs it is very featureful one of the nicest things in our use case is that it has a caching layer um so for our cache we're using ssds this is wolverine and then attached to that is an external jpot so this stands for just a bunch of disks basically there's one two three four five six seven eight nine ten eleven twelve this jbod is attached to wolverine via uh it's called sass what does that say serial attach scuzzy that's the one serial attached because per channel of sas 3 you get 12 gigabits per second the discs themselves can't handle 12 gigabits per second so this jbod is split in half you have database storage over here and in home directories as we're training networks this part of the storage is being read from continuously so when you're training you might have a data set of maybe 100 000 images you read them all once and during that time you're learning parameters for your network but then you do that again and again like 100 times basically so you read the same data 100 times so you don't want to access disks 100 times it's really slow so as the data is being read from db it's being copied 
onto the ssds installed into this machine and these are four terabyte each reading from an ssd you know much faster than spinning disks is that caching handled by the file system or do you have to write software to do the caching so zfs handles the caching automatically it uses actually it's called adaptive replacement cache and it uses two metrics to decide what should be cached what was most recently accessed and what's most frequently accessed and you try to balance this so it's 50 50 for each of them on our database storage we have a cash hit rate of about 60 to 75 depending on how jobs are starting um which means that you know 60 to 70 percent of the time we're not going to disc and waiting for a a you know platter to spin around we can just read the straight away from the ssd these are all configured into a single it's called a raids said two and that means that you could lose any two of these this so you could lose this one or this one the problem with razed too is that it limits the number of operations per second that you can have and that's why we're using it for home storage with db we actually split this again into two and this is one raids zed one and this is another raid z1 and these are striped together to get the same amount of storage as home but much faster so you can handle many more operations per second what it means though is that if you were to lose this disc and this disc you've lost all of this similarly if you lost this disc or this disc then you've also lost all of the data the only thing it can do is you could lose maybe one on this side and one on this side this obviously doesn't sound good but all you have to remember is that these databases are generally downloaded off the web and we have different file servers for storing stuff on a more permanent basis zfs has really nice it's called send receive you can do block level transfers of an entire data set um very quickly so you don't think of them as files you just think of them as 
copying the blocks from disk to it to another server so zfs gives lots of options for backups we've yet to implement them but we're still kind of transitioning over to this new storage so there are two layers of caching there's the l1 arc and l2x this is level one level two when you first read a file it will initially be cached into memory because it has to go through the cpu anyway to to go out of the network but this is kind of quickly uh written over especially when you're handling large amounts of traffic so the next level is level level two so these are written to the ssd this doesn't happen as you read the data but you know when it thinks that it's you know this file is being accessed a lot uh which is probably in terms of training a cnn after the second or third epoch so one epoch is once you've seen all of the data once yeah okay and that's made up like iterations where you might pass anywhere between one and like 30 or even more samples through the network and then yeah get the parameters what does the interface to all of this look like for the user is it just a is a command line thing or is there a gui sort of thing or uh yeah so you submit jobs using a command called sbatch and when you submit your job you specify what you want to run in basically a shell script and in this shell script you define how many gpus you want in particular but optionally also how much memory you require and how many cpu calls you need um is it likely to come back and say no no you're asking for too much it can do yeah so one of the things that can happen is if you ask for too many gpus either more than you're allowed or more than is available it will just sit in the queue either until your quota is increased or we buy a new gp machine with more gpus so you're sitting on a machine in your lab or whatever and you need to get a job done what happens talk us through it well so the user will log into the to the master node i suppose you would call it and submit their job with s 
batch that would then be picked up and distributed to the appropriate compute node by the software it's all kind of automatic and happens under the hood so it's abstracted away from the user so there's not really any um knowledge of what's going on required so we have login node which is called xavier we have several compute nodes so this is like b stroke gambit etc and then the storage over here thanks javi so here's a user at his or her computer and they will log in to javier via i say javier joe says xavier login via ssh and they submit their job when javier is ready it tells a particular compute node that you know i want you to run this job with this many gpus etc as this job is running it needs to access storage and somehow so this goes over a separate network that's why i'm using a different color this will bounce back and forth with packets of data until the job eventually finishes and the result is sent back to javier and then back to the user so this green is 10 gig and then this is the standard university one gigabit per second network and 10 gigabit is enough for those storage needs that you were saying before about not exactly so one of the things we have to do is on the file server to the switch we actually have two links and then to the gpu machines we have one link to each one so these are gpu machines and this is the file server and the way this works is that this is called a lag link aggregation it uses something called lacep which is the link aggregation control protocol and that protocol decides while communicating with the switch exactly what kind of load balancing it should do and it decides you know i'm going to use link one or link two and then at the switch it's it doesn't matter it kind of you know it only has one link to a particular gpu node so it just goes out of that one the biggest problem so far have been the network drivers for the 10 gigabit ethernet cards you had to upgrade it on all machines um and on certain machines even disable 
like this kind of second generation machines we had to disable power management for the pci bus yeah through through our configuration management software we we tend to maintain a fairly static configuration on all of the machines to make things as easy as possible to troubleshoot if necessary and to to upgrade and and so on and just to maintain and administer them in general but yeah we've had to do a few out-of-band patches kernel patches and some other driver installations in order to try and determine the most reliable configuration which we then have rolled out to the remaining machines so yeah i think we're it's it's uh stable now and uh a bit of a learning curve yeah absolutely we've had to learn on our feet and try not to disrupt the uh critical research applications that are happening but my own included yeah but no there hasn't been too much grumbling over over a few little hiccups and uh yeah like i say it's all hopefully plain sailing from here on out what's um what's the os would it be something people have recognized that's running on these yes uh centos seven um patched right yeah so yeah so they're all running latest uh kernel versions they actually are yeah the file so those running freebsd which might be an unusual choice for most people but uh i'm gonna get so much slack in the comments for this but i don't trust zfs on linux yet so yeah i mean even though it's been out for what six years as kind of production ready status we've just felt that perhaps freebsd was slightly more mature and uh a better choice for something that is um underpinning the entire operation you've got to train for a long time and let's not let steve off a hook right there steve over here high value of two high value of one whatever that means the interesting thing about this is we're not performing a classification little endian systems so that's why we call it endianness it all traces back to eggs of lilliput in gulliver's travels\n"