Apple M1 Ultra & NUMA - Computerphile

The Performance of Modern CPU Systems: Understanding Non-Uniform Memory Access

Let's say crossing the link takes 200 nanoseconds. I'm just making a number up here; the point is that it's a longer amount of time. To read a value held in the other block of RAM, we'd have to go across the distributed shared memory link to get the value and then bring it back. So rather than taking 100 nanoseconds, it would take on the order of 300 nanoseconds.

It will take a significantly longer amount of time. So if you build a computer system like this, how quickly a CPU core can fetch an instruction or a piece of data depends on where it is in memory. The core can access it very quickly if it can go directly to the RAM it is connected to, or it takes a relatively long time because it has to go over the shared link and fetch it from the other block of RAM.
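
The made-up figures above can be put into a small Python sketch. Everything in it is illustrative: the two-node layout, the names `LOCAL_NS`, `LINK_NS` and `access_latency_ns`, and the 100 ns and 200 ns values are the talk's ballpark numbers, not measurements of any real machine.

```python
# Toy model of a two-node NUMA machine, using the ballpark numbers from the talk.
LOCAL_NS = 100  # assumed time for a CPU to read RAM attached to its own node
LINK_NS = 200   # assumed extra time to cross the distributed shared memory link


def access_latency_ns(cpu_node: int, mem_node: int) -> int:
    """Return the modelled latency for a CPU on cpu_node reading RAM on mem_node."""
    if cpu_node == mem_node:
        return LOCAL_NS          # direct access to locally attached RAM
    return LOCAL_NS + LINK_NS    # remote RAM: the access itself plus the link


if __name__ == "__main__":
    for cpu in (0, 1):
        for mem in (0, 1):
            print(f"CPU node {cpu} -> RAM node {mem}: {access_latency_ns(cpu, mem)} ns")
```

The same address is reachable from both nodes in this model; only the cost differs, which is the sense in which the access is non-uniform.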

It would still appear to be the same memory system, but we've now got the situation where the access time depends on which CPU is trying to access it. This is what's called a non-uniform memory access (NUMA) system. Originally, non-uniform memory access systems were the domain of high-end cluster systems, SGI-type workstations and things. But these days it has dropped down onto workstation-type machines: some of the AMD Threadrippers and some of the higher-end Intel processors are all NUMA-based systems.

What this means is that if you want to run that CPU at the fastest possible speed, you need to write your software to take into account which CPUs have fast access to which bits of RAM. You put the instructions those CPUs are running and the data they are processing in the block of memory attached to them, and the instructions and data for the other CPUs in the other block of memory. That way they can all access what they need very quickly, and only the small amount of data needed to synchronize things and keep everything working passes over the shared memory network.

Now, you can do it and it works great, but you have to write your software knowing where things are. In fact, if you look, you can find papers and presentations from companies like Netflix, where they're really trying to optimize the performance of their servers to serve videos to users.
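
To make the benefit of placing data next to the CPUs that use it concrete, here is a minimal sketch of the average access time as a function of how many accesses cross the link. The 100 ns and 300 ns figures are the talk's made-up numbers, and the function name and the two example remote fractions (50% and 5%) are assumptions chosen purely for illustration.

```python
# Toy comparison of NUMA-unaware and NUMA-aware data placement.
LOCAL_NS = 100   # assumed latency to locally attached RAM (ballpark from the talk)
REMOTE_NS = 300  # assumed latency to RAM on the other node, via the shared link


def average_latency_ns(remote_fraction: float) -> float:
    """Average memory latency when remote_fraction of accesses cross the link."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS


# Hypothetical workloads: data scattered evenly over both nodes, versus data kept
# next to the CPUs that use it with only synchronization traffic going remote.
unaware = average_latency_ns(0.5)
aware = average_latency_ns(0.05)

if __name__ == "__main__":
    print(f"NUMA-unaware placement: {unaware:.0f} ns per access on average")
    print(f"NUMA-aware placement:   {aware:.0f} ns per access on average")
```

With these assumed numbers the average drops from about 200 ns to about 110 ns, which is why software on such machines is written to know where its data lives.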

They've actually written about it in detail. To optimize the speed of serving videos, they have to take all this into account, so that the CPU the network card is connected to gets the data directly. Otherwise the request has to pass over the shared memory link to another CPU, which passes it on to another one to fetch the data from a hard disk, and so on, before feeding it back to you, and you get everything passing over the slow link all the time. You really have to take into account where things are.

Which brings us back to Apple's marketing buzzword for the M1 Ultra, UltraFusion. What have Apple done with UltraFusion? Well, effectively, they've built a system like this. They've taken two M1 Max chips and glued them together.

So you've got two 10-core M1 Max chips, each accessing its own block of memory, or two blocks each, which is why you can get up to 128 GB on there: because you've doubled the number of CPU cores, you can double the amount of memory they can access. What they've built in the middle, the thing they call UltraFusion, is just a very fast distributed shared memory link between the two.

I think what they've actually done is make the link so fast that the time it takes to go across from one CPU to the other, get the value from the RAM and push it back into the CPU is very short. The latency is so low that it effectively behaves as if it were a uniform memory access system.

It's fast enough that when the CPU requests the data, it gets it before it actually needs it, at which point it doesn't slow the CPU down. So, as programmers, we don't have to worry about where the data is in relation to the CPU cores.

We don't have to track which block of RAM is attached to which core, and things like that, to make things run as fast as possible. We can just write our programs and let the operating system and the design of the hardware sort out the hard problems of executing them as fast as possible.
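
The idea that a fast enough link makes the machine behave as if memory access were uniform can be sketched with two small formulas. This is a toy model and not a description of Apple's hardware: the function names, the notion of a fixed "lead time" between requesting data and using it, and all the numbers in the example are assumptions.

```python
# Toy model: remote/local latency ratio, and the stall seen by a CPU that
# requests data some time before it actually needs it.
def numa_ratio(local_ns: float, link_ns: float) -> float:
    """Remote latency divided by local latency; 1.0 would be perfectly uniform."""
    return (local_ns + link_ns) / local_ns


def stall_ns(fetch_latency_ns: float, lead_time_ns: float) -> float:
    """Time the CPU waits when data requested lead_time_ns before use takes
    fetch_latency_ns to arrive. Zero means the latency is fully hidden."""
    return max(0.0, fetch_latency_ns - lead_time_ns)


if __name__ == "__main__":
    # Slow link (the talk's made-up 200 ns) versus a hypothetical 10 ns link.
    for link in (200.0, 10.0):
        remote = 100.0 + link
        print(f"link {link:.0f} ns: ratio {numa_ratio(100.0, link):.2f}, "
              f"stall with 150 ns lead time {stall_ns(remote, 150.0):.0f} ns")
```

In this model the program only notices the link when the remote latency exceeds the lead time; once the link is quick enough, the stall is zero and the placement of data stops mattering to the programmer.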

Transcript (English captions):

Last week, I think it was, maybe the week before, Apple had one of their usual press conferences and announced their latest, possibly last, version of the M1 chip, the M1 Ultra. One of the things they said as they launched it was that they designed it using two M1 Max chips, basically stuck together, using something called UltraFusion to join them. Now, UltraFusion is just a marketing buzzword; literally all they've got is a high-speed interconnect between the two silicon dies to transfer data between them. But one of the things they said, which was interesting, is that the reason they'd done this was so that you didn't have to write the software in a different way. I thought it was interesting to pick up on that and explain why, if they hadn't made that interconnect fast enough, you would have to write the software in a different way.

Because if you think about it, all they seem to be doing is adding more cores to the CPU, making it a 20-core CPU instead of a 10-core CPU. And you think: well, it's a multi-processor system, and if you've watched the videos we've done previously on multi-processor systems, you're going to have to write the software to split the tasks up over the multiple cores anyway. So why are you not going to have to write things differently with this architecture of chip? I thought we'd have a look at that today.

To understand what Apple's done, we need to go back to basics and think about how a computer actually works, and we'll go with the von Neumann model. I know technically most modern CPUs are modified Harvard architecture, but the von Neumann model is good for what we want to look at. At the center of our system we have the CPU, whatever we want, and that is connected to some memory. I'm just going to write RAM here so it fits into the box; of course some of it would be ROM and other things. The other thing we have in there is the I/O. That's basically the model we use for a computer: the CPU talks to the RAM, where the instructions and data are stored, and it can talk to the I/O to talk to the rest of the world. That's things like your disk controllers, whether solid state or hard disk, your graphics card, your network card.

Now, what happens when we have a multi-processor system? The general way we build multi-processor systems, certainly the ones we use in laptops and desktop computers, is using what's called a shared memory model. Just as before with the von Neumann architecture, we have a single block of RAM, but it's connected not to one CPU now; we'll give it two CPUs. It's connected to a shared bus, and each of those CPUs is connected to that bus. Effectively, that's how you build a multi-processor system. It's a bit more involved: for example, you need some logic here for bus arbitration, so we'll call that the BAL, the bus arbitration logic. You need something to control which CPU can talk to the RAM at any one point.

One thing I need to say here is that I've drawn this as the CPU talking directly to the RAM. If you watched the video I did many years ago on CPU caches, you'll know you need to have a cache here, because only one CPU can ever talk to the RAM at a time. If there's no cache and this CPU tries to talk to RAM, this one can't; if that CPU tries to talk to RAM, the other one can't at the same time. It would effectively serialize the operation, so you wouldn't get any speed-up. You need a cache in there. That leads us to the first part of the problem: only one CPU can access the RAM at any one point.

Now, if we've got a cache in our system, and I'm going to draw that as a red line that sits between the CPU and the RAM, that's not a problem, because as the CPU accesses data it stores a local copy in its cache. When it needs to fetch that data or those instructions again, it can fetch them from the cache and not access the RAM. So that's absolutely fine. Most of the time we want the CPUs to be satisfying their data and instruction fetches from the cache and only occasionally going to the RAM, so that whenever one of the CPUs does need to go to main memory to fetch a value, the memory is unlikely to be in use by the other. Occasionally you'll get the situation where they both try to access a value in main memory at the same time, and that's why you have the bus arbitration logic, to say: this CPU is going to fetch its value, then that CPU is going to fetch its value.

So we can build a shared memory multi-processor system like that, I'm going to say relatively straightforwardly. There's a lot involved, but that's the basic idea of what's going on. And we can extend that to have more CPUs. We can just add another CPU in up here, with its cache as well, so we could have a three-CPU system. Normally you'd probably go up to four, but I've run out of paper. You can extend that to as many CPUs as you like, except there is a slight issue.

We said there are occasions where one CPU might be trying to access the memory at the same time as another CPU. Hopefully we can build the cache system, we can load a bit more data than we need each time we fetch things, and so on; we can build an intelligent memory system so that the probability of that happening is relatively low. But if we add more and more CPUs onto the same shared memory bus, then we're going to end up with more chance of a collision, of two CPUs trying to access memory at the same time. The caches on each CPU mitigate that to some extent, in that they reduce the probability of two things trying to access memory at the same time. But it's a bit like the old birthday problem, you know, the sort of question you ask: if you've got a class of schoolchildren, what is the probability that two of them share a birthday? It turns out it's quite likely once you get above about 20 or so children in the class. The same thing applies here: as you add more CPUs, the chance that two of them will try to access memory at the same time increases. So this will scale, but only up to a point. Once you get past a certain number of CPUs, you'll find it's more likely than not that two of them will be trying to access memory at the same time.

So does that form a limit? Is there a limit on how many CPUs we can have working together in a multi-processor system? Well, not as such, because there's another way we can design a multi-processor system. What we've drawn so far is what's known as a uniform memory access system, and the reason it's known as that is that for any location in RAM, any of these CPUs can access it at the same sort of speed. It doesn't matter whether the access comes from CPU 1 up here or CPU 3 down here; it'll take the same amount of time to access the value in that memory location. Different memory locations may have different speeds: your ROM might be slower than the RAM, you may have things mapped in there that are slower still, and so on. But for any particular memory location, each CPU can access it in the same time, all within the same nanosecond ballpark, so it makes no difference.

In reality, as we said, that will scale up to a certain number of CPUs, but if we want to take it beyond that, then we need to change the system. We need to build a system that no longer has uniform memory access. Rather than each CPU being able to access a given memory location at the same speed, how long it takes to access the data value there depends on which CPU core is trying to access it. It might be that for one CPU core it takes, I don't know, let's say 100 nanoseconds, just picking a time off the top of my head, but for another CPU core it takes 200 nanoseconds. They're just ballpark times, not meant to be the right order of magnitude; they just show there's a difference between the two.

Okay, let's have a look at how we build a system like that. What we're talking about is what's referred to as a non-uniform memory access system, or NUMA for short. So how does that differ? It starts off in the same way. We have a block of RAM, I'm going to turn the diagram around, and that is connected to our CPUs just as before. I'm missing out the caches and the arbitration logic from this diagram just for simplicity. So this looks relatively similar to what we had before: we've got some RAM and some CPU cores sharing access to it. No difference there. With a NUMA system, though, we also have some other RAM that's part of our system, connected to a different set of CPU cores over here.

At this point you've effectively got two computer systems: these CPUs can access this RAM, and these CPUs can access this RAM. The difference in the NUMA system is that there is a link between the two systems, and you've got a distributed shared memory system. Think of it like a sort of network, but it's often done at the CPU level and, on some systems, even between cores. What this means is that as far as the running programs are concerned, there is one block of memory. If this was 16 gig and this was 16 gig, the programs would see 32 gigabytes. They're not separate blocks of memory; they're seen by the programs as one block of memory.

But the difference is, if we've got a program running on this CPU over here, it's got direct access to this block of memory here. Let's say it takes, I don't know, 100 nanoseconds again to access memory. If it wants to access memory in here, it will take 100 nanoseconds to access that value. But if the data it's trying to access is in this memory over here, a CPU over here could access it in 100 nanoseconds, but this CPU has got to go over the distributed shared memory connection from this set of RAM and CPUs to that set of RAM and CPU cores, and that takes a significant amount of time. It would take 100 nanoseconds over there to get from the CPU to the RAM, plus the link: let's say this is 200 nanoseconds. I'm just making a number up here; it's a longer amount of time. I'm making these numbers up off the top of my head, so don't take them as anything other than showing that it takes longer to go from here to over here. So we'd have to go across the distributed shared memory link to get the value and then bring the value back. Rather than taking 100 nanoseconds, it would take on the order of 300 nanoseconds. It will take a significantly longer amount of time.

So if you build a computer system like this, we have the situation where, depending on where an instruction or data is in memory, this CPU core can either access it very, very quickly, if it can go directly to the RAM it's connected to, or it takes a relatively long time, because it has to go over the shared link and fetch it from the other block of RAM over there. It would still appear to be the same memory system, but we've now got the situation where the access time depends on which CPU is trying to access it. So we have what's called a non-uniform memory access system.

Now, originally non-uniform memory access systems were the domain of high-end cluster systems, SGI-type workstations and things. But these days you've seen it drop down onto workstation-type machines: some of the AMD Threadrippers and some of the higher-end Intel processors are all NUMA-based systems. What this means is that if you want to run that CPU at the fastest possible speed, you need to write your software to take into account which CPUs have fast access to which bits of RAM. You put the data those CPUs are processing and the instructions they're running in that block of memory, and the instructions and data being executed on these CPUs in this block of memory over here, so they can all access it very, very quickly, and you only have a very small amount of data, which is needed to synchronize things and keep things working, passing over the shared memory network.

Now, you can do it, and it works great, but you have to write your software knowing where things are. In fact, if you look, you can find papers and presentations from companies like Netflix, where they're really trying to optimize the performance of their servers to serve the videos to you. I'm sure YouTube's doing the same as well, but Netflix have actually written about it. To really optimize the speed of serving the videos to you, they have to take all this into account, so that the CPU the network card is connected to gets that data directly. Otherwise it has to pass it over the shared memory link to another one, which then passes it over to another one to fetch the data from a hard disk, and so on, and feed it back to you, and you get everything passing over the slow link all the time. You really have to take into account where things are.

Which brings us back to Apple's marketing buzzword for the M1 Ultra, UltraFusion. What have Apple done with UltraFusion? Well, effectively they have built a system like this. They've taken two M1 Max chips and glued them together. So you've got two 10-core M1 Max chips, each accessing their own blocks of memory, or two blocks each, which is why you can get up to 128 gig on there: because you've doubled the number of CPU cores, you can double the amount of memory they can access. And what they've built in the middle, the thing they call UltraFusion, is just a very, very fast distributed shared memory link between the two.

I think what they've actually done is just made it so fast that the time it takes to go across from one CPU to the other, get the value from the RAM and push it back into the CPU is so short, the latency is so low, that effectively it behaves as if it were a uniform memory access system. It's fast enough that when the CPU requests the data, it gets it before it actually needs it, at which point it doesn't slow it down. So it's a nice system, because it means that as programmers we don't have to worry about where the data is in relation to the CPU cores, which RAM is attached to which core and things, to make things run as fast as possible. We can just write our programs and let the operating system and the design of the hardware sort out the hard problems of executing them as fast as possible.
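
The birthday-problem comparison in the transcript can be checked numerically, and the same style of calculation gives a rough feel for contention on a shared bus. The sketch below is only an illustration: it assumes birthdays are independent and uniform over 365 days, and the bus model (each CPU independently needing main memory in a given cycle with some fixed cache-miss probability) and both function names are assumptions, not anything stated in the talk.

```python
# Birthday-problem arithmetic and a toy estimate of shared-bus collisions.
def shared_birthday_probability(children: int, days: int = 365) -> float:
    """Probability that at least two of `children` share a birthday,
    assuming birthdays are independent and uniform over `days`."""
    if children > days:
        return 1.0
    all_distinct = 1.0
    for i in range(children):
        all_distinct *= (days - i) / days
    return 1.0 - all_distinct


def bus_collision_probability(cpus: int, miss_probability: float) -> float:
    """Probability that two or more of `cpus` want main memory in the same
    bus cycle, if each independently misses its cache with miss_probability."""
    if cpus < 2:
        return 0.0
    none = (1.0 - miss_probability) ** cpus
    exactly_one = cpus * miss_probability * (1.0 - miss_probability) ** (cpus - 1)
    return 1.0 - none - exactly_one


if __name__ == "__main__":
    for n in (10, 20, 23, 30):
        print(f"{n} children: {shared_birthday_probability(n):.3f}")
    for n in (2, 4, 8, 16, 32):
        print(f"{n} CPUs, 10% miss rate: {bus_collision_probability(n, 0.1):.3f}")
```

Under these assumptions the shared-birthday probability passes one half at 23 children, and the collision probability rises steadily as CPUs are added to the bus, which is the scaling limit the transcript describes.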