AI and Unified Memory Architecture: Is It in the Hopper, or Long on Promise, Short on Delivery?

**Nvidia's Data Center Dominance and the Future of Computing**

The world of computing is evolving rapidly, and Nvidia is driving much of that change. Its recent data center advancements have many wondering what comes next. One topic that has sparked particular interest is tiered memory: pairing a large pool of conventional DRAM with a smaller pool of high-bandwidth memory (HBM), as seen in Intel's HBM2E-equipped Xeons, AMD's MI300A, and Nvidia's Grace Hopper superchip, which couples up to 512 GB of LPDDR5X on the Grace CPU with up to 96 GB of HBM3 on the Hopper GPU.

In practice, though, the results so far have been underwhelming. As one commentator put it, HBM3E in a notebook today would simply not feel amazing, in much the same way that Intel's HBM2E Xeons were "not really amazing for most things." Raw bandwidth on its own does little when the software stack cannot yet take advantage of memory tiers.

Despite this, Nvidia has been working hard to ensure that developers never feel the changes happening under the hood with Blackwell and its next-generation platform. Keeping the programming model stable, with the hardware changes transparent to developers, is a large part of its software moat. As the same commentator warned, "if they fumble this in the slightest, they're going to lose some of their grip on their development ecosystem."

So what exactly had to change under the hood? Nvidia's next-generation platform reworks the plumbing to address latency and scalability while still presenting a familiar programming model. That ties into the promise of Grace Hopper's unified architecture: the GPU sees both its own fast on-package HBM3E and the much larger, slower pool of LPDDR5X attached to the Grace CPU, with the software stack deciding what lives where.
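To make that concrete, here is a minimal CUDA sketch of the developer-facing abstraction: one managed allocation that both CPU and GPU can touch, plus optional hints about where its pages should live. The kernel, sizes, and single-device setup are illustrative assumptions, not anything taken from Nvidia's own materials.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel: scale a vector in place.
__global__ void scale(float *x, size_t n, float a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 24;             // ~16M floats, an arbitrary example size
    float *x = nullptr;

    // One allocation visible to both CPU and GPU. The runtime decides whether
    // pages sit in system DRAM or GPU HBM and migrates them on demand.
    cudaMallocManaged(&x, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;   // first touched by the CPU

    int dev = 0;
    cudaGetDevice(&dev);
    // Optional hints: prefer the GPU's fast memory for this buffer and
    // prefetch it there so the kernel is not stalled by page migration.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(x, n, 2.0f);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);           // readable again from the CPU
    cudaFree(x);
    return 0;
}
```

The point is the shape of the API: the developer never writes an explicit copy between "CPU memory" and "GPU memory," which is exactly the kind of transparency the platform promises.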

The concept of "aggregated" hardware, where multiple GPUs work together in a distributed pool with low latency, is also an exciting development. This allows users to tap into the collective power of multiple GPUs without duplicating information across devices. For developers, this means a more robust and efficient platform for creating AI-powered applications.

Meanwhile, the Open Compute players, hyperscalers such as Meta, Google, and Amazon, have specified their hardware down to the chassis and handed AMD some significant design wins. The cooperation largely stops at the hardware, though. As the same commentator noted, the Open Compute players are "not wanting to work together" on software as much as they could, because each guards its own stack as an individual competitive advantage.

That hesitation puts AMD and the other Open Compute participants at a disadvantage. The gap is not in memory technology, since AMD's Instinct parts carry their own stacks of HBM; it is in software that hides memory tiering from developers the way Nvidia's platform does. CXL-attached memory could bridge part of it by expanding capacity beyond what fits in on-package HBM, but qualifying such devices depends on partners Nvidia spends a great deal of money with. AMD has already scored major wins through Open Compute, and ChatGPT reportedly runs on AMD Instinct systems, but whether that is enough to change the tide remains uncertain.

One conspicuous absence at Computex was AMD's MI300A. The MI300A is AMD's data center APU: x86 cores integrated directly alongside the GPU, fed by HBM on the same package, with no DIMM slots for adding conventional memory. It has generated genuine excitement among developers, yet it was barely mentioned at the show, leaving many to wonder whether successors such as the MI325X will simply overshadow it.

The more provocative question is what unified memory means for consumer devices. It makes little sense to ship a laptop with 32 GB of system DRAM plus 8 GB of VRAM as separate islands; why not, as the commentator asks, put it all together and let the operating system decide to "use the high-speed VRAM for the game" the user is playing, then hand that memory back when the game closes? Operating systems are starting to grow the plumbing for this: recent Linux kernels understand memory performance tiers and NUMA nodes that contain memory but no CPUs, which is exactly the shape that on-package HBM or CXL-attached pools present to software.
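Here is a small host-side C++ sketch (assuming a Linux machine with libnuma installed; the node number used for the explicit allocation is an arbitrary placeholder) that walks the NUMA nodes the kernel exposes and flags the memory-only ones, which is roughly how tiered pools become visible to ordinary software:

```cpp
// Build with: g++ -O2 numa_tiers.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        printf("kernel or hardware exposes no NUMA information\n");
        return 1;
    }

    // Walk every node the kernel reports. On tiered systems, HBM or
    // CXL-attached memory can appear as nodes that have RAM but no CPUs.
    int max_node = numa_max_node();
    for (int node = 0; node <= max_node; ++node) {
        long long free_bytes = 0;
        long long total = numa_node_size64(node, &free_bytes);
        if (total <= 0) continue;                        // no memory reported here

        struct bitmask *cpus = numa_allocate_cpumask();
        numa_node_to_cpus(node, cpus);
        bool has_cpu = false;
        for (unsigned i = 0; i < cpus->size; ++i)
            if (numa_bitmask_isbitset(cpus, i)) { has_cpu = true; break; }
        numa_free_cpumask(cpus);

        printf("node %d: %lld MiB (%lld MiB free), %s\n",
               node, total >> 20, free_bytes >> 20,
               has_cpu ? "has local CPUs" : "memory-only tier");
    }

    // An application (or runtime) can then place a hot buffer on a specific
    // tier explicitly; node 0 here is just a placeholder choice.
    void *buf = numa_alloc_onnode(64ull << 20, 0);
    if (buf) numa_free(buf, 64ull << 20);
    return 0;
}
```

Presumably Nvidia does something similar, and more sophisticated, inside its proprietary stack; the notable part is that the basic building blocks now live in mainline Linux where anyone can use them.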

Finally, unified architecture looks set to remain a development focus well beyond Nvidia, with the industry exploring ways to treat CPUs, GPUs, and memory as a single pool of resources. As the commentator hinted, "there's a certain company out there that is doing some really wild stuff in terms of putting a pool of x86 resources together in a high-speed cluster."

That points toward the idea of a server without a BIOS: because the BIOS is too slow, you simply turn the CPU on, feed it the AGESA, add it to the network, and you are done. It is an exciting vision of where computing could go, and one that the unnamed company above is reportedly already pursuing.

**The Future of Computing: Where Will It Take Us?**

As we look to the future of computing, it's clear that Nvidia is leading the charge. With Grace Hopper, HBM3E, and a rack-scale NVLink fabric that presents itself to developers as a single GPU, the company keeps pushing the boundaries of what's possible. Challenges lie ahead, including competition from AMD and pressure from the Open Compute hyperscalers, but Nvidia remains the dominant force in the industry.

The concept of "aggregated" hardware, where multiple GPUs work together in a distributed pool with low latency, is an exciting development that could revolutionize the way we think about computing. By tapping into the collective power of multiple GPUs without duplicating information across devices, developers can create more robust and efficient applications.

Meanwhile, unified memory architectures promise a new level of integration in consumer devices, letting the operating system hand its fastest memory to whichever workload needs it most. That consumer-driven software work may well trickle up to the data center rather than the other way around.

As we look ahead, it's clear that computing is changing quickly. Will we see a broad shift toward rack-scale pools of hardware? Or will unified memory architecture be the key to unlocking new levels of performance and efficiency?

Only time will tell, but one thing is certain: the future of computing is bright, and Nvidia is at the forefront of this revolution.

"WEBVTTKind: captionsLanguage: enhey everybody I'm back from computex and uh I managed to sneak out an MI 300 not really but I mean kind of sort of let's let's talk about the Mi 300a this is not a 300a this is 300X 300a 300X x86 cores make the a the a the a this is not actually a video about the Mi 300a this is a video about what I saw with nvidia's Grace Hopper and unified system architecture and maybe some riding on the wall for software developers this is really about the software experience or maybe the ecosystem experience but there's kind of a lot of things we can touch on and I have questions but also I want to share my thoughts so let's dive in the Mi 300a this chip is a bit unique because AMD has integrated x86 cores directly into the GPU and it uses hbm memory so it's a it's a bold move and and I see it as part of a broader Trend toward unified system architecture or maybe it's going to be an experiment that goes down in history as legendary because it is absolutely a legendary piece of Hardware but never actually takes off we can see similar efforts from Apple with their M series I it's an arm architecture it's an arm CPU under the hood but maybe x86 will win in the end I don't know I mean I was surprised that nvidia's Grace Hopper super chip is more or less the same system architecture as an x86 server loaded with a bunch of gpus if you look at the the diagram from Nvidia which we're going to refer back to a bunch in this video okay I mean it's really not that different it's not a dramatic change I think some of the the changes that we see from the m300a May indicate that we're nearing an inflection point in the industry and the more I peel back the layers on Grace Hopper the more I see evidence that there is some architectural inflection point stuff like I'm talking about in Grace Hopper but the block diagram doesn't make that super obvious what I'm talking about is how we approach Computing at scale and really how developers are expected to interface with that I mean there's a little bit of a catch here the hardware Innovations are here but the software is maybe the part that's not quite here yet or maybe it's in a transitional phase or maybe some of the server design papers over some of the architectural changes and this is all things that I find very fascinating you really need to think outside the box with system architecture and disaggregation of hardware for our infrastructure as code future but I think some of what has driven nvidia's choices here and some of the larger choices in the industry even from open compute is around the software development experience and making it to where developers don't have to worry too much about it which is also not great because a lot of performance can be left on the table by that like developers get it working they don't like developers that can also do optimization are much more rare than just developers so for those of you in software development pay attention I think or uh ask me questions or give me stuff that I'm not necessarily thinking about this trend toward unified system architecture or unified memory architecture could significantly impact your work and what I'm talking about is making it so abstract that what is GPU memory and what is CPU memory basically doesn't matter uh some of that in Grace Hopper doesn't necessarily have to do with the fact that it's not technically a unified system memory architecture I mean okay actually on the GPU side with the Envy link thing you could say that there is a unified architecture 
thing going on there but just by virtue of the fact you can link 256 gpus together but that's not the same thing as system memory and GPU memory being linked together system memory is slow GPU memory is fast and as a developer if I can have both slow and fast memory that's what makes the computer go that's the whole like having small amounts of insanely fast memory and large amounts of slower memory in tears is what drove the Entire Computer Revolution since the dawn of time and we have an opportunity to do that here and it is in nvidia's Hardware ah I'm getting ahead of myself bottom line nvidia's stock is so insane precisely because of the promise that the system architecture is not going to change that much and the promise is to developers that developers are going to experience a software ecosystem that from their perspective basically won't change and this rack scale ball of wires this cabinet signed by Jensen himself that we saw at computex is going to present to you the developer as a single GPU we saw these cabinets from gigabyte and MSI and azck and super micro and metac literally anybody who has the capability to make a Server Chassis is making bank with Nvidia because they cannot make these fast enough this is their system architecture all the way from the Grace Hopper super chip all the way up to the EnV link interconnects and even things Beyond rack scale it really is wild like how much money Nvidia is spending having these things built as quickly as possible now check out the block diagrams provided by Nvidia Developers for the Grace Hopper super chip the grace CPU supports up to 512 GB of LP ddr5x while the hopper GPU can handle up to 96 GB of hbm3 the interconnect speeds here are impressive the GPU can connect directly to the gray CPU or up to a network of 255 other Hopper gpus at 900 gab per second however the GPU memory interface that's the star of the show 3 terabytes per second that's that hbm 3E notice the CPU interconnect is just over a tenth of that even our ddr5 memory interface is Plucky by comparison now also take a look at the physical stuff for this system that I saw at computex they're moving all the complexity onto the CPU and GPU packages themselves I mean the super chip is relatively small here it's tiny this this complexity is all around the interface and this system does have a mix of lpddr5 and and hbm 3E slower system memory and faster system memory that's sort of what I'm talking about contrast this to Apple's M architecture M1 M2 M3 M4 those chips have on package memory but largely Apple hasn't taken advantage of the performance benefits that can be conferred from having on package memory it's mostly power savings that Apple has focused on and yet as I'm sure that the comment section will attest the folks that shelled out for the 96 gigabyt of memory and Beyond Apple laptops oh you might have overpaid there sorry all able to run large language models because of the unified memory architecture CPU memory GPU memory there's not really a distinction because it really physically is the same memory and so Apple's arm really was B basically able to just say hey uh let's just not move things around it's not terribly fast but it does work and it is generally a better experience on PC but it's a unified memory architecture nothing has to move between memory because the memory is literally the same and part of the reason the performance so good is because Apple's memory controller genuinely is really good it doesn't have a bottleneck and the caching system is 
quite good on the Apple arm M architecture and so it is a pretty reasonable experience even given that it's only lpddr5 and best case scenario you're not going to be anywhere near the bandwidth of like an hbm type solution so we're talking about memory bandwidth that is at best you know 25% less than what you would experience on the Nvidia CPU arm side of it now Intel's lunar Lake and amd's Hawk Point are trying to make this unified experience more feasible on PC mainly driven by npus like their neural Processing Unit it's still a work in in in in progress of course but part of the idea with resizable bar and making the mpu driver architecture based on wddm in the case of Intel is to try to better leverage access to data in the context of a GPU operation or CPU operation or npu operation to just make it a memory operation because it's all the same memory at the end of the day when we're talking about a portable system at least one that doesn't have a discreet GPU now let's let's let's bring that back to the Mi 300a it's x86 cores but the memory is on package and it's hbm 3E so the x86 cores are fed by this absurdly fast hbm 3E memory but x86 cores aren't really designed for that uh Patrick it served the home did a really nice expose on the Intel xeons that have hbm2e he got his hands on those and was able to do a deep dive and honestly I was expecting a lot more insane killer stuff to come from those it hasn't really materialized you could run those zons with only the hbm2e memory or with interner leaving or you could take over in software and have the software do caching and fun stuff I'm sure there's specialty software out there that now depends heavily on those CPUs but in terms of like the entire world is beating down your door to get those CPUs not so much and it would seem to be the case also with the Mi 300a now the Mi 300a does it seem the Mi 300a systems are disadvantaged with no dims there's no where to physically add memory I mean that might be a disadvantage but also cxl might bridge the gap and so this is a little bit of a background to say hey cxl could thread the needle for AMD customers and Nvidia customers both now Nvidia is not going to like that cxl can thread the needle and so this is the part you should pay attention to but I think that cxl as a way of expanding memory Beyond hbm3 could also be an interesting thing for AMD to do strategically like if AMD enabled better cxl functionality it could also kneecap Nvidia in a way that they frown upon let me explain cxl or compute express link has been in development for a while it's it's it's really cool in that it enables whole new branches of computer science for managing information and continuity in the case of of failure you can have a cxl device shared among multiple physical hosts the database people were really excited about this for being able to use cxl devices for assd compliance and transaction log uh save and and replay and it really does unlock some new computer science there because you're assured that the transactions make it off of system to another system but it's also low latency high speed high coherency and amd's done a lot of strategic Acquisitions for interconnect companies fpgas and you know pensando and all of this goes together but is getting a little beyond the scope of what I want to talk about in this video just know the cxl is highp speed in terms of it looks like a pcie Gen 5 device it's the the latency of pcie Gen 5 and so it's highs speed but when I say high speed you should think of high 
speed in the context of that Grace Hopper diagram where the numbers on on the left you're like okay but then you got three terabytes per second on the right it's like that's a whole other ball game in terms of speed what if there existed a cxl device that would allow you to expand the hbm memory okay yeah I mean the pcie speed is basically the same as a memory controllers you're not really at a terrible speed disadvantage when we're talking about Native ddr5 memory channels you're just at a dis disadvantage when we talk about hbm2e what if this is a way for NVIDIA customers to expand their vram that could be interesting that was another thing that I looked for at computex now keep in mind that Nvidia offers different configurations of Grace Hopper you can get dual Hopper no CPUs you know but it seems like in nvidia's architecture they're saying yeah 500 gabt of you know lowcost ddr5 to just shy of 100 gigabytes of of hbm 3E that's the good ratio we can feed it like that set of memory tiering we'll handle that in software developers won't have to worry about it it's going to make a lot of sense if Nvidia can get done in 96 GB of hbm 3E what will require AMD to use 128 or 288 gab of hbm 3E that is actually a competitive advant for NVIDIA but I think the reality is that the customers especially the open compute customers just want an absurd amount of the fastest memory you can get because all other variables don't matter that much which is interesting now the one shining star at computex in this thread of thoughts rattling around my brain was fison and their adaptive solution this is not Dam and it's not cxl they're using EnV pcie over nvme gpus can talk directly to nvme some of the plumbing and stuff is there for that on the Nvidia side as well and with fison adaptive solution they're able to take single level cell nand Flash and basically use it as a second tier of vram even with Nvidia based gpus so if you need to do a job that requires a terabyte of vram but you've only got four A1 100s that have you know 320 gigabytes of of memory you can make up the difference by throwing in some fison high performance nvmes and fison software and the software will be like you know ptor or whatever will be able to run and get that done like as if you have over a terabyte of vram this is a really interesting product so it can be done the other thing to look for at an operating system level is does your operating system have a concept of things like you have fast memory and slow memory and just a couple of years ago the answer to that was no but the Linux kernel has actually gotten support for more of these memory performance tiar not only that they've gotten really good support for things like Numa noes that don't have local compute but only have memory or that have local compute but no memory because that was the thing with 2000 series thread reper that's that's that's a video for another day but the fact that Linux support for that has become first class tells me that there are enough customers out there that are working on this kind of thing that yeah we could actually see systems that take more advantage of fast hbm 3E and slow ddr5 presumably Nvidia has that in their software stack but is proprietary probably and not open source and blah blah blah so it's really nice to see that on the Linux side because people will be able to take better advantage of that that also means the cxl devices plus an MI 300a should just work out of the box because those Zen cores should support cxl maybe it's an IOD die 
kind of a thing but if there was a you know one tbte ddr4 cxl module out there then boom that may be you're you're immediately good to go on your Mi 300a based system and you don't have any of the uh disadvantages of being locked into only hbm 3E memory don't know I think also based on the comments that I heard at compex with Nvidia spending so much money with so many partners I think those if Nvidia says M I don't know if we want to qualify a cxl device on an Nvidia based solution sorry um that could negatively impact how quickly we see a product like that come to Market that's Nvidia compatible meanwhile it's an amd's advantage to bring such a product to Market not only for the Mi 300a if they're interested in the Mi 300a but also to ensure that less Nvidia gpus get old possibly maybe in some parallel universe I mean okay possibly the other variable here is open compute so open compute is you know meta Facebook Amazon they dictate the hardware because all the hardware companies were playing games with how you were selling lots of Hardware to hyperscalers and so they've gotten to the point where they specified down to the chassis their Engineers are very good they're doing a lot of optimization the thing that I'm worried about is that I haven't seen quite as much uh involvement of open compute on the software side of it I'm sure that the software Engineers are are Whispering into the hardware Engineers ears no doubt about that but if you look at the meeting minutes and some of the open compute gett togethers uh Google and meta and Amazon are very very uh cards close to the chest when they're talking about their software stack and what their software needs in terms of enablement I'm sure that they share with Partners like AMD hey this is the software they we're working on under the hood under an N of course but that is not something that is really super openly discussed in those open compute meetings and I think that's putting the open compute members at a disadvantage because you possibly Miss Solutions like oh yeah we could just use cxl to to expand their vram space maybe or you know some other pcie type storage to expand their vram space maybe and solve some problems that we have immediately versus waiting for the next Hardware engineering cycle we also haven't really talked about how things that are far away increase latency and latency management also enters the equation here I mean if you look at the physical design of nvidia's racks they're putting all the fabric communication gear in the middle of the rack that's not for aesthetic reasons that's because we need to minimize the uh the signal propagation delay problems because speed of light it's real so yeah uh latency also factors in here and making the fabric as low latency as possible and there's there's some stuff that goes with that cxl has some stuff in it for hiding latency and doing coherency after the fact which is again black magic in terms of computer science but uh it's exciting because it's also a new computer science so anyway to wrap this up this is something that I wanted to be on your radar unified system architecture but also like software that can properly leverage fast and slow memory is a thing and software that can properly leverage fast slow memory unlocks a whole new set of possibilities without having to have a terabyte of hbm 3E on every GPU I don't think we really need that cuz we can't process that that quickly but this kind of is an inflection point and the inflection point is happening on mobile because 
unified system architecture on mobile saves power and just makes sense and so it's weird because For the First Time instead of data center Technologies trickling down to the consumer there may be some consumer software technologies that Trickle up to the developers running in the data center at least in terms of taking full advantage of a unified system architecture and then maybe it would make sense to have systems running on hbm 3E because right now hbm 3E in like a notebook you would expect that to be amazing but it's just not it's just not in the same way that those zons with hbm2e was not really amazing for most things it's just not it's just not most people don't realize that Nvidia has been working double plus overtime to ensure the developers don't actually feel the changes under the hood that exist with Blackwell and their next Generation platform that also ties into that promise that I was talking like line go up as long as Nvidia makes all of this transparent to the end user and that is 90% of the reason that Nvidia has gone up the way that it has their software moat if they fumble this in the slightest they're going to lose some of their grip on their development ecosystem because under the hood they had to make some changes in order to better address latency and scalability and everything else and Nvidia sees the solution as lots of GPU in a lot in a distributed pool with very low latency and so you get the best of all worlds in terms of an an enormous hbm 3E memory space because you don't have information duplicated across gpus and and everything else outside of Nvidia in the broader software ecosystem I think the rest of the players do still need to to play a little bit of catchup here and I think part of that is because the open compute players are not wanting to um work together all that much on the software because they want to maintain their their individual competitive Advantage they're they're really just not eager to share their software special sauce and this maybe puts AMD and other participants in open compute at a disadvantage because does it mean that that AMD has to solve these software problems and does it have to solve these software problems in a way that thread the needle that is not going to piss off their Enterprise customers I mean the Mii 300a could be the most amazing interesting way to run their Ai workloads and AMD has gotten some major wins through open compute with chat GPT I mean you you realize the chat GPT is running on Instinct systems from from AMD I honestly don't know if cxl would help with Mi 300a adoption it seems like Mi 300a should be you know something that's really Innovative but we didn't hear anything about it at computex and it could be that the Mi 325x or you know the Mi 300 successors are just going to overshadow the Mi 300a I mean nvidia's design with Grace Hopper is basically the traditional we have a a crap load of of ddr5 on tied to the CPU and we've got the the really high speed hbm 3E and the CPU is going to have to move things in and out and that's just the overhead that we're going to have to deal with maybe that's what it looks like in the real world but consumer devices that have one block I mean you're it doesn't make sense to have a laptop with 32 GB of dam and 8 gabt of vram like you should just put all that together and the operating system should look at it and say okay I should use the highs speed vram for the game the person is playing and when the person is not playing a game then we can use the high-speed vram 
for the operating system like that should just be part of the hardware architecture but that is way way easier said than done but it sure does look like consumer devices are going to drive some of that Innovation that may trickle back up to the Enterprise and make things a little easier on the Enterprise when we step back a little bit and we talk about you know disaggregated hardware and you have a pool of compute and is it CPU type compute is it GPU type compute what is your what are your containers needs I think my next step there's a certain company out there that is doing some really wild stuff in terms of putting a pool of x86 resources together in a highspeed cluster how about a server without a bios because the BIOS is too slow turn the CPU on feed it the agiza add it to the network and done that's that's a future after my own heart and likely a future video I've got to get out there to go see them soon and I think that some of the threads that I've started to pull on in this video uh we can pull on in some different ways in that video so I'm trying to work that out and be sure to look for that but yeah just some thoughts on where was the Mi 300a at computex but also how is NVIDIA presenting to developers something that's like a unified architecture but under the hood is in fact not and and is that a portant of things to come for our unified arm architecture future where the GPU does have fast and slow tiers of memory I don't know I'm one this level one I'm signing out you find me in the level one formshey everybody I'm back from computex and uh I managed to sneak out an MI 300 not really but I mean kind of sort of let's let's talk about the Mi 300a this is not a 300a this is 300X 300a 300X x86 cores make the a the a the a this is not actually a video about the Mi 300a this is a video about what I saw with nvidia's Grace Hopper and unified system architecture and maybe some riding on the wall for software developers this is really about the software experience or maybe the ecosystem experience but there's kind of a lot of things we can touch on and I have questions but also I want to share my thoughts so let's dive in the Mi 300a this chip is a bit unique because AMD has integrated x86 cores directly into the GPU and it uses hbm memory so it's a it's a bold move and and I see it as part of a broader Trend toward unified system architecture or maybe it's going to be an experiment that goes down in history as legendary because it is absolutely a legendary piece of Hardware but never actually takes off we can see similar efforts from Apple with their M series I it's an arm architecture it's an arm CPU under the hood but maybe x86 will win in the end I don't know I mean I was surprised that nvidia's Grace Hopper super chip is more or less the same system architecture as an x86 server loaded with a bunch of gpus if you look at the the diagram from Nvidia which we're going to refer back to a bunch in this video okay I mean it's really not that different it's not a dramatic change I think some of the the changes that we see from the m300a May indicate that we're nearing an inflection point in the industry and the more I peel back the layers on Grace Hopper the more I see evidence that there is some architectural inflection point stuff like I'm talking about in Grace Hopper but the block diagram doesn't make that super obvious what I'm talking about is how we approach Computing at scale and really how developers are expected to interface with that I mean there's a little bit of a catch here 
the hardware Innovations are here but the software is maybe the part that's not quite here yet or maybe it's in a transitional phase or maybe some of the server design papers over some of the architectural changes and this is all things that I find very fascinating you really need to think outside the box with system architecture and disaggregation of hardware for our infrastructure as code future but I think some of what has driven nvidia's choices here and some of the larger choices in the industry even from open compute is around the software development experience and making it to where developers don't have to worry too much about it which is also not great because a lot of performance can be left on the table by that like developers get it working they don't like developers that can also do optimization are much more rare than just developers so for those of you in software development pay attention I think or uh ask me questions or give me stuff that I'm not necessarily thinking about this trend toward unified system architecture or unified memory architecture could significantly impact your work and what I'm talking about is making it so abstract that what is GPU memory and what is CPU memory basically doesn't matter uh some of that in Grace Hopper doesn't necessarily have to do with the fact that it's not technically a unified system memory architecture I mean okay actually on the GPU side with the Envy link thing you could say that there is a unified architecture thing going on there but just by virtue of the fact you can link 256 gpus together but that's not the same thing as system memory and GPU memory being linked together system memory is slow GPU memory is fast and as a developer if I can have both slow and fast memory that's what makes the computer go that's the whole like having small amounts of insanely fast memory and large amounts of slower memory in tears is what drove the Entire Computer Revolution since the dawn of time and we have an opportunity to do that here and it is in nvidia's Hardware ah I'm getting ahead of myself bottom line nvidia's stock is so insane precisely because of the promise that the system architecture is not going to change that much and the promise is to developers that developers are going to experience a software ecosystem that from their perspective basically won't change and this rack scale ball of wires this cabinet signed by Jensen himself that we saw at computex is going to present to you the developer as a single GPU we saw these cabinets from gigabyte and MSI and azck and super micro and metac literally anybody who has the capability to make a Server Chassis is making bank with Nvidia because they cannot make these fast enough this is their system architecture all the way from the Grace Hopper super chip all the way up to the EnV link interconnects and even things Beyond rack scale it really is wild like how much money Nvidia is spending having these things built as quickly as possible now check out the block diagrams provided by Nvidia Developers for the Grace Hopper super chip the grace CPU supports up to 512 GB of LP ddr5x while the hopper GPU can handle up to 96 GB of hbm3 the interconnect speeds here are impressive the GPU can connect directly to the gray CPU or up to a network of 255 other Hopper gpus at 900 gab per second however the GPU memory interface that's the star of the show 3 terabytes per second that's that hbm 3E notice the CPU interconnect is just over a tenth of that even our ddr5 memory interface is Plucky by 
comparison now also take a look at the physical stuff for this system that I saw at computex they're moving all the complexity onto the CPU and GPU packages themselves I mean the super chip is relatively small here it's tiny this this complexity is all around the interface and this system does have a mix of lpddr5 and and hbm 3E slower system memory and faster system memory that's sort of what I'm talking about contrast this to Apple's M architecture M1 M2 M3 M4 those chips have on package memory but largely Apple hasn't taken advantage of the performance benefits that can be conferred from having on package memory it's mostly power savings that Apple has focused on and yet as I'm sure that the comment section will attest the folks that shelled out for the 96 gigabyt of memory and Beyond Apple laptops oh you might have overpaid there sorry all able to run large language models because of the unified memory architecture CPU memory GPU memory there's not really a distinction because it really physically is the same memory and so Apple's arm really was B basically able to just say hey uh let's just not move things around it's not terribly fast but it does work and it is generally a better experience on PC but it's a unified memory architecture nothing has to move between memory because the memory is literally the same and part of the reason the performance so good is because Apple's memory controller genuinely is really good it doesn't have a bottleneck and the caching system is quite good on the Apple arm M architecture and so it is a pretty reasonable experience even given that it's only lpddr5 and best case scenario you're not going to be anywhere near the bandwidth of like an hbm type solution so we're talking about memory bandwidth that is at best you know 25% less than what you would experience on the Nvidia CPU arm side of it now Intel's lunar Lake and amd's Hawk Point are trying to make this unified experience more feasible on PC mainly driven by npus like their neural Processing Unit it's still a work in in in in progress of course but part of the idea with resizable bar and making the mpu driver architecture based on wddm in the case of Intel is to try to better leverage access to data in the context of a GPU operation or CPU operation or npu operation to just make it a memory operation because it's all the same memory at the end of the day when we're talking about a portable system at least one that doesn't have a discreet GPU now let's let's let's bring that back to the Mi 300a it's x86 cores but the memory is on package and it's hbm 3E so the x86 cores are fed by this absurdly fast hbm 3E memory but x86 cores aren't really designed for that uh Patrick it served the home did a really nice expose on the Intel xeons that have hbm2e he got his hands on those and was able to do a deep dive and honestly I was expecting a lot more insane killer stuff to come from those it hasn't really materialized you could run those zons with only the hbm2e memory or with interner leaving or you could take over in software and have the software do caching and fun stuff I'm sure there's specialty software out there that now depends heavily on those CPUs but in terms of like the entire world is beating down your door to get those CPUs not so much and it would seem to be the case also with the Mi 300a now the Mi 300a does it seem the Mi 300a systems are disadvantaged with no dims there's no where to physically add memory I mean that might be a disadvantage but also cxl might bridge the gap and so this is 
a little bit of a background to say hey cxl could thread the needle for AMD customers and Nvidia customers both now Nvidia is not going to like that cxl can thread the needle and so this is the part you should pay attention to but I think that cxl as a way of expanding memory Beyond hbm3 could also be an interesting thing for AMD to do strategically like if AMD enabled better cxl functionality it could also kneecap Nvidia in a way that they frown upon let me explain cxl or compute express link has been in development for a while it's it's it's really cool in that it enables whole new branches of computer science for managing information and continuity in the case of of failure you can have a cxl device shared among multiple physical hosts the database people were really excited about this for being able to use cxl devices for assd compliance and transaction log uh save and and replay and it really does unlock some new computer science there because you're assured that the transactions make it off of system to another system but it's also low latency high speed high coherency and amd's done a lot of strategic Acquisitions for interconnect companies fpgas and you know pensando and all of this goes together but is getting a little beyond the scope of what I want to talk about in this video just know the cxl is highp speed in terms of it looks like a pcie Gen 5 device it's the the latency of pcie Gen 5 and so it's highs speed but when I say high speed you should think of high speed in the context of that Grace Hopper diagram where the numbers on on the left you're like okay but then you got three terabytes per second on the right it's like that's a whole other ball game in terms of speed what if there existed a cxl device that would allow you to expand the hbm memory okay yeah I mean the pcie speed is basically the same as a memory controllers you're not really at a terrible speed disadvantage when we're talking about Native ddr5 memory channels you're just at a dis disadvantage when we talk about hbm2e what if this is a way for NVIDIA customers to expand their vram that could be interesting that was another thing that I looked for at computex now keep in mind that Nvidia offers different configurations of Grace Hopper you can get dual Hopper no CPUs you know but it seems like in nvidia's architecture they're saying yeah 500 gabt of you know lowcost ddr5 to just shy of 100 gigabytes of of hbm 3E that's the good ratio we can feed it like that set of memory tiering we'll handle that in software developers won't have to worry about it it's going to make a lot of sense if Nvidia can get done in 96 GB of hbm 3E what will require AMD to use 128 or 288 gab of hbm 3E that is actually a competitive advant for NVIDIA but I think the reality is that the customers especially the open compute customers just want an absurd amount of the fastest memory you can get because all other variables don't matter that much which is interesting now the one shining star at computex in this thread of thoughts rattling around my brain was fison and their adaptive solution this is not Dam and it's not cxl they're using EnV pcie over nvme gpus can talk directly to nvme some of the plumbing and stuff is there for that on the Nvidia side as well and with fison adaptive solution they're able to take single level cell nand Flash and basically use it as a second tier of vram even with Nvidia based gpus so if you need to do a job that requires a terabyte of vram but you've only got four A1 100s that have you know 320 gigabytes 
of of memory you can make up the difference by throwing in some fison high performance nvmes and fison software and the software will be like you know ptor or whatever will be able to run and get that done like as if you have over a terabyte of vram this is a really interesting product so it can be done the other thing to look for at an operating system level is does your operating system have a concept of things like you have fast memory and slow memory and just a couple of years ago the answer to that was no but the Linux kernel has actually gotten support for more of these memory performance tiar not only that they've gotten really good support for things like Numa noes that don't have local compute but only have memory or that have local compute but no memory because that was the thing with 2000 series thread reper that's that's that's a video for another day but the fact that Linux support for that has become first class tells me that there are enough customers out there that are working on this kind of thing that yeah we could actually see systems that take more advantage of fast hbm 3E and slow ddr5 presumably Nvidia has that in their software stack but is proprietary probably and not open source and blah blah blah so it's really nice to see that on the Linux side because people will be able to take better advantage of that that also means the cxl devices plus an MI 300a should just work out of the box because those Zen cores should support cxl maybe it's an IOD die kind of a thing but if there was a you know one tbte ddr4 cxl module out there then boom that may be you're you're immediately good to go on your Mi 300a based system and you don't have any of the uh disadvantages of being locked into only hbm 3E memory don't know I think also based on the comments that I heard at compex with Nvidia spending so much money with so many partners I think those if Nvidia says M I don't know if we want to qualify a cxl device on an Nvidia based solution sorry um that could negatively impact how quickly we see a product like that come to Market that's Nvidia compatible meanwhile it's an amd's advantage to bring such a product to Market not only for the Mi 300a if they're interested in the Mi 300a but also to ensure that less Nvidia gpus get old possibly maybe in some parallel universe I mean okay possibly the other variable here is open compute so open compute is you know meta Facebook Amazon they dictate the hardware because all the hardware companies were playing games with how you were selling lots of Hardware to hyperscalers and so they've gotten to the point where they specified down to the chassis their Engineers are very good they're doing a lot of optimization the thing that I'm worried about is that I haven't seen quite as much uh involvement of open compute on the software side of it I'm sure that the software Engineers are are Whispering into the hardware Engineers ears no doubt about that but if you look at the meeting minutes and some of the open compute gett togethers uh Google and meta and Amazon are very very uh cards close to the chest when they're talking about their software stack and what their software needs in terms of enablement I'm sure that they share with Partners like AMD hey this is the software they we're working on under the hood under an N of course but that is not something that is really super openly discussed in those open compute meetings and I think that's putting the open compute members at a disadvantage because you possibly Miss Solutions like oh yeah we 
could just use cxl to to expand their vram space maybe or you know some other pcie type storage to expand their vram space maybe and solve some problems that we have immediately versus waiting for the next Hardware engineering cycle we also haven't really talked about how things that are far away increase latency and latency management also enters the equation here I mean if you look at the physical design of nvidia's racks they're putting all the fabric communication gear in the middle of the rack that's not for aesthetic reasons that's because we need to minimize the uh the signal propagation delay problems because speed of light it's real so yeah uh latency also factors in here and making the fabric as low latency as possible and there's there's some stuff that goes with that cxl has some stuff in it for hiding latency and doing coherency after the fact which is again black magic in terms of computer science but uh it's exciting because it's also a new computer science so anyway to wrap this up this is something that I wanted to be on your radar unified system architecture but also like software that can properly leverage fast and slow memory is a thing and software that can properly leverage fast slow memory unlocks a whole new set of possibilities without having to have a terabyte of hbm 3E on every GPU I don't think we really need that cuz we can't process that that quickly but this kind of is an inflection point and the inflection point is happening on mobile because unified system architecture on mobile saves power and just makes sense and so it's weird because For the First Time instead of data center Technologies trickling down to the consumer there may be some consumer software technologies that Trickle up to the developers running in the data center at least in terms of taking full advantage of a unified system architecture and then maybe it would make sense to have systems running on hbm 3E because right now hbm 3E in like a notebook you would expect that to be amazing but it's just not it's just not in the same way that those zons with hbm2e was not really amazing for most things it's just not it's just not most people don't realize that Nvidia has been working double plus overtime to ensure the developers don't actually feel the changes under the hood that exist with Blackwell and their next Generation platform that also ties into that promise that I was talking like line go up as long as Nvidia makes all of this transparent to the end user and that is 90% of the reason that Nvidia has gone up the way that it has their software moat if they fumble this in the slightest they're going to lose some of their grip on their development ecosystem because under the hood they had to make some changes in order to better address latency and scalability and everything else and Nvidia sees the solution as lots of GPU in a lot in a distributed pool with very low latency and so you get the best of all worlds in terms of an an enormous hbm 3E memory space because you don't have information duplicated across gpus and and everything else outside of Nvidia in the broader software ecosystem I think the rest of the players do still need to to play a little bit of catchup here and I think part of that is because the open compute players are not wanting to um work together all that much on the software because they want to maintain their their individual competitive Advantage they're they're really just not eager to share their software special sauce and this maybe puts AMD and other participants 
in open compute at a disadvantage because does it mean that that AMD has to solve these software problems and does it have to solve these software problems in a way that thread the needle that is not going to piss off their Enterprise customers I mean the Mii 300a could be the most amazing interesting way to run their Ai workloads and AMD has gotten some major wins through open compute with chat GPT I mean you you realize the chat GPT is running on Instinct systems from from AMD I honestly don't know if cxl would help with Mi 300a adoption it seems like Mi 300a should be you know something that's really Innovative but we didn't hear anything about it at computex and it could be that the Mi 325x or you know the Mi 300 successors are just going to overshadow the Mi 300a I mean nvidia's design with Grace Hopper is basically the traditional we have a a crap load of of ddr5 on tied to the CPU and we've got the the really high speed hbm 3E and the CPU is going to have to move things in and out and that's just the overhead that we're going to have to deal with maybe that's what it looks like in the real world but consumer devices that have one block I mean you're it doesn't make sense to have a laptop with 32 GB of dam and 8 gabt of vram like you should just put all that together and the operating system should look at it and say okay I should use the highs speed vram for the game the person is playing and when the person is not playing a game then we can use the high-speed vram for the operating system like that should just be part of the hardware architecture but that is way way easier said than done but it sure does look like consumer devices are going to drive some of that Innovation that may trickle back up to the Enterprise and make things a little easier on the Enterprise when we step back a little bit and we talk about you know disaggregated hardware and you have a pool of compute and is it CPU type compute is it GPU type compute what is your what are your containers needs I think my next step there's a certain company out there that is doing some really wild stuff in terms of putting a pool of x86 resources together in a highspeed cluster how about a server without a bios because the BIOS is too slow turn the CPU on feed it the agiza add it to the network and done that's that's a future after my own heart and likely a future video I've got to get out there to go see them soon and I think that some of the threads that I've started to pull on in this video uh we can pull on in some different ways in that video so I'm trying to work that out and be sure to look for that but yeah just some thoughts on where was the Mi 300a at computex but also how is NVIDIA presenting to developers something that's like a unified architecture but under the hood is in fact not and and is that a portant of things to come for our unified arm architecture future where the GPU does have fast and slow tiers of memory I don't know I'm one this level one I'm signing out you find me in the level one forms\n"