Milan-X - 768 MB L3, What's It Good For?

**The Future of Server Performance: AMD's 64-Core Milan-X Processor**

Server processors have been drawing increasing attention as their power and performance capabilities grow. The latest development in this field is AMD's Milan-X, a 64-core EPYC processor whose headline feature is a dramatically enlarged L3 cache that promises to change the way we approach server computing.

The new processor's cores are unchanged from standard Milan; it is the same silicon, and what changes is the L3 cache. AMD stacks an extra 64 MB of cache on each of the part's eight chiplets, taking a socket from 256 MB to 768 MB of L3, and a dual-socket system to a full 1.5 GB. The stacked cache does not raise clock speeds; the 64-core 7773X actually turbos slightly lower than its Milan counterpart, roughly 100 MHz in real-world testing, owing to the different thermals, and it draws only negligibly more power. The bet is that for server workloads whose working sets spill out of a conventional cache, keeping data on-package is worth far more than a hundred megahertz.

The most striking consequence for clusters is super-linear scaling. Ordinarily, doubling the number of nodes at best halves the runtime; with Milan-X it can do better than that, because every socket added also adds 768 MB to the cluster's aggregate L3. Slice a large problem into pieces of just the right size and each piece lives almost entirely in ultra-fast cache rather than main memory, so the speed-up exceeds the raw increase in compute.
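
A back-of-envelope sketch of that threshold effect (the 6 GB working set is a made-up illustration, not one of AMD's or the vendors' benchmarks):

```c
/* Hypothetical sizing exercise: a job's working set W split across N
   sockets starts fitting in L3 once W / N <= C, the per-socket cache.
   Doubling sockets past that point can more than halve runtime,
   because the slices stop spilling to DRAM. */
#include <stdio.h>

int main(void) {
    const double W = 6.0e9;   /* assumed 6 GB working set (made up)  */
    const double C = 768e6;   /* 768 MB of L3 per Milan-X socket     */
    for (int n = 2; n <= 16; n *= 2) {
        double slice = W / n;
        printf("%2d sockets: %4.0f MB/socket -> %s L3\n",
               n, slice / 1e6, slice <= C ? "fits in" : "spills past");
    }
    return 0;
}
```

Going from four sockets to eight crosses the point where each slice fits in L3, and that crossing is exactly where the better-than-2x scaling shows up.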

**Testing and Results**

AMD and its partners spent months evaluating the 64-core part across real server applications, and the results are impressive. Ansys reported speed-ups ranging from 10% to over 100% depending on problem size and parameters; Siemens saw up to 1.8x on Simcenter STAR-CCM+ simulations sized so that each node's slice lives in cache, falling to about 10% on the most complex models; Altair measured 1.3x to 1.8x with Milan-X as a drop-in replacement for Milan; and AMD's own chip-verification (EDA) jobs ran roughly 40% faster lightly loaded, rising to 66% on a fully loaded CPU. Database workloads are another natural fit: early testing suggests a hot, well-structured transactional database can see a 25% to 50% speed-up simply because far more of its working set stays in cache.

The testing also revealed that not all workloads benefit equally. A job whose working set already fits in Milan's 256 MB of L3 sees little change; Cinebench R23, which barely leaves the CPU, scores virtually the same on both parts. The win comes when the working set overflows a conventional cache but fits, or mostly fits, in Milan-X's 768 MB per socket. Profiling tools such as AMD uProf, or Intel's VTune on that platform, expose L3 hit rates and can tell you which camp your workload falls into. Notably, there are no real regressions: if some piece of software misbehaves when faced with three-quarters of a gigabyte of cache, the extra L3 can even be disabled in the server's BIOS.
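
To get a feel for where your own working set sits relative to the cache, a classic trick is a pointer-chase microbenchmark. The sketch below is illustrative only; the sizes, iteration count, and use of `rand()` are arbitrary choices, not anything from AMD's or the vendors' test suites. Latency per load climbs sharply once the buffer outgrows L3:

```c
/* Minimal pointer-chase sketch (illustrative; not AMD's methodology).
   Each load depends on the previous one, so average latency per load
   jumps once the working set stops fitting in L3 -- the signature of
   a workload that stands to gain from Milan-X. Build: cc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LOADS (1u << 25)  /* dependent loads timed per working-set size */

int main(void) {
    for (size_t mb = 64; mb <= 2048; mb *= 2) {
        size_t n = mb * 1024 * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return 1;

        /* Sattolo's algorithm: one random cycle covering every slot,
           which defeats both short cycles and the hardware prefetcher. */
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;   /* j < i */
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        size_t idx = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (unsigned i = 0; i < LOADS; i++) idx = next[idx];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                     (double)(t1.tv_nsec - t0.tv_nsec)) / LOADS;
        printf("%5zu MB working set: %6.1f ns/load (idx=%zu)\n",
               mb, ns, idx);   /* printing idx keeps the loop live */
        free(next);
    }
    return 0;
}
```

On a conventional Milan part you would expect the latency cliff just past 256 MB; on Milan-X it should move out toward 768 MB per socket.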

**The Role of Compiler and Software Optimization**

One of the most exciting aspects of this processor is the performance still left on the table for compilers and low-level software. The physical hooks for stacked L3 have reportedly existed since EPYC Rome; it took this long for the software side to catch up, and nobody has yet redesigned algorithms around the assumption of 1.5 GB of cache in a box. Today's gains come almost entirely from existing code simply missing cache less often, which leaves rethinking data layouts and blocking strategies around this capacity as an open opportunity.

In fact, AMD's own math libraries have already taken the first step: the AOCL BLIS library (which Ansys used in its evaluation) and the ROCm stack have been updated to exploit the larger L3, so applications built on them gain performance from a straight CPU swap, with no code changes at all.
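
As a concrete, hedged illustration of why that works: BLIS exposes the standard BLAS/CBLAS interface, so a program written against CBLAS inherits whatever cache blocking the library does internally. Nothing below is specific to Milan-X, and the matrix size and link flags are arbitrary examples; the point is that the call site does not change when the library retunes its block sizes for a bigger L3.

```c
/* Sketch: application code that benefits from a retuned math library
   without modification. cblas_dgemm computes C = alpha*A*B + beta*C;
   the blocked kernels inside the BLAS implementation decide how much
   of A and B stays resident in cache. Link against your CBLAS
   provider, e.g. -lblis or -lopenblas (illustrative flags). */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 2048;                        /* arbitrary demo size */
    double *a = malloc(sizeof(double) * n * n);
    double *b = malloc(sizeof(double) * n * n);
    double *c = calloc((size_t)n * n, sizeof(double));
    if (!a || !b || !c) return 1;

    for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* C = 1.0 * A x B + 0.0 * C, row-major, no transposes. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %.1f (expected %.1f)\n", c[0], 2.0 * n);
    free(a); free(b); free(c);
    return 0;
}
```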

**Windows Server 2022 and Other Operating Systems**

While the 64-core Milan-X is aimed at server workloads, operating system behavior still matters. Much of the hands-on testing here was done under Windows Server 2022, with solid, though not uniformly impressive, results: outside of virtualization, Windows does not always seem to know what to do with 128 cores in a box. Microsoft, for its part, indicated it will move its high-performance Azure instances to Milan-X as fast as AMD can supply the chips.

However, it's worth noting that not all operating systems are equally ready to exploit this much cache and this many cores, at least in certain scenarios; Linux handled the part without drama in the testing described here.

**The Future of High-Performance Computing**

As we move forward, large-cache parts look less like a one-off experiment and more like a fixture of the server roadmap: this is the first serious push in roughly a decade to put this much ultra-fast memory right where the algorithms need it. The development of processors like Milan-X marks a significant shift toward more powerful and more capable computing systems.

For developers, this means profiling tools and library support that make it practical to exploit the available cache. For users, it means faster performance and improved responsiveness in demanding server workloads.

**Conclusion**

In conclusion, AMD's Milan-X represents a significant leap in high-performance computing capability. With 768 MB of L3 per socket, drop-in compatibility with existing Milan platforms, a premium of roughly $800 per socket, and no licensing surcharges signalled by the major software vendors, it is poised to change how cache-hungry server workloads are provisioned. As developers, users, and enthusiasts, we can't wait to see how these processors shape the future of computing.

"WEBVTTKind: captionsLanguage: enso much lab happening that's milan x it has 1.5 gigabytes of l3 cache and it's running all my benchmarks this is the game changer it is completely completely bananas and be epic indeed this is epic in this corner weighing in at 128 cores 256 threads we have the gigabyte mz72hb0 and in this corner we have amd's test platform the daytona dual socket server get some red lights and a 1.5 gigs of l3 cache half a terabyte of memory let's put the top back on before it gets angry currently doing windows testing because of reasons but this is happening it's happening right now so so very exciting i got to spend a full day with amd through austin campus not just talking with people from amd amd actually brought in others that have been exposed to milan x for weeks and months i got the impression we met several different people in the industry that worked in different aspects that will benefit from the changes here to milan x now to be sure the difference in the core is none from milan the 64 core part to milan x the chords themselves are exactly the same what we're talking about with the milan x launch today is the addition of more l3 cache it's milan processors but with more l3 cache these are coming a little bit later in the product cycle but these are the exact same milan cores it's the exact same milan silicon amd's got all these tricks up their sleeve they do fun interesting things with the iodine it's like oh reverse the io die now we've got our our chip set and blah blah blah so it turns out that the connections here for more l3 cache have been here since epic rome but it took that long in the software cycle to sort of catch up and sort of rethink how we do things from a software standpoint in order to really take advantage of this so we moved from 256 megs of l3 if we're talking about an epic cpu that has eight chiplets to 768 mags an additional 64 megabytes per chiplet that means our dual socket system that we're testing yeah that's right 1.5 gigabytes of l3 cache and right at the you know right out the gate it's like wait this is the system that can do over a hundred thousand in cinebench r23 why don't we run that again and see what we can get out of cinema r23 bad news it's still a hundred thousand the performance is almost identical in cinebench why is that it's casino bench mostly already lives on the cpu it doesn't have to communicate with memory l3 cache comes in when the working set is so large that it won't fit in the cache on the cpu mulan already has a monstrous 256 megabyte l3 when we're talking about you know access to crawl across all eight chiplets this is taking it to the next level cinebench doesn't really help so we met industry experts talking about fluid dynamics and simulation cfd computational fluid dynamics that's kind of a big deal in the industry those folks are willing to spend a lot of money the first company we heard from was ansys ansys they're they're a huge company in engineering you know lucid motors in their design well guess who's working with them they've been evaluating milan x for a number of months now and including the aocl bliss library now this is a math library basically but you can do certain fun things with math libraries and math algorithms when you've got so much cash so they're seeing a ten percent to over a hundred percent speed up depending on the size of the problem and other parameters that go into it one of the test problems they shared was a three-car collision now this is a physics simulation mechanical 
simulation of three cars colliding and they want to look at how the frames deform and this is a real truth like if you make this out of metal you do the simulation this is what we're talking about here engineering simulation these big gnarly problems and ansys provided a ton of documentation and a ton of examples and places where it's like okay this job really didn't see a tremendous speed up but this other job was more than twice as fast just increasing the amount of cash that was available to the processes running on the cluster this is a really insightful presentation from ansys you know siemens for example i'm joshua strodebeck i am the technical product manager for high performance computing for the sim center star ccm plus product at siemens so when we talk about a 1 billion cell simulation you have to remember that sim center star ccm plus has a fundamentally scalable high performance computing type of architecture that means when you have a large problem you can chop it up into small pieces and run it on huge numbers of cores so for a billion cells you might see a user run on as many as thirty thousand sometimes even more cores we typically recommend about two gigabytes per million cells of memory so these are often very memory intensive simulations as well and overall you're going to see from our users a demand for lots of processing power lots of memory and lots of bandwidth this is just going to hit the computer about as hard as it can be hit 30 000 cores on your siemens cluster is not unusual especially when you're doing large simulations sometimes your large simulation might have a billion solids that it has to compute and this is data that's from the real world so it doesn't really fit neatly necessarily into two and three dimensional vectors and so the access pattern if you're trying to you know it's like okay is this row wise is this column wise what do i need to do to optimize optimize my algorithm you're going to have a little trouble because it's real world we're doing real simulation one of the examples that they showed seem to be some type of washing machine and it's just like we want to you know model the movement of water inside an empty washing machine where it's just you know doing the the the temperature of the water and the dynamics and it's moving things around and it's sort of modeling the water at least three-dimensional it turns out the access pattern for that is is not great it's really random and so depending on how much speed up you get has to do with how complex the models are and then there's this thing that we heard about called super linearity which is you know if you go from four nodes in a cluster to eight nodes in a cluster your best case scenario is that you cut the time in half with mulan x that's out the window you can actually get better than half you get a better speed up when you resize the problem so that more of the problem is able to live in the processor cache if you have a million solid thing that you're simulating for example on a single system you can see a 1.8 x speed up because things are living inside the cache just when comparing milan to mulan x having that much cache available means that you don't have to go to main memory as much even for these weird access patterns that are very difficult to predict and so that's a pretty amazing speed up similarly by the same token you know the sword cuts both ways if you have an absurdly complex simulation that has say 10 million solids the performance uplift is only about 10 percent moving 
from 256 megs of l3 cash to three quarters of a gigabyte 10 speed up might not sound like a lot but for these jobs running 30 000 cores you know you run something like that overnight you're going to save an hour off of that job but more importantly when you do have a cluster that has 30 000 nodes in it if the thing is scheduling the jobs on the cluster can be just a little bit smart about it and estimate and say okay we're going to break this down into problems we know that problems that are around a million nodes or so run really well on these processors and if we if we break it up too small then we're our performance is dominated by the communication overhead between nodes and if we leave the job too big then we don't get that super linearity so if we pick the job and slice it up into slices that are just the right size we can get 1.5 x speed up or 1.6 x speed up or 1.7 or 1.8x speedups just from having the large cash because we've picked an appropriate size job that can live in those processors and so then the question becomes what's the cost delta on that well from a licensing perspective you know patrick from serve the home was there to ask all the licensing questions none of the vendors indicated that there's going to be any licensing or charge differences between milan and milan x it's the same number of cores it's the same number of accesses the same amount of compute raw compute you just get more cash so if you can get a 1.8 x speed up from that no problem what about the cost of the cpu then it's an extra 800 per socket a trivial cost so then the question becomes are there any regressions no there's not really any regressions you can actually even in the bios of your server disable the extra l3 cache if by some miracle you trip over a bug or something where the software doesn't expect you to have three quarters of a gigabyte of cache you can disable it now the 7773 does turbo a little lower and it does clock a little lower owing to the different thermals and i was able to observe that on our test system when testing under linux it'll use a little bit more power because we've got another 64 megs but the amount of power that uses seems to be negligible so we're talking about maybe 100 megahertz difference real world when we're looking at these kinds of devops jobs that i'm running which we'll get to in a minute but uh yeah yeah it's uh for common computational things computational fluid dynamics not just siemens also it also turns out microprocessor simulation is another one uh you know amd wanted to show the performance uplift for doing you know processor simulation and they said well we have the zen four core let's show you the simulation of the zen four core psych just kidding no the performance uplift is there and we can see it in the jobs like they log me into the terminal and we're like look at the differences on the terminal and it's like i believe you but i didn't actually get to see any registers or flip flops or anything in the in the zen four core and that's that's not exactly something that they're they're gonna give you know rando youtubers it's fine it's totally fine hey i'm phil steinke i'm a fellow in our central methodology and tools department at amd and one of the main areas i specialize in is looking at the eda workloads that we use to design chips and how they run on our epic platform so we did an eda verification on that milan versus mulan x monster cpu versus monster cpu with one and a half gigabytes of ram in the dual socket configuration yeah it's faster and 
this is verifying the actual silicon layout stuff the eda for the 6900 xt gpu amd's own so amd's definitely dog fooding their own products here we grabbed one of our big gnarly simulations a little while ago this is simulating the graphics core on our flagship radeon 6900 xt graphics product and it's a triangle setup and simple compute operation across the entire graphics course so it's a big design it's exercising the whole graphics core it's a really hefty simulation that really pushes the tool so we use that to benchmark our simulation performance when we're taking a look at different cpu models and tweaking out our server configs to make sure that we get the most out of it running that benchmark sim on our regular epic 7003 series cpus versus these new ones with the 3d v cache we see about a forty percent uplift when those jobs are just one to eight and each job has a whole chiplet to run on and an even higher uplift as the server gets loaded up to 66 on a fully loaded 16 core cpu and just to be clear that's a 66 performance improvement just swapping in the cpu from the regular 7003 series to these new ones with the 3d v cache and so managing the jobs on these clusters running the simulation you know somebody makes a change they might have a gazillion simulations a day so on the y diagram there's many aspects that go into making a chip in designing the function of the chip to get it right we have to simulate a lot that functional behavior because every little code change could introduce bugs or change how that's going to operate and to make sure the chip is doing what it's supposed to do that's a key piece that we run again and again to make sure our chip behavior is as expected we run thousands upon thousands of these simulations probably over a million daily and yeah they chew up at least 60 of our data center and so i really deeply appreciate having face time with real world users of these products another thing that also stood out to me was the common theme of amd wanted to talk to these software folks and they're saying yeah no amd came to us and said well let's take a look at your your jobs your workflow the things that you're running and see what kinds of things would help and it turns out one of those things was a massive massive l3 cache and yeah the proof is in the pudding the speed up is here hello my name is push patel i'm alter senior vice president for strategic partnerships alter is a software company focusing on simulation high performance computing and data analytics we've worked with amd to benchmark our applications on the latest epic processor and compared to the previous generation we're seeing speed ups from 1.3 to 1.8x when we evaluated milan x we didn't really have to do anything special to optimize our applications it was kind of a drop-in replacement for the previous generation milan and ultimately customers because of the way we do licensing there's no penalization for running faster in fact they get better utilization of the tokens and units and licenses that they already have the common theme here is you know amd is not creating a product and then saying okay now it's up to you to use the product effectively it is this thing's ready to go it is drop in ready you just plop these in a new socket you're good to go most surprisingly azure microsoft has their high performance azure instances boom overnight as fast as amd can get them sounded like that they were going to swap in milan x for all of their high performance computing nodes now these are if you're not 
familiar with azure these high performance nodes they're on a 200 gigabit infiniband they've got insane node to node communication and that's definitely not something that i've had a good experience with on other cloud providers in terms of ultra ultra ultra high speed connections microsoft really does have that down for the high performance nodes in their cluster so that's a that's a whole other separate conversation but they're going to take all their milan cpus put those in another sku and all their high performance nodes are going to milan x because of the infiniband because of the cash because it changes the game on the metric on allocating time on these high performance compute clusters and you know on on those you can get i think up to 120 cores a single instance so some of the cores are reserved for you know cluster bookkeeping and that kind of thing but that's still that's pretty nuts now for our own testing at level one devops devops is the name of the game i've got a project right now that i'm working on with greg crow hartman where want to speed up the kernel build as fast as possible and in this scenario building a single kernel the extra cash doesn't really help sorry it's just like the kernel the linux kernel all things considered versus simulating a washing machine i'm sorry the linux kernel just not that complicated even when you're building all the modules now if we need to build 12 or 15 kernels at once okay now we're starting to get somewhere now we can start to see a little bit of a performance uplift we can move from 30 kernels an hour to 34 35 kernels an hour we can get a little bit better performance but only when we're running lots of compile jobs in parallel now for something like open embedded if we want to build all of open embedded from scratch similarly it didn't really help a ton now where it did help a ton at least it seemed to was when we were running tons and tons and tons of containers now containers is a little different scenario the kernel has specific stuff in it for containers so running on the red hat platform and doing some stuff with openshift it sure seemed like having the extra l3 cache really helped us a lot when we're standing up this cluster we reserve two cores for the full open shift you know cluster system and then we just sort of let it go to town it's it's initializing ssds there's actually 24 nvme and two p5800x ssds in the system that are configured mostly for ceph storage so the cluster looks at it and says oh that's actually local and there's a there's a patch and a thing that goes that goes through that to figure that out and greg from red hat figured all that out but in this setup when we're spooling this thing up it's not unusual to see as the cluster comes up the load average be about 600 and even though the load average is 600 the system is actually still surprisingly responsive and still you're able to log in you're still able to connect with ssh you're still able to do stuff part of that's because we reserved some cores part of that is because the operating system structures are not constantly falling out of the l3 cache because the system is so busy doing other things so potentially the enormous l3 cache has a huge benefit now i'm sure pharonix also has a ton of benchmarks for things on linux in particular i'm still working on a lot of my heavier devops benchmarks but if you have some specific benchmarks you want me to run definitely let me know in the level 1 forum or if you want to you know work on a larger project where i 
can get you connected and we can try to do a couple of things i would be glad to do that i think that'll actually be pretty interesting stuff so how do you know if your job or your workload or what you're doing is going to benefit from the monster cash in milan x i'm so glad you asked if you're already on amd the performance counters are exposed to you through utility from amd called microprof you can download that and generate graphs similar to these that show the hit rate this shows an effective you know one point something ish instructions per clock to about two instructions per clock moving to the v cash milan cpu for that circuit simulation job yeah that's the kind of uplift that we're talking about when we talk about 1.8x this is a much better l3 cash hit rate if you're on an intel platform and you're not sure what your cash hit rate is intel has a utility called v tune you can use that but there's also a sort of famous open source developer brendan gregg he's written performance and uh and other sort of performance related books about linux and the types of performance counters that are in linux check out his website in the scripts what we're using here is really not super modified from what he has there we're just we're just plugging in the micro profit utility for that you can check that out i'm sure we're going to have future videos on this so that you can profile your job now if your job has a really high cash miss rate having more cash doesn't necessarily imply that your hit rate is going to be better but if you've already got a really good cash hit rate showing that most of your stuff is in cash then you're probably not going to benefit from more cash so you can at least learn some things you're going to have to make an educated guess but these are industry standard tools and it's nice to see that there's not a lot of black magic or smoke and mirrors going on here so at the end of the day it really is going to depend on your workload and you're going to have to do a little bit of leg work to know if the job that you have will benefit tremendously from between 256 and 768 megabytes of l3 cash per socket because that is what you need to know if your job already reasonably fits in the cash that you have then this is probably not going to benefit you and that's why amd was pretty careful to say this is not really a new flagship cpu this is not intended to replace anything this is meant to help solve the problem for one particular segment of the industry you know we have frequency optimized uh parts uh that have a higher clock speed maybe a lower core count 32 cores 4 gigahertz 8 cores and 4.1 gigahertz those are frequency optimized epic skus that are available they're not intended to replace or supplant anything there are customers that need the highest possible clock speed and they'll use that there are customers that previously were using you know souped up desktop machines because what they were running just needed clock speed clock speed was all it cared about but now now you know those eight core epics 4.1 gigahertz with the large cache that's pretty good stuff in some of the testing that we did every 16 megabytes of l3 cache was basically given for a 100 megahertz equivalent uh clock bump so moving up to 768 megs of cash puts us in that five gigahertz range for certain algorithms similar to the frequency optimized skus these are going to be available in more than the 64 core monster there's also the 32 core the 24 core and a 16 core that will be available the 16 core 
especially if you have a database workload where you would benefit from having a tremendous amount of l3 cache to cache gobs and gobs of memory for your database access and your database access pattern fits into a pattern that can be predicted by the predictor and the prefetcher that will leverage that correctly then this could be a great way to dramatically speed up your sql instance mr glenn berry microsoft mvp and microsoft sql expert will be helping me do that testing on windows but i've also got additional testing for mysql and postgresql it's good stuff and early indications on that testing show pretty much the same thing can i create a database that will not benefit from an l3 cache and be slow the answer to that question is yes most assuredly i can do that but if i do a little bit of work and i have a really hot database that's got lots of transactions if i structure it correctly the difference in l3 cache can be good for a 50 percent speed up or 25 percent speed up it's going to depend on the workload so this is a pretty exciting time and as i talk about this know that you know the compiler people haven't stepped in to say oh let's do this other algorithm if we have 1.5 gigs of cash and other low-level software people in operating system kernels etc i mean even windows server 2022 which is what we did a lot of our testing on it doesn't always know what to do with 128 cores in a box it's really kind of shocking it does in some scenarios it does when you're dealing with virtualization but other stuff maybe that's my own bias talking but i just does some things that make me scratch my head but 4.5 gigabytes of l3 and 1.5 gigabytes of l3 mean that there are scenarios that you can create that will result in a super linear speed up and that's where doubling the number of processors in your cluster more than doubles performance and the reason for that is because every socket that you add that you go from 256 megs to 768 megs of l3 cache you are dramatically increasing the amount of cache memory in the entire cluster as a whole so if you slice the job up correctly then the job will mostly live in that ultra fast l3 memory and certainly from the people that we interviewed at amd the computational fluid dynamics people they are all more than willing to do that but so far they haven't had to do any work to be able to do that the algorithms and their cfd packages can take advantage of it the amd math libraries like bliss for example they're already implementing those they're already implementing the rock m stack and all of the stuff that happens in the processor in those libraries have already been updated to take advantage of the monstrous l3 cache so they automatically get performance uplift just by changing the cpu in the socket not really doing anything herculean because those algorithms would already benefit from that cache so that's pretty exciting it's pretty exciting to see amd move the state of the art forward with these processors i really wouldn't be surprised if from here on out that when new processors launch there's always going to be a large l3 cache sku certainly in the market i've seen you know off-road map cpus that change and tweak the cache parameters we've certainly seen processors that have an l4 cache another layer of caching to try to address some of these edge cases but this is the first real serious earnest push in server that we've seen like this in probably 10 years to really get super ultra high speed memory right at the processor right where these algorithms need 
it most and that's pretty exciting i'm willing to level one again big thanks to amd for having me down to austin to meet some folks in the industry and share some info from from their brain dump i learned a lot i made some awesome contacts and i can't wait to do some more future content based on this processor this performance and some of the stuff that i learned there i'm wendell this is level one i'm signing out and you can find me in the level one forums you\n"