BigTech - $250k of Persistent Memory w/ Allyn

The World of High-Performance Computing: A Conversation with Wendell and Allyn

As we sat down to talk about the latest advancements in high-performance computing, I couldn't help but feel a sense of awe at the sheer scale of some of these systems. We're talking 100 gigabytes per second of bandwidth moving between two parts of the system - and in this world, that's almost routine. But what really caught my attention was when Wendell observed that persistent memory is basically more like a peripheral than a memory device. "Yeah, you had to use the memory bus for the peripheral because the PCIe bus was too slow," he explained. "Or there wasn't enough of it." That one remark captures just how much bandwidth is in play here: the system's 16 persistent-memory DIMMs deliver roughly 16 SSDs' worth of throughput, which is exactly why they sit on the memory bus instead of PCIe.
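To put those figures together: the conversation quotes each persistent-memory DIMM at over 8 GB/s of sequential bandwidth, with 16 DIMMs in the box. A quick back-of-the-envelope sketch, using only the numbers quoted above:

```python
# Aggregate bandwidth estimate for the persistent-memory config discussed
# above. The per-DIMM figure (~8 GB/s sequential) and the DIMM count (16)
# come from the conversation; the rest is simple arithmetic.

DIMM_COUNT = 16
GB_PER_SEC_PER_DIMM = 8  # "each one of those dimms will do over eight gig per second"

aggregate_gb_per_sec = DIMM_COUNT * GB_PER_SEC_PER_DIMM
print(aggregate_gb_per_sec)  # 128 GB/s in aggregate
```

That 128 GB/s aggregate is in the same league as the "100 gigabytes per second" of traffic mentioned above, and it's easy to see why no reasonable number of PCIe lanes was an attractive alternative.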

One of the most fascinating aspects of this technology is the way it handles memory coherency across sockets and cores. Wendell noted that even when you think you've written something to one place in memory, that data might also be cached somewhere else. "So did you let everybody know that had that cache?" he asked. This raises all sorts of interesting questions about how these systems manage data storage and retrieval. It's not just a matter of throwing more RAM at the problem - there are entire hardware architectures dedicated to exactly this kind of coordination.
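To make that invalidation bookkeeping concrete, here is a deliberately tiny toy model of write-invalidate coherency - not the actual protocol these CPUs run (real designs like MESI track far more state in hardware), just an illustration of the "let everybody know that had that cache" step Wendell describes:

```python
# Toy write-invalidate coherency model: when one core writes a line,
# every other core holding a cached copy is told its copy is stale.
# Purely illustrative; real coherency protocols are vastly more involved.

class ToyCoherentMemory:
    def __init__(self, num_cores):
        self.memory = {}                           # address -> value
        self.caches = [dict() for _ in range(num_cores)]

    def read(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:                      # miss: fill from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, core, addr, value):
        # "So did you let everybody know that had that cache?"
        for other, cache in enumerate(self.caches):
            if other != core:
                cache.pop(addr, None)              # invalidate stale copies
        self.caches[core][addr] = value
        self.memory[addr] = value

mem = ToyCoherentMemory(num_cores=2)
mem.write(0, 0x10, 42)
assert mem.read(1, 0x10) == 42  # core 1 sees core 0's write, not a stale copy
```

Even this toy version shows why the cost grows with core and socket count: every write potentially involves every other cache in the system.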

But what really gets me excited is the sheer scale of the engineering involved. Wendell explained that even a workload you think is confined to one socket isn't truly local: extra logic duplicates some of that memory content across both sockets, so that if a thread happens to migrate to the other socket, the data it needs is already nearby. That, in turn, raises hard questions about trust and coherence - when two cached copies disagree, which one do you trust, and how do you resolve the conflict?
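One practical, software-level answer to "keep the data local" is simply to pin a workload to one socket's CPUs so its threads never migrate. A minimal Linux-only sketch in Python - the CPU ID used here is an illustrative assumption; on a real two-socket machine you would first look up which CPU IDs belong to which NUMA node (for example, under /sys/devices/system/node/):

```python
# Pin the current process to a fixed CPU set so the scheduler cannot
# migrate its threads to the other socket. Linux-only (sched_setaffinity);
# on a real NUMA system you'd pass the CPU IDs of one node, not just {0}.

import os

def pin_to_cpus(cpu_ids):
    """Restrict the current process (pid 0 = self) to the given CPU set."""
    os.sched_setaffinity(0, cpu_ids)
    return os.sched_getaffinity(0)

# CPU 0 always exists, so it stands in here for "node 0's CPUs".
print(pin_to_cpus({0}))
```

Tools like numactl do the same job (plus memory-policy control) from the command line; the point is that on machines like this, placement stops being optional.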

Of course, this kind of technology isn't just for enthusiasts like Wendell and me. Entire companies build high-performance computing systems for Fortune 500 customers, and at this level the hardware is still largely custom-built. All of that cross-socket complexity means a lot of extra work goes into getting these systems to behave properly - but it also unlocks enormous performance and efficiency gains.

As we talked, I couldn't help but feel a sense of wonder at the scope of this technology. A single one of these chassis is worth a quarter of a million dollars - and nobody is ordering them by the rack. This isn't a toy or a hobby; it's real-world computing that will shape the future of industries like finance and healthcare. And yet, despite all its complexity and scale, there are still moments of simplicity and beauty to be found in this world.

And speaking of which, we've got some exciting news for our readers: Wendell has been playing with some pre-release hardware from one of these companies, and he's willing to share his experience with us. So stay tuned for more updates on PCI Express 5.0 on Alder Lake - it might be the first public look at the technology, since CES was effectively cancelled this year. And who knows? Maybe we'll even get a glimpse of what this means for the future of high-performance computing.

In the meantime, let's take a closer look at some of the specifics of Wendell's system. He mentioned 16 persistent-memory DIMMs alongside a front panel full of NVMe bays, all wired through multiple interfaces - how does that work? And what exactly is PCI Express 5.0 on Alder Lake? We'll be exploring all these questions and more in our next video.
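While we wait for that video, the headline capacity is easy to reconstruct. Intel's Optane persistent-memory modules shipped in 128, 256, and 512 GB sizes; assuming the 256 GB modules here (an assumption, but the one that matches the four terabytes of persistent memory quoted for this demo system):

```python
# Rough capacity math for the system Wendell describes. The 256 GB
# per-module size is an assumed configuration (Optane PMem also came
# in 128 GB and 512 GB modules); 16 modules at 256 GB each gives the
# headline persistent-memory figure.

DIMM_COUNT = 16
GB_PER_DIMM = 256          # assumed module size

total_gb = DIMM_COUNT * GB_PER_DIMM
total_tb = total_gb / 1024
print(total_gb, total_tb)  # 4096 GB -> 4.0 TB of persistent memory
```

Four terabytes of memory-bus-attached, persistent capacity is the part that justifies the quarter-million-dollar price tag.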

As I wrap up this article, I have to say that I'm still in awe of the sheer scale and complexity of some of these systems. It's not just about throwing more hardware at a problem - it's about understanding the underlying architecture and how it all works together. And yet, despite all its complexity, there are moments of beauty and wonder to be found here. Whether you're an enthusiast or a professional, this world of high-performance computing is definitely worth exploring.

"WEBVTTKind: captionsLanguage: eni have on my desk a quarter of a million dollar server which of course gets an introduction before mr allen from intel you brought a very very expensive server uh i did i figured uh you know the whole ces thing was a little bit of a bust this year we're having many ces at level one yeah i'll just i'll just bring bring the ces to wendell wendell doesn't have to come to the ces this was literally gonna be one of the show floor demos with the four terabytes of persistent memory i yeah that's not even that's not even the crazy part 88 pci express lanes to the front yeah there's no switch they're all wired there's no switches in here they're all just hbas that are one-to-one just re-timer chips and so like it's literally this is as i don't know i don't even know the right way to describe it it's just bonkers it's like every pot almost every possible pcie lane from every possible connection point on this board like there's i don't know there's these slim size connectors that go into like dual four lanes into a single eight lane slim sass just to make these happen and then there's four of those so there's like 32 lanes just from the front and that's what that's what would normally go to like the front panel nvme of a standard build do you know how hard it is to get those cables it's impossible you can't even look these up i've tried looking up some of these part numbers yeah you can't even order them right now and then so it's like well that's not enough okay well then there's you know there's we'll just add these three retimer cards so that's like another 48 lanes because each retimer has two of the slim sas connectors and then those are all going up there and then that's not enough that only adds up to 80. 
so i had to like i couldn't figure out like for a while there's eight more where's the where's the like i'm doing the math and i'm looking and no there's snuck in here in the back corner and we'll get some b-roll in there there's another slim sass that just happens to be off the board here and then that's routed to another spot over there so this is a super micro chassis yeah and this is designed for very very specialty applications you are not going to run to your cio and say hey we need to buy this this is a this is a server that requires different computer science pretty much yeah especially if you're doing the pmem config now the pmem config could apply to you know you don't need the bunker's front panel stuff to do the pman piece of this but this is just sort of everything all in one yeah chassis right well the thing so computers were never all the software that we have now is not designed for persistent memory and the thing that our viewers probably may not really be super clued into is that because the memory is not persistent we actually do a lot of herculean things in software at least important software right to try to make sure that we don't lose data and so this is the perfect system you're thinking for running a database system this is pretty much you know you're looking at oracle as pretty much your only option for software that is actually validated for this kind of a configuration there are patches for postgresql don't get me wrong but you run into scalability problems with databases because databases assume that they're going to lose information that's in memory a microsoft sql server and oracle and a couple other commercial database systems do have a provision for persistent memory but it completely changes the performance landscape if the database server can count on the memory not losing information then we're talking orders of magnitude speed up in the operation of the database server and that's why somebody would pay a quarter of a million 
dollars for this much persistent memory right and especially if just the persistent memory level like each one of these sticks will do you know what your your obtain your crazy obtain drives like you've got the p5800 xp 1500 yeah well i mean you got a couple of them and that's that's you know 3.6 terabytes those but those devices your your latencies are you know lower than 10 microseconds which is impressive and actually dip down into like the sixes and the sevens when you depending on if you're in linux and if you're doing actual you know good workloads that can take advantage of it uh the pmm sticks you're down to like point one to point three a lot more bandwidth in other words well that's for latencies yeah right so you can get relatively high bandwidth worth of very small requests that are just peppering the device right so that's handy but then if you do hit it in a straight line each one of those dims will do over eight gig per second and there's 16 of them yeah so just in raw bandwidth i mean yeah it's not you would think that the bandwidth would be uh higher because it's plugged into a dimm slot and you'd be like oh yeah that should go way faster than that but then you're just uh limited at the device level yeah really for just how much bandwidth can you get from the octane chips on the you know on the part itself and this platform is so bleeding edge that if you wanted to do a mixed configuration mixed persistent memory and mixed optane that's not necessarily going to work either you're going to have to consult an expert to help you configure that system to make sure that you're not basically specifying an invalid configuration right yeah so this is 200 series pmem which is the newer newer pmm style that's what you would use for ice lake and then speaking of bunkers config so this config is i don't even think the cpus are supposed to be in this config because they're 8368qs which i think are supposed to be water cooled oh yeah or they're meant to be uh 
but you know i'm bending the rules a little because i can i guess um 400 watts no they only they still only tdp at 270 watts a piece uh but they're bent they're like three bins higher so they're you know 300 megahertz higher uh clock for boost yeah um so you could you sort of make up the difference because these are only 38 core instead of 40. so you have the crazy bunkers 40 core yeah i have the 8380s i would much rather have these cpus just because the 8380s the 380s are great for a multi-core workload and they're they're you know a lot of the hypervisors i mean intel i think has been working with hypervisors because you get some of the i've got some of the sky lake and cascade lake uh cast offs from amazon and all amazon cares about is disable the c-states set the maximum you know all-core turbo and forget about it they literally don't want anything else and the 83 the 83 80s that is what they're built to do but these with the turbo where you have kind of a bursty workload it's a much better situation yeah you get a little more speed uh well actually even the all core is a little bit better but depending on your workload if your all-core workload is just hitting the 270 you're basically going to get the same overall speed you know regardless of which part it is right um but for any of those workloads that are easier on the cores where you know you can stick within your tdp budget and still get to a higher turbo these will go to a slightly higher turbo which is handy i've also noticed that with the most recent super micro bios and the super micro desktop board that i have with the 8380s yeah that they're unlocked basically they will turbo in you know uh oh you got that working until infinity oh yeah yeah as far as like tau yeah right yeah they'll no they'll just sit at 270 yeah all the time yeah yeah they'll say yeah yeah as long as as long as you can you know keep them cool which is not it's normally not too hard especially you've got the knock tools on your on 
your crazy desktop system a little harder on this uh it just has to be louder we'll see that in a minute yeah it does keep them cool it does keep them you know 60c i don't think they get much past xdc even in this build but the fans are just totally screaming yeah it's it's just a it's like a fire engine well we've also seen companies like keoxia with their 32 terabyte nvme that would be perfect for the u.2 slots you got 2.3 slots in the front of this case uh 0.3 the spec that lets you go to sas no that's still u.2. yeah these are all u.2 uh the in this config it's a little weird so it's we said 88 lanes you'd actually need more than 88 to get all 24 slots going full speed so the first two slots just i guess for shear hey this is bonkers enough with 88 lanes we don't need more so the first eight physical are sas and sata but not the first two slots or everything but the first two slots will do nvme okay right that makes sense because you usually will run a raid one for your os drive on a system like this right if you're not doing iscsi boot off the network right and you don't need your os drive necessarily sitting on nvme either right in many cases for server builds the sas slash sata is fine so you put a couple of sata disks on the first two slots and then your last 22 uh or if you wanted whatever that last 10 out of the first 12 you know you could do nvme or sas you know for those as well right it's just these three ribbons running up front and each of those three is a quad sata connector right so that gives you your first 12 like basically your first half of the bays can do sas sata and then 10 of those can do nvme yeah and that's not atypical for a chassis configuration like this yeah and it also gives you some full-size pcie slots so if you are running some types of accelerators yeah you do have you do have some left there's not many but there's there's a few left over here you can you know if you well the other concern is but maybe you weren't doing 
everything with store it locally to the cpus storage and not that much bandwidth out of the chassis so if you need a higher bandwidth out of the chassis at least you do have some more more lanes you know to be able to do that you're gonna have to add some other neck that's we can put a melanox hundred gig nick in there that's not a problem i have some of those right yeah you might need to be a little careful on like okay when you get into that bleeding edge you're probably not messing like you're pneuma optimizing your workloads right to the point where you're probably going to want to make sure you have possibly two nicks in there one nick on each pneuma even yeah just for to make sure you're not having a bunch of crosstalk between both sockets well yeah that's another that's another point to talk about interesting use cases for this kind of a workload you know even if you were going for maximum density on some sort of database cluster your four terabytes of persistent memory in your storage probably would make more sense spread across multiple chassis yeah because we've for the the amount of storage that we have and the persistence and uh the way the storage architecture is and a whole bunch of the other parameters here this is uh workloads that are critical that no information in memory is lost or possibly crash analysis you know you would use this kind of hardware to develop the next generation of hardware because if there is a crash the system is frozen in the crashed state and you could dump the memory through any kind of other slow interface or through it debug interface and see exactly what was in system memory at the time the crash occurred which is hugely valuable if you're into debugging oh yeah we're still very much in the whole chicken in the egg sort of state with respect to pima right so as far as people developing like all these technologies are still evolving especially software yeah to be able to take advantage of this um things have changed so 
like these are the blue dimms which is which signifies it's 200 series so there was a 100 series that was the original black man picked some of those up on ebay from time to time yeah uh and and those were you know the prior gen this is ice lake so you had to be cascade lake uh or possibly skylake depending sky skylake was used for like the really beta system builds super early super early for the for that original pmem uh but the 100 series you could do uh different modes you can also do different modes here but the 100 series you could do uh what's called memory mode which is it just gives you a huge pool of what looks like dram right but it's not all dram it's actually pmem plus dram is the cache right uh that's like the easiest way out that's just i don't want to do any optimization i just give me a whole bunch of memory and you're not even really benefiting from the persistent ability of the pmem you're just benefiting from it's you know dims that are up to 512 gig in capacity which is just bonkers right for for way cheaper than you would have to pay for a dram of that capacity right um but again not like not the most optimal thing as far as you know taking the most advantage of the system right if you're trying to do a database server where you did want to be able to reboot the system and still have the data there you reboot this in memory mode it's gonna wipe the dims next time it boots right so you don't want that you want the other mode which is called app direct and for app direct that's the one where you're up you're actually optimizing your software to take advantage of the dimms the dems show up as a separate device they're not added to the memory it's just a really fast you know really fast storage that just happens to be connected to the same bus as the dram yeah right and the level of software engineering that intel's already invested in that uh i mean we're talking thousands of man years of development in order to provide a you know a reasonably 
robust library set so the developers could take advantage of that and that's one of those mysql patches that i was mentioning is that i think some of those uh some of the patches that i've read about anyway depend on the app direct mode for storage and so they will use uh they will use that for in-flight data and mutexes and locks and some of the database machinery but all the other stuff still lives in memory so like caches and things that you know it wouldn't really hurt anything if it was lost all of that still lives in memory but again that's way crazy computer science and so you may be thinking it's this level investment is absolutely bananas is the performance gain from having some fast persistent memory really worth it and the answer is yeah yeah if with the software was designed for persistent memory it would be insanely way faster than it is today because it doesn't have to worry about losing data yeah and that's the catch is that the software still to some degree has to catch up yeah right there's still even in even with respect to dax and some other things or where dax is basically just making uh the file system on the persistent memory so it still is kind of our workaround yeah right because it's still giving you a standard like block accessible you know formattable kind of storage device right that's just almost like a flat file system kind of thing um for some of the performance testing i've been doing where i wasn't that worried about the persistence but i wanted the memory to be running in appdirect and just how fast can i throw ios at it and things like that turns out dax is actually really slow because dax is written to try to make sure that the persistence is guaranteed yeah and that you know you've done a right and the rights are atomic right and you're just so that will slow things down and even even with respect to dax there's still optimizations to be done there for that to improve right i've told you the ride has completed and it has yes yes 
as opposed to what a normal nvme ssds actually kind of lie to the system yeah well they cheat a good one cheats so like yeah samsung 963 which is a fabulous fabulous enterprise-grade um ssd it has a dram cache and so it will tell the operating system hey the write has completed when all of the bundle of data is in dram but that's a lie i mean you would you'd be correct if you're reasoning it out of your mind and saying wait a minute if the nvme loses power just the wrong moment the is going to lose the data well the 963 is designed with a whole bunch of power capacitors and so it's literally got a mini standby power system in it to ensure that it has enough power to ride out what was in dram in 99.999 percent of scenarios to the flash so when it says the ride is completed it's in fourth dimensionally it's not lying but if it loses power there's a special handler in there for that that's the level of crazy computer engineering yeah that we're at and this is a whole other layer on top of that yeah but like client you know nvme nand no they don't have that it's still it still tells you it's done just if the data has made it onto the device yeah it's either sitting in ram or a lot of those especially with dry unless ssds now like dramas isn't actually there's no ram like there's some sram sitting in the controller and it's it's got a little bit of ram right but for the most part that data that came in basically just passed right through and went right to the to the nand chips themselves and even though it's not committed to the nand it's sitting it in like the input buffers yeah on the nand chips right it still takes some amount of time for the rights to actually complete and consumer drives if they don't have power loss protection capacitors on them or even in some cases if they do which almost all of them don't right uh yes it told the system that it was done it wasn't actually done so it'll lead to corruption when you reboot it's like yeah yeah it won't necessarily 
mean the drive is bricked or that you've lost everything because you know if it's an ntfs file system it's journaling it's there's there's actual layers of protection there that should protect you but if you were in the middle of saving a file and the file was reasonably large and then you know in that moment power was gone that file was probably corrupt ntfs was designed for fossil for devices that lie about that so yeah again it's kind of like a you know drives have lied about whether or not they've completed the right since spinning rust days so yeah so yeah that journaling part of ntfs is at least to make sure that you know protected from uh if it happens to be in the middle of updating the table that tells you where all the stuff is that's really bad for to lose right so at least that's the thing that's journaled but it's not journaling every single like right of data yeah to the drive it's just journaling the really important stuff to make sure your partition doesn't go away and things like that or you're you know can't find your files anymore that would be pretty bad and we still can't 100 trust that right yeah it really is completely crazy safety tips is getting ups and even that because you know we it's even with you know even with the crowd here yeah crash is pretty much the same thing yeah because if a crash occurs and the software the crash is bad enough the software doesn't have a chance to ride it out yeah if you get a blue screen there's stuff in the blue screen handler to try to flush whatever is in the buffers to disk but if things break hard enough it won't be able to do that uh sometimes uh pci express you know in this intrepid journey from pci express four and five foreshadowing uh you know we've created situations for consumers and others where they're getting a lot of pcie bus errors trying to use certain devices or even with uh you know our pcie adapters or i'm trying to get p5800x to run through sketchy from china pci express adapters yeah i 
own a few i own a few of these adapters these are the good ones these ones actually will work they'll link it gen 4 to drive you can get into obscure situations where you know the file system driver was not able to flush the buffer because of pci express bus errors yeah and so um the data that made it to the drive that the drive wrote was mangled somehow and then that leads to corruption and problems and all sorts of other things but when you have something that is persistent all of that machinery can go away and so all of the overhead and you know it becomes the next most low-hanging fruit that you completely re-engineer that the way the computer works with persistent memory as the next way to get the next bump in performance and that is a level of insanity yeah there's there's a lot of especially on the client side if you have a regular regular old windows pc like your gaming pc or whatnot and you have a drive that's flaky that's starting to like hang on rights and stuff like that that thing will go for a few minutes with the data just sitting in the buffer yeah yeah like and you'll you'll know the telltale sign is eventually the mouse will will freeze on the system or just other weird things will start happening right but it'll go minutes yeah where that was data that you thought was written to the disk a minute ago nope it's still sitting in the memory it hasn't even made it to the device yet because of bus errors or because of some other you know flakiness to the device level yeah yeah and this is also a kind of engineering where it's not enough that you have to make the device no one is going to adopt a device immediately because the entire rest of the ecosystem doesn't exist yeah so it's a and that's where we are with computers is like you have to invest in the software side and the engineering and the hardware side and then putting everything together into a product that is accessible to pleb to your developers because if you're you know an ivory tower tier 
developer you're already working for google or facebook or amazon or microsoft or somebody and you're probably working on special sauce like this to drive azure or you're dictating some of the hardware here and i'm sure that the intel is working really closely with huge companies to build stuff like this because this also gives them a competitive advantage if you are a day zero customer for stuff like this you can have a 10x performance improvement from this technology that's why it's exciting yeah it's cool stuff and folks like wendell and myself are the ones that are in the really bleeding edge where we run into the weird errors and try to fix them before it finally gets to the you know somebody has to put the system in the crazy config right and actually beat up on it to figure out what shakes out my role i assure you is janitorial in nature i don't know give yourself some credit man you and i are working on some weird stuff together there you know do you see this weird bug no yeah yeah i got this weird bug yeah i definitely run into some weirdness it's like why is this happening is this happening to me and then it's like you try to chase it down and it's like oh this is an intermittent bug and i just i don't i want to just it wendell you know i just happen to be you know moving 100 gigabytes per second worth of worth of bandwidth around between two parts the system oh and then this weird thing happens oh oh because people always do you know dozens or hundreds of gigabytes per second of bandwidth another crazy thing about this is that it's basically a peripheral more than a memory device but yeah you had to use the memory bus for the peripheral because the pcie bus was too slow yeah or there wasn't enough of it right like just think of how many sheer lanes you know that's 16 ssds we need another interface yeah yeah yeah and the other bunkers thing is just the ice lake with the eight you know eight memory channels per socket yeah yeah well you know uh there's a 
whole other conversation to be had about like memory coherency across processes and the machinery that's happening under the hood that programmers and even operating systems designers don't really know all of the black magic that is going into making sure when you write something to one place in memory that thing that was might have been cached somewhere else so did you let everybody know that had that cache that that's different now oh yeah yeah and there's so much extra logic that has to be in just a system like this where you know even if you're doing something where you think it's just on one socket yeah it's not it's it's behind the scenes it's actually duplicating some of some of that memory content even over to the other socket just to make sure that in case one of those threads happens to go over here to the other socket you want the data to be at least local and not have to come across you know later if the cache is desync it can no longer be said to be coherent yeah yeah which one do you trust and it's like i'll throw it all out go back to main memory and it's like yeah there's so many layers of extra stuff like it's it's one of those just infinite rabbit holes of once you start to dig into it and you realize just how many how many uh how many moments in in the history where you had an engineer somewhere that went oh yeah we should definitely make an extra copy over here for this yeah you know and then you realize that there's so many extra layers of machinery going on just to make it a seamless experience for someone who doesn't know how to tune around all that stuff someone who's just hey i just want to run a program yeah right whereas you know something this bonkers as a system like the perfect configuration is more akin to treating it as if it was two separate systems yeah right where you you know you make sure you're very careful about what things you put where and which lanes you're using for what to the point where like the the perfect optimum 
config is literally you could just like cut this all this in half and just yeah you know it would just be two end of individual systems where but then the bandwidth between those two systems is not as much right as the bandwidth from socket to socket right which is another different interface that we haven't even talked about yes yeah yeah so i mean 16 dimms 256 gigabytes per dimm two sockets 38 cores per socket 88 pci express lanes and a crapload of nvme drives across the front this is yeah you know in 10 or 20 years this is still not going to be a commodity server this would still be a high-end server in a decade yeah this isn't one of those like you know dell whatever you find on ebay for your for your home lab yeah no i don't think you're gonna these look nice oh my fortune 500 ordered 24 racks of these it's like no goldman sachs ordered about four of them all right i'm wendell i'm alan and we're gonna we're gonna go play with the server now but i thought this would be a fun chat to share you know gotta show off quarter of a million dollar server come off the crazy toys this is a crazy you know i'm crazy storage guy i gotta bring the crazy storage thing well hey i'm always down for uh more toys to play with for at least a little while you gotta take it back with you but uh you know unfortunately sorry it's a little bit of pre-release hardware here too so but hey speed gotta go fast uh next up we've got a video on pci express 5 on alder lake yeah more crazy storage stuff yeah i think it might be the first look at pci express 5 because ces was canceled that's true well it wasn't cancelled but it basically was ces didn't cancel everybody cancels ces all right let's get to the next video because everyone everyone's already clicked away looking for the next video youi have on my desk a quarter of a million dollar server which of course gets an introduction before mr allen from intel you brought a very very expensive server uh i did i figured uh you know the whole 
ces thing was a little bit of a bust this year we're having many ces at level one yeah i'll just i'll just bring bring the ces to wendell wendell doesn't have to come to the ces this was literally gonna be one of the show floor demos with the four terabytes of persistent memory i yeah that's not even that's not even the crazy part 88 pci express lanes to the front yeah there's no switch they're all wired there's no switches in here they're all just hbas that are one-to-one just re-timer chips and so like it's literally this is as i don't know i don't even know the right way to describe it it's just bonkers it's like every pot almost every possible pcie lane from every possible connection point on this board like there's i don't know there's these slim size connectors that go into like dual four lanes into a single eight lane slim sass just to make these happen and then there's four of those so there's like 32 lanes just from the front and that's what that's what would normally go to like the front panel nvme of a standard build do you know how hard it is to get those cables it's impossible you can't even look these up i've tried looking up some of these part numbers yeah you can't even order them right now and then so it's like well that's not enough okay well then there's you know there's we'll just add these three retimer cards so that's like another 48 lanes because each retimer has two of the slim sas connectors and then those are all going up there and then that's not enough that only adds up to 80. 
So for a while I couldn't figure it out. There are eight more lanes; where are they? I'm doing the math and looking, and no: there's one snuck in here in the back corner (we'll get some B-roll in there), another SlimSAS that just happens to be off the board, routed to another spot over there.

So this is a Supermicro chassis, designed for very, very specialty applications. You are not going to run to your CIO and say, hey, we need to buy this. This is a server that requires different computer science, pretty much. Especially if you're doing the PMem config. Now, the PMem config could apply to you without the bonkers front-panel stuff; this is just sort of everything all in one chassis.

Well, the thing is, all the software we have now was not designed for persistent memory, and what our viewers may not really be clued into is that, because memory is not persistent, we do a lot of herculean things in software, at least in important software, to make sure we don't lose data. So this is the perfect system, you're thinking, for running a database, and you're looking at Oracle as pretty much your only option for software that's actually validated for this kind of configuration. There are patches for PostgreSQL, don't get me wrong, but you run into scalability problems with databases, because databases assume that they're going to lose whatever is in memory. Microsoft SQL Server, Oracle, and a couple of other commercial database systems do have a provision for persistent memory, and it completely changes the performance landscape. If the database server can count on the memory not losing information, we're talking orders of magnitude of speedup in its operation, and that's why somebody would pay a quarter of a million dollars for this much persistent memory.

Right, and just at the persistent-memory level: each one of these sticks... you know your crazy Optane drives, like the P5800X? You've got a couple of them, and that's 3.6 terabytes. On those devices your latencies are lower than 10 microseconds, which is impressive, and they actually dip down into the sixes and sevens depending on whether you're in Linux and running good workloads that can take advantage of it. The PMem sticks are down to like 0.1 to 0.3.

A lot more bandwidth, in other words. Well, that's for latency, but yes: you can get relatively high bandwidth out of very small requests just peppering the device, which is handy. And if you do hit it in a straight line, each one of those DIMMs will do over 8 gigabytes per second, and there are 16 of them. So just in raw bandwidth... you would think the bandwidth would be even higher because it's plugged into a DIMM slot, but you're limited at the device level, by how much bandwidth you can actually get from the Optane chips on the part itself.

And this platform is so bleeding edge that if you wanted to do a mixed configuration, mixed persistent memory and mixed Optane, that's not necessarily going to work either; you're going to have to consult an expert to help configure the system and make sure you're not specifying an invalid configuration. Right. So this is 200-series PMem, the newer style, which is what you use with Ice Lake. And speaking of bonkers configs: I don't even think these CPUs are supposed to be in this chassis, because they're 8368Qs, which I think are supposed to be water-cooled, or they're meant to be.
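To make the earlier point concrete, about software doing "herculean things" because memory is volatile: the heart of it is that every commit has to be pushed through the storage stack and waited on before the database can acknowledge it. A minimal sketch of that durable-write step (the helper name and file path are made up for illustration):

```python
import os

def durable_append(path, record):
    """Append a record and do not return until it is on stable storage.

    This, plus journaling, write ordering, checksums, and recovery logic,
    is the kind of per-commit work a database must do when main memory is
    volatile. With persistent memory the data is already durable where it
    lands, and most of this machinery can go away.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)
        os.fsync(fd)  # block until the device reports the data is durable
    finally:
        os.close(fd)

durable_append("/tmp/wal-demo.log", b"commit 42\n")
```

The `os.fsync` call is the expensive part: it stalls the caller until the drive claims the data is safe, which is exactly the wait that persistent memory eliminates.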
But, you know, I'm bending the rules a little, because I can, I guess. 400 watts? No, they still only TDP at 270 watts apiece, but they're binned higher, like three bins higher, so they're 300 MHz higher on boost clocks. You sort of make up the difference, because these are only 38-core instead of 40. So you could have the crazy bonkers 40-core? Yeah, I have the 8380s, but I would much rather have these CPUs. The 8380s are great for a multi-core workload, and a lot of the hypervisor vendors... I mean, Intel has been working with the hypervisor people. I've got some of the Skylake and Cascade Lake cast-offs from Amazon, and all Amazon cares about is: disable the C-states, set the maximum all-core turbo, and forget about it. They literally don't want anything else, and that is what the 8380s are built to do. But these, with the turbo, are a much better situation when you have a bursty workload. You get a little more speed; actually even the all-core is a little bit better, though depending on your workload, if your all-core workload is just hitting the 270 watts, you're basically going to get the same overall speed regardless of which part it is. But for any workload that's easier on the cores, where you can stick within your TDP budget and still get to a higher turbo, these will go to a slightly higher turbo, which is handy.

I've also noticed that with the most recent Supermicro BIOS, on the Supermicro desktop board I have with the 8380s, they're basically unlocked; they will turbo until infinity. Oh, you got that working? Yeah, as far as tau goes, they'll just sit at 270 watts all the time, as long as you can keep them cool, which is normally not too hard, especially when you've got the Noctuas on
your crazy desktop system. It's a little harder in this chassis; it just has to be louder. We'll see that in a minute. It does keep them cool, around 60C; I don't think they get much past that even in this build, but the fans are just totally screaming. It's like a fire engine.

Well, we've also seen companies like Kioxia with their 32-terabyte NVMe drives; those would be perfect for the U.2 slots in the front of this case. U.3, the spec that lets you go to SAS? No, these are still U.2. In this config it's a little weird. We said 88 lanes, but you'd actually need more than 88 to get all 24 slots going full speed. So, I guess because this is bonkers enough with 88 lanes and we don't need more, the first eight physical slots are SAS and SATA, and everything but the first two slots will do NVMe. Okay, that makes sense, because you usually run a RAID 1 for your OS drive on a system like this, if you're not doing iSCSI boot off the network, and your OS drive doesn't necessarily need to sit on NVMe either; in many server builds SAS or SATA is fine. So you put a couple of SATA disks in the first two slots, and then for the rest, or the last 10 out of the first 12 if you wanted, you could do NVMe or SAS as well. It's just these three ribbons running up front, and each of those three is a quad SATA connector. That gives you your first 12, basically the first half of the bays, that can do SAS and SATA, and 10 of those can do NVMe. And that's not atypical for a chassis configuration like this.

It also gives you some full-size PCIe slots, so if you are running some types of accelerators, you do have some left over; not many, but a few. Well, the other concern is: maybe you weren't doing everything with storage local to the CPUs, and there's not that much bandwidth out of the chassis. If you need higher bandwidth out of the chassis, at least you do have some more lanes for that; you're just going to have to add some other NIC. We can put a Mellanox 100-gig NIC in there; that's not a problem, I have some of those. Right, but you might need to be a little careful: when you get into that bleeding edge, you're NUMA-optimizing your workloads to the point where you're probably going to want two NICs in there, one NIC on each NUMA node, just to make sure you're not having a bunch of crosstalk between the sockets.

Well, that's another point to talk about: interesting use cases for this kind of workload. Even if you were going for maximum density on some sort of database cluster, your four terabytes of persistent memory and your storage would probably make more sense spread across multiple chassis. Given the amount of storage we have, the persistence, the way the storage architecture is, and a whole bunch of the other parameters here, this is for workloads where it's critical that no information in memory is lost, or possibly crash analysis. You would use this kind of hardware to develop the next generation of hardware, because if there is a crash, the system is frozen in the crashed state, and you can dump the memory through some other slow interface, or through a debug interface, and see exactly what was in system memory at the time the crash occurred, which is hugely valuable if you're into debugging. Oh yeah. We're still very much in a chicken-and-egg state with respect to PMem. All of these technologies are still evolving, especially the software that would take advantage of this. Things have changed.
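As a sketch of the NUMA pinning just described, one NIC and one workload per socket: on Linux you can restrict a process to one socket's CPUs so its memory allocations stay node-local. The socket-to-CPU mapping below is a pure assumption for this 38-core-per-socket box; real numbering varies, so check `lscpu` or libnuma on the actual machine.

```python
import os

# Hypothetical mapping: assume socket 0 owns CPUs 0-37 on a dual-socket,
# 38-cores-per-socket system. Actual CPU numbering differs per platform.
SOCKET0 = set(range(0, 38))

def pin_to_socket(cpus):
    """Restrict this process to one socket so its memory stays node-local."""
    target = cpus & os.sched_getaffinity(0)  # only CPUs that exist on this machine
    if target:
        os.sched_setaffinity(0, target)      # Linux-only system call
    return os.sched_getaffinity(0)

print(pin_to_socket(SOCKET0))
```

A full setup would also pin the NIC's interrupt affinity to the same node; this only shows the process side of it.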
Like, these are the blue DIMMs, which signifies 200 series. There was a 100 series, the original black ones; you can pick some of those up on eBay from time to time. Those were the prior gen: this is Ice Lake, so for those you had to be on Cascade Lake, or possibly Skylake. Skylake was used for the really early beta system builds, super early, for that original PMem. With the 100 series you could do different modes, and you can do different modes here too. One is what's called Memory Mode, which just gives you a huge pool of what looks like DRAM, but it's not all DRAM: it's actually PMem, with the DRAM as a cache. That's the easiest way out: I don't want to do any optimization, just give me a whole bunch of memory. You're not even really benefiting from the persistence of the PMem; you're just benefiting from DIMMs that go up to 512 gigabytes in capacity, which is just bonkers, for way cheaper than you would pay for DRAM of that capacity. But again, it's not the most optimal thing as far as taking the most advantage of the system. If you're trying to run a database server where you want to be able to reboot the system and still have the data there, Memory Mode won't do it: reboot in Memory Mode and it wipes the DIMMs on the next boot. You want the other mode, called App Direct. With App Direct, you're actually optimizing your software to take advantage of the DIMMs. The DIMMs show up as a separate device; they're not added to system memory. It's just really fast storage that happens to be connected to the same bus as the DRAM.

Right, and the level of software engineering Intel has already invested in that... we're talking thousands of man-years of development to provide a reasonably robust library set so developers can take advantage of it. That's one of those MySQL patches I was mentioning: some of the patches I've read about depend on App Direct mode for storage, and they use it for in-flight data, mutexes, locks, and some of the database machinery, while all the other stuff, caches and things that wouldn't really hurt anything if lost, still lives in regular memory. But again, that's way crazy computer science. So you may be thinking: this level of investment is absolutely bananas; is the performance gain from having some fast persistent memory really worth it? And the answer is yes. If the software were designed for persistent memory, it would be insanely faster than it is today, because it wouldn't have to worry about losing data.

And that's the catch: the software still, to some degree, has to catch up. Even with respect to DAX and some other things... DAX is basically just putting a file system on the persistent memory, so it's still kind of a workaround, because it's still giving you a standard block-accessible, formattable kind of storage device, almost like a flat file system. For some of the performance testing I've been doing, where I wasn't worried about persistence but wanted the memory running in App Direct, just to see how fast I could throw IOs at it, it turns out DAX is actually really slow, because DAX is written to make sure that persistence is guaranteed: that you've done a write, and that the writes are atomic. That slows things down, and even within DAX there are still optimizations to be done to improve that. I've told you the write has completed, and it has. Yes.
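The App Direct access pattern being described, loads and stores through a mapping plus an explicit "is it really persistent yet?" flush, can be sketched with ordinary `mmap`. This uses a regular file in `/tmp` as a stand-in; a real setup would map a file on a DAX mount (for example a filesystem created on a `/dev/pmem` device and mounted with the `dax` option), and high-performance code would use a library like PMDK instead of `msync`:

```python
import mmap, os

# Stand-in path; a real App Direct setup would use a file on a DAX mount.
path = "/tmp/pmem-standin.bin"
size = 4096

fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
os.ftruncate(fd, size)
with mmap.mmap(fd, size) as m:
    m[0:5] = b"hello"  # plain stores into the mapping, no read()/write() syscalls
    m.flush()          # msync: the persistence guarantee, and the cost being discussed
os.close(fd)

print(open(path, "rb").read(5))  # b'hello'
```

Flushing after every small store, the way a durability-guaranteeing layer must, is exactly why "DAX is actually really slow" in the throughput tests mentioned above.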
As opposed to normal NVMe SSDs, which actually kind of lie to the system. Well, they cheat; a good one cheats. Take the Samsung 963, a fabulous, fabulous enterprise-grade SSD. It has a DRAM cache, and it will tell the operating system "hey, the write has completed" when all of the bundle of data is in DRAM, but that's a lie. You'd be correct, reasoning it out in your mind, to say: wait a minute, if the NVMe loses power at just the wrong moment, it's going to lose that data. Well, the 963 is designed with a whole bunch of power capacitors; it's literally got a mini standby power system in it, to ensure it has enough power to write out what was in DRAM to the flash in 99.999 percent of scenarios. So when it says the write is completed, fourth-dimensionally it's not lying; if it loses power, there's a special handler in there for that. That's the level of crazy computer engineering we're at, and this is a whole other layer on top of that.

But client NVMe NAND? No, they don't have that. It still tells you it's done before the data has made it onto the device. It's either sitting in RAM, or, in a lot of cases, especially with DRAM-less SSDs now, the DRAM isn't actually there; there's just some SRAM sitting in the controller, a little bit of RAM. For the most part the data that came in basically passed right through and went to the NAND chips themselves, but even though it's not committed to the NAND, it's sitting in the input buffers on the NAND chips, and it still takes some amount of time for the writes to actually complete. With consumer drives, if they don't have power-loss-protection capacitors, which almost all of them don't, or even in some cases if they do, yes, the drive told the system it was done when it wasn't actually done, so that can lead to corruption when you reboot.

It won't necessarily mean the drive is bricked or that you've lost everything, because if it's an NTFS file system, it's journaling; there are actual layers of protection that should protect you. But if you were in the middle of saving a reasonably large file and, in that moment, the power was gone, that file is probably corrupt. NTFS was designed for devices that lie about that; drives have lied about whether they've completed a write since spinning-rust days. So the journaling part of NTFS is at least there to protect you if the drive happens to be in the middle of updating the table that tells you where all the stuff is, which is really bad to lose. At least that is journaled. But it's not journaling every single write of data to the drive; it's just journaling the really important stuff, to make sure your partition doesn't go away, or that you can still find your files, which would be pretty bad otherwise. And we still can't 100 percent trust that.

It really is crazy. The safety tip is getting a UPS, and even that doesn't fully save you, because a crash is pretty much the same thing: if a crash occurs and it's bad enough, the software doesn't get a chance to write things out. If you get a blue screen, there's stuff in the blue-screen handler that tries to flush whatever is in the buffers to disk, but if things break hard enough, it won't be able to do that. And sometimes PCI Express, in this intrepid journey from PCI Express 4 to 5 (foreshadowing), has created situations for consumers and others where they're getting a lot of PCIe bus errors trying to use certain devices, or even with adapters; I'm trying to get a P5800X to run through sketchy-from-China PCI Express adapters.
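The "a good one cheats" distinction above can be captured in a toy model. This is purely illustrative, not any real drive's firmware: a volatile cache acks writes immediately, and only power-loss capacitors (or an explicit flush) make that ack honest.

```python
class CachedDrive:
    """Toy model of a drive with a volatile write cache.

    Illustrates why "write completed" is not the same as "durable".
    """

    def __init__(self, has_capacitors=False):
        self.cache = []                 # volatile DRAM/SRAM buffer
        self.nand = []                  # what actually survives power loss
        self.has_capacitors = has_capacitors

    def write(self, block):
        self.cache.append(block)
        return "complete"               # acked before the data is durable

    def flush(self):
        self.nand.extend(self.cache)
        self.cache.clear()

    def power_loss(self):
        if self.has_capacitors:
            self.flush()                # enterprise drive: rides out the cache
        else:
            self.cache.clear()          # consumer drive: unflushed data is gone

consumer = CachedDrive()
consumer.write(b"important")
consumer.power_loss()
print(consumer.nand)                    # [] even though the write was "complete"

enterprise = CachedDrive(has_capacitors=True)
enterprise.write(b"important")
enterprise.power_loss()
print(enterprise.nand)                  # [b'important']
```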
Yeah, I own a few of these adapters. These are the good ones; these actually work, and they'll link at gen 4 to the drive. You can get into obscure situations where the file-system driver wasn't able to flush the buffer because of PCI Express bus errors, so the data that made it to the drive was mangled somehow, and that leads to corruption and problems and all sorts of other things. But when you have something that is persistent, all of that machinery can go away, and all of that overhead becomes the next low-hanging fruit: you completely re-engineer the way the computer works around persistent memory as the way to get the next bump in performance. And that is a level of insanity.

There's a lot of that, especially on the client side. If you have a regular old Windows PC, your gaming PC or whatnot, and you have a drive that's flaky, that's starting to hang on writes and things like that, it will go for a few minutes with the data just sitting in the buffer. The telltale sign is that eventually the mouse will freeze, or other weird things will start happening, but it'll go for minutes where data you thought was written to the disk a minute ago is still sitting in memory; it hasn't even made it to the device yet, because of bus errors or some other flakiness at the device level.

And this is also the kind of engineering where it's not enough just to make the device. No one is going to adopt a device immediately, because the entire rest of the ecosystem doesn't exist yet. That's where we are with computers: you have to invest in the software side, and the engineering, and the hardware side, and then put everything together into a product that is accessible to your developers. Because if you're an ivory-tower-tier developer, you're already working for Google or Facebook or Amazon or Microsoft or somebody, and you're probably working on special sauce like this to drive Azure, or you're dictating some of the hardware here. And I'm sure Intel is working really closely with huge companies to build stuff like this, because it also gives them a competitive advantage: if you are a day-zero customer for stuff like this, you can get a 10x performance improvement from this technology. That's why it's exciting.

Yeah, it's cool stuff, and folks like Wendell and myself are the ones out on the real bleeding edge, where we run into the weird errors and try to fix them before it finally gets to everyone else. Somebody has to put the system in the crazy config and actually beat up on it to figure out what shakes out. My role, I assure you, is janitorial in nature. I don't know, give yourself some credit, man; you and I are working on some weird stuff together. "Do you see this weird bug?" "Yeah, I got this weird bug." I definitely run into some weirdness. It's like, why is this happening? Is this happening to me? And then you try to chase it down, and it's like, oh, this is an intermittent bug. Wendell, I just happen to be moving 100 gigabytes per second worth of bandwidth around between two parts of the system, and then this weird thing happens. Oh, right, because people always do dozens or hundreds of gigabytes per second of bandwidth.

Another crazy thing about this is that it's basically more a peripheral than a memory device, but you had to use the memory bus for the peripheral, because the PCIe bus was too slow, or there wasn't enough of it. Just think of how many sheer lanes that is: that's 16 SSDs. We need another interface. And the other bonkers thing is just Ice Lake with the eight memory channels per socket.
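The "that's 16 SSDs" remark checks out as rough arithmetic, using the approximate numbers from earlier in the conversation (over 8 GB/s per PMem DIMM sequential, roughly 2 GB/s per PCIe 4.0 lane):

```python
# Rough equivalence between the PMem memory bus and PCIe SSDs.
per_dimm_gbps = 8                  # each PMem DIMM: over 8 GB/s in a straight line
dimms = 16
aggregate = per_dimm_gbps * dimms
print(aggregate)                   # 128 GB/s across the memory bus

pcie4_lane_gbps = 2                # ~2 GB/s usable per PCIe 4.0 lane, approximate
lanes_equiv = aggregate // pcie4_lane_gbps
print(lanes_equiv)                 # 64 lanes
print(lanes_equiv // 4)            # 16, i.e. sixteen x4 NVMe SSDs' worth
```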
Well, there's a whole other conversation to be had about memory coherency across processors and the machinery happening under the hood. Programmers, and even operating-system designers, don't really know all of the black magic that goes into making sure that when you write something to one place in memory... that thing might have been cached somewhere else, so did you let everybody who had that cache know that it's different now? Oh yeah. There's so much extra logic that has to be in a system like this. Even if you're doing something you think is just on one socket, it's not; behind the scenes it's actually duplicating some of that memory content over to the other socket, just to make sure that, in case one of those threads happens to go over to the other socket, the data is at least local and doesn't have to come across later. If the caches desync, they can no longer be said to be coherent. Which one do you trust? And it's like, throw it all out, go back to main memory. There are so many layers of extra stuff. It's one of those infinite rabbit holes: once you start to dig into it, you realize just how many moments in history there were where an engineer somewhere went, "oh yeah, we should definitely make an extra copy over here for this," and then you realize there are so many extra layers of machinery going on just to make it a seamless experience for someone who doesn't know how to tune around all that stuff, someone who just wants to run a program. Whereas for something as bonkers as this system, the perfect configuration is more akin to treating it as if it were two separate systems, where you're very careful about what things you put where, and which lanes you're using for what, to the point where the perfect, optimum config... you could literally just cut this all in half, and it would just be two individual systems, except the bandwidth between those two systems wouldn't be as much as the bandwidth from socket to socket, which is another interface we haven't even talked about.

So, I mean: 16 DIMMs, 256 gigabytes per DIMM, two sockets, 38 cores per socket, 88 PCI Express lanes, and a crapload of NVMe drives across the front. In 10 or 20 years this is still not going to be a commodity server; this would still be a high-end server in a decade. This isn't one of those Dell-whatever boxes you find on eBay for your home lab. No, I don't think so. These look nice. "My Fortune 500 ordered 24 racks of these." It's like, no, Goldman Sachs ordered about four of them.

All right: I'm Wendell. I'm Allyn. And we're going to go play with the server now, but I thought this would be a fun chat to share. Gotta show off the quarter-of-a-million-dollar server, come on. I'm the crazy storage guy; I've got to bring the crazy storage thing. Well, hey, I'm always down for more toys to play with, at least for a little while; you've got to take it back with you, unfortunately. Sorry, it's a little bit of pre-release hardware here too. But hey, speed: gotta go fast. Next up we've got a video on PCI Express 5 on Alder Lake, more crazy storage stuff. I think it might be the first look at PCI Express 5, because CES was canceled. Well, it wasn't canceled, but it basically was; CES didn't cancel everybody, everybody canceled CES. All right, let's get to the next video, because everyone's already clicked away looking for the next video.