CPU Pipeline - Computerphile

The Complexity of CPU Pipelining: Understanding the Process

Modern CPUs are incredibly powerful devices, executing billions of instructions per second. Behind this power lies a fascinating technique known as pipelining, which allows these devices to process information efficiently. In essence, pipelining breaks the work of executing an instruction into smaller, more manageable steps, and then overlaps those steps across many instructions, assembly-line style.

The Pipeline Process

To understand how pipelining works, let's start with a simple analogy. Imagine you're cooking breakfast and have three dishes to prepare: scrambled eggs, toast, and bacon. Normally you would do these tasks one after another, but with a pipeline you can think of the work as separate stages operating together. Stage 1: fetching the ingredients (e.g., cracking eggs, grabbing bread). Stage 2: cooking them (e.g., scrambling eggs, toasting bread). Stage 3: assembling and serving (e.g., adding bacon, plating the dish). Each stage is like a separate robot that works independently, and while one dish is being cooked, the ingredients for the next can already be fetched — together they create a smooth, efficient production line.

In a CPU pipeline, each stage performs one step in processing an instruction. The fetch stage retrieves the next instruction from memory; the decode stage works out what the instruction means and which operands it needs; the execute stage performs the actual computation; and a final write-back stage stores the result, whether into a register or back out to memory.
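The stage-by-stage flow can be sketched as a toy simulator. This is a minimal illustration, not any real CPU's design: the instruction names and the three-stage split are made up, and each snapshot shows which instruction occupies each stage on a given clock tick.

```python
# Toy three-stage pipeline: fetch -> decode -> execute.
# Instruction names are illustrative, not a real instruction set.

PROGRAM = ["LOAD", "ADD", "STORE", "INC", "CMP"]

def run_pipeline(program, stage_names=("fetch", "decode", "execute")):
    """Return one snapshot per clock tick, showing which instruction
    each stage is holding (None while filling or draining)."""
    n_stages = len(stage_names)
    total_ticks = len(program) + n_stages - 1  # fill, then drain
    trace = []
    for tick in range(total_ticks):
        snapshot = {}
        for depth, name in enumerate(stage_names):
            i = tick - depth  # instruction index at this stage depth
            snapshot[name] = program[i] if 0 <= i < len(program) else None
        trace.append(snapshot)
    return trace

trace = run_pipeline(PROGRAM)
for tick, snapshot in enumerate(trace, start=1):
    print(tick, snapshot)
```

Notice that from tick 3 onward every stage is busy: while LOAD is executing, ADD is being decoded and STORE is being fetched, exactly like the three robots working in parallel.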

When we talk about pipelining, we're really talking about how these stages work together to process information. The key is that each stage can operate independently of the others, but they all need to be coordinated so that the output of one stage becomes the input of the next. This coordination is crucial because it allows several instructions to be in flight at once: each instruction still takes several clock cycles to pass through the pipeline, but once the pipeline is full, a finished instruction rolls off the end every cycle.
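The throughput gain is easy to quantify with a back-of-the-envelope sketch (assuming an idealised pipeline with no stalls, one stage per tick):

```python
# Idealised cycle counts: without pipelining each instruction runs
# start-to-finish before the next begins; with pipelining the pipe
# fills once and then one instruction completes every tick.

def cycles_needed(n_instructions, stages, pipelined):
    if pipelined:
        return stages + (n_instructions - 1)  # fill once, then 1/tick
    return stages * n_instructions            # fully serial execution

print(cycles_needed(100, 3, pipelined=False))  # 300 cycles
print(cycles_needed(100, 3, pipelined=True))   # 102 cycles
```

For long runs of instructions the pipelined version approaches a threefold speedup with a three-stage pipe, even though any single instruction's latency is unchanged.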

Modern CPUs have many more stages than the three in our simple breakfast analogy — often well over a dozen. These stages can still be thought of as separate "robots" that work together to process information: the fetching robot retrieves instructions from memory, the decode robot works out what they mean, the execute robot performs the computation, and so on. Each stage has its own problems to solve, but they all need to work together seamlessly.

One important aspect of pipelining is keeping the pipeline filled with useful work at all times. An idle pipeline is like a firehose that's not putting out any water — you're wasting resources and time. To avoid this, the fetch stage must keep feeding instructions in fast enough that the later stages never sit empty.

Another challenge in pipelining is accessing the "pigeon holes" where data is stored. Imagine having a fetcher robot that's constantly retrieving data from memory, while another robot is trying to write back the results to memory. In this scenario, we might encounter problems with contention between the two robots, which could slow down the entire process.

To mitigate these issues, modern CPUs use caching. The instruction cache keeps frequently used instructions close to the fetch stage, letting the fetcher robot access them quickly without going all the way out to main memory. A separate data cache does the same for data, so the fetching and executing robots rarely need to contend with each other for the same memory.

Finally, there's an important consideration when it comes to pipeline hazards — situations where an instruction depends on the result of an earlier instruction that hasn't finished yet. In these cases, the pipeline must stall, inserting a bubble while the earlier instruction completes before the dependent one can proceed. Detecting and handling these data hazards adds real circuit complexity, but it's essential for ensuring that our CPU pipelines operate correctly and efficiently.

The Power of Modern CPUs

Pipelining is also one of the stepping stones toward executing more than one instruction per cycle. A notable consumer example is the Sega Dreamcast (1998), which featured a Hitachi SH-4 processor with a dual-issue pipeline. This allowed the CPU to execute two instructions simultaneously — but only if they were carefully paired together.

To make this work, developers had to carefully select pairs of instructions that didn't interfere with each other and would run smoothly in tandem. This was a challenging task, but it ultimately led to improved performance and efficiency. Today, modern CPUs continue to evolve and improve, with new technologies and innovations being developed all the time.
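A crude version of that pairing check can be expressed in code. This is an illustrative sketch, not the SH-4's actual pairing rules (real rules also depend on which execution units each instruction needs); instructions are modelled here simply as a destination register plus a set of source registers.

```python
# Illustrative dual-issue check: two adjacent instructions can run
# together only if neither reads or writes the other's registers.

def can_pair(a, b):
    """Each instruction: (dest_reg, set_of_source_regs)."""
    a_dest, a_src = a
    b_dest, b_src = b
    return (
        a_dest not in b_src and  # b doesn't read a's result
        b_dest not in a_src and  # a doesn't read b's result
        a_dest != b_dest         # they don't write the same register
    )

add_x = ("x", {"x"})   # x = x + 10
sub_y = ("y", {"y"})   # y = y - 1
use_x = ("z", {"x"})   # z = x

print(can_pair(add_x, sub_y))  # True: independent, can dual-issue
print(can_pair(add_x, use_x))  # False: second reads the first's result
```

Developers writing for such a CPU spent their time arranging code so that as many adjacent pairs as possible passed exactly this kind of test.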

In conclusion, CPU pipelining is a fascinating process that allows modern CPUs to process information efficiently and effectively. By breaking instruction execution into smaller stages and overlapping those stages across many instructions, we achieve incredible speed and performance. It's not without its challenges — from contention over memory access to managing pipeline hazards — but with careful design and optimization, these challenges can be overcome to create some of the most powerful devices on the planet.

"WEBVTTKind: captionsLanguage: enlast time we talked about addition didn't we was it addition last time we did some some work we I kind of slowly snuck in the fact that my madeup CPU was in fact a real CPU the one that I grew up with and I showed how we could use it to do addition of multiple u values to show that arbitrarily big numbers could be added up by this relatively simple system what I'd like to show today is how CPUs have sort of advanced a little bit beyond that very simplistic View and this is how one of the reasons why they're so fast these days and I think there's going to be a whole number of like little technologies that Each of which feeds into the next and if I explain one without the others it' be like why on Earth would it need to do that kind of thing so we're going to start like the basics so that where we start is uh if you remember our story so far we have a little robot in my mind that sits inside the computer and does all of the work and the work is relatively straightforward pieces like read a number from from memory add it to another number store it back into memory compare numbers those kinds of things and um the way that we realized that that we could explain to the robot what it needs to do is that we could write down numbers that correspond to the kinds of things that it might want to do so like one might be add two might be subtract three might be multiply those kinds of things and then those very numbers themselves can be put into the memory alongside the data you're manipulating the robot's kind of job is we've got our pigeon holes that are the memory and we've got like number one over here which means add so the robot goes first of all looks like what is what am I supposed to do this step of the recipe one looks up in his big chart and says oh one means add okay so now I need to do some adding together and then he goes over to his Abacus and he does the addition or whatever and then he's got the result back and he looks in the 
next slot and it says you know two is store he goes okay well then I guess I should take that number and I should put it somewhere else in the pigeon holes and that's how our whole computer program works now so there's kind of three steps that the robot goes through for every single step of the recipe the first thing is he has to go and look up what is the next step in the recipe read out the number number or numbers from the the the memory cells that correspond to the instruction that he's going to be doing then he has to make sense of them in somehow you know maybe you know he has to look up in his chart that one means add and two means whatever and maybe also if it says add he has to go and get another number which is like well what is the number I'm going to be adding to right and in the third step he actually does the addition he executes I don't know why he's a he I I was actually thinking about this I was thinking should the robot be maybe non-binary but then it seems really I onic to call it a binary robot a non-binary but so like they he she the robot it it let's be it goes and does the um uh the actual work with the with the Abacus and if you can imagine on the circuit that actually corresponds to the robot's mind those parts of the chip are very separate the bit that goes out to memory and pulls numbers out of memory the bit that has a big table that says this is what this means and the bit that has all the Abacus Parts as it were which is is actually we call the ALU the arithmetic and logic unit those are very separate parts of the chip which means that each phase there's a whole part of the chip that's not being used and that seems incredibly wasteful there's only so much we can do in every tick of the clock so behind the scenes there's a sort of like a metronome going on that says you know now it's time to do the next piece of work and you know modern computers that metronome runs at gigahertz back in the day that I'm talking about here we're talking 
about megaherz thousands of times a second but it still means that each of those steps is one tick of the clock so fetch decode that is work out what it is and then execute and then fetch the next thing decode execute most of those steps don't overlap with each other so one of the classic sort of solutions to a problem where you're going to be doing a sequence of work with with repetitive steps is to build a production line much like you know the Model T Ford where you know the first thing you do is you assemble the chassis and then someone else puts the engine block in and then someone else puts the wheels on and then someone else polishes the car and then the car comes off the end of the the railway Railway the the the conveyor belt whatever it is um we're going to do the same thing inside our chip so the way that I like to think of this is that we're going to break fetching decoding and executing into three steps on along a sort of conveyor belt inside the robot I guess and on the far left hand side again with the the pen that's running out now here is our robot at this end of the pipeline uh conveyor belt whatever we're going to call it there is his little is he's got I think of it like a spade right he's he's actually going to shovel on the the bites from memory he's just reading the next number from memory one after another and he's he's putting them onto this this conveyor belt our numbers you know one and then 30 and then 50 he's just picking them up and putting them onto the conveyor belt one after another with no real thought no logical thought behind it somewhere down the conveyor belt is the decoding robot and the decoding robot has in my my idea he's got a sort of magnifying glass which is about the only limit of my drawing here and he's got the book of all of what do these op codes mean what do these numbers actually mean you know is this an ad or one means an ad and the next two numbers mean well those are the numbers I'm going to add together and so 
he waits until he's got enough information on the pipeline on the conveyor B to be able to decode an entire instruction and then he writes down a new piece of on a card uh internal that just says you know like add 10 to address number 57 or whatever this sorry it's my too small to see but that's what he's doing he's going to add it onto a new piece of card that goes down the conveyor bout to the next person and the last person on the line in this instance here is our robot who has the Abacus he then picks up the card and uh it says oh add 10 to memory location 57 pulls the number out from memory location 57 adds 10 to it with the Abacus taking whatever time it is and you could imagine that might take longer for like multiplication or division or whatever and then he'll write it down on one of those you know pieces of paper that we've got we call them registers before you know we had like the a registered the X register whatever however many registers our CPU has and we haven't fundamentally changed anything about what was going on we just broken what was three pieces of work that one robot did into three pieces of work that three robots do one after another any one instruction takes three stages to get through the system he has to go through the fetching he has to go through the decoding and he has to go through the execution so it still takes three metronome clicks before that particular thing has finished but what you'll notice is that if we can keep it fed if this this robot on the far end can get Shovel bites in fast enough and this uh decoding robot can decode fast enough then every single metronome click this executing robot is doing useful work he gets to use the Abacus every single tick of the clock so we're getting much more use out of our CPU than we were before previously the Abacus was only being used every third instruction cycle so it's like wasted um and that means that although every any one instruction takes three Cycles to go through our system 
every cycle a new thing rolls off the end of the production line and that's fantastic and it makes it go a lot faster now there's a there's a sort of catch though as there always is with these things right it seems great we could stop the video here and be like Oh brilliant computers go three times faster if you break it into three bits but the eagle ID among you may have realized that uh this is all well and good good provided our robot here on the far left hand side knows what the next thing to do is always and it's a reasonable guess just to keep reading the next recipe instruction out every single tick of the clock right but we know that one of the things that the execute unit can do is to Branch to say actually I've made a decision and I'm going to start doing something different now like I compared the two numbers and uh actually we're going to go to the top of the loop now so if I draw out are a short version of the program that we had before it looks something like load from like address 100 say which this was the Fibonacci program which I realized now that I was doing wrong but we're going to stick with what I was doing before we're going to add the next number this x is just an offset to say which number of the Fibonacci sequence are we doing and then we're going to store at 102 comma X and then we're going to increment X and then we're going to compare X to say have we done a 100 of these numbers cuz after I need to stop for example and if x was not equal to 100 here we're going to Branch if not equal back to the top of this Loop up here so this means the sequence of bites so each of these is then corresponds to a some number of numbers in the the pigeon hole you know this may be one followed by 100 this may be two followed by 101 this may be three followed by 102 this may be seven to increment the X this may be eight and then 100 to say compare with one and then nine and then something that says back to the top of the loop so the first number there is 
the instruction number and the second one is the uh is is this this number that's part of the uh yeah the address in this particular instance the address of this of the thing that did in the sequence yeah okay exactly right yes yeah and there there's not an obvious way to tell that just looking at that sequence of numbers one 1002 101 which ones are instructions which ones are data which ones are numbers whatever like that that's partly what our decode robot has to do is make sense of this stream of numbers and say well that oh these three things correspond to one instruction um so it's not it's not terribly obvious you can't just stare at it and go oh there's a sequence numbers I know exactly what that is and that that that process is called uh when you when when a human does this is called disassembly you're actually disassembling the code back into the assembly human readable version over here I say human readable obviously you know it's not that human readable but it's more human readable than a bunch sequence of numbers quicker side question on that though assembler and assembly what's the difference between those two it's one of those things where I think Cally people use it wrongly all the time so assembly um is the name of the the the the text format and assembler is actually the name of the program that assembles the code but we somewhat interchangeably use them and I'm as guilty as the next person for saying assembly when I mean machine code as well or assembly assembler and instruction or whatever it's all a bit vague but you know the the sort of Distinction is that the the numbers are the machine code that's the thing the machine understands the assembly or you know these op codes if is is the assembly code that you assemble with the assembl to make machine code but good question and very very common we're there we're there we're there go for it yeah no absolutely so we've got this sequence of numbers and if we were to just uh these would be laid out in 
memory sequentially and so we'd be absolutely fine our robot who's feeding the the front of the pipeline would be fine right up until we get to this Branch where by the time the branch has made it through there's maybe some other instructions here what are we going to call this let's say let's say that the next instruction was return because it would be the end of the routine and then there was just any any not garbage afterwards actually rubbish afterwards by the time this Branch instruction has made it through the pipeline to the execute where the executing robot goes oh wait a second we need to go somewhere else now the pipeline is already filled with the decode robots already decoded this return which we don't want to do and the fetcher is already fetch probably the bites that correspond to whatever these things are immediately afterwards and so in the simplest form which is what most of the original CPUs would do um the execute robot kind of pulls the breake time horn you know The Flintstones thing right everyone stops um and the pipeline is restarted everything is thrown off of the pipeline everything is pushed into the trash into the rubbish and we start again so then from that Branch from the top of the loop we start the pipeline up now that means that we have to wait through recycles before any useful work gets done again because you know we started from scratch and that's terribly unfortunate but if we don't Branch very often then it's still a net win to do this the problem is we do Branch quite a lot is kind of the every third or fourth instruction in most um programs is a branch of some sort and so it's really really worth our while trying to keep this pipeline full of useful work rather than than discarding it flushing the pipeline um so there are several approaches to solving this conundrum and we'll we'll do a couple of them and then the one that we use nowadays I think is big enough to its own topic one thing we could do is we could say how about 
the next instruction after a branch gets executed anyway even though uh it may not be following the path of that Branch if the branch says go somewhere else we just say hey let's do that work anyway it's already fetched it all the way through the pipeline um we've we' we've decoded it it's to start executing what the heck let's execute it anyway now that means that the programmers have to know that that's happening we have to write our code specially so in the particular case of our Loop here I don't even know what you would do in our particular Loop here but you would try and find another piece of useful work that you needed to do anyway and you would put it immediately after the branch so you say Branch if not equal to Loop and then some other useful instruction here now again in our particular example here there isn't a useful thing so we'd actually end up putting a KN there there don't do anything here on purpose to say look it's better than flushing the pipeline but um don't do anything here um maybe you could be clever and you could move the the load from the top of the loop down into the bottom of the loop and then say okay that we load the next number ready and then instead of jump jumping all the way to the top of the loop we Branch to the second address of the loop so you can kind of move it around and this is where things like compilers that write the code on behalf of you they can do these things for you this little um instruction immediately after a branch that's executed anyway is called a delay slot and it's a very simplistic way of being able to continue doing useful work even when the pipeline would otherwise be flushed it just puts the burden on the poor programmer to write their code in a particular way to take advantage of it so that's one way you can do it um another way you can do it is and this is what the the folks at arm decided to do is that if you take a look at the majority of branches they with the exception of these Loop ones where we 
go back to the top often times you might want to do something along the lines of compare it with 10 if it's greater than 10 then set it to 10 otherwise leave it alone so that would be like taking the maximum value of of something your instructions look something like compare x with 10 Branch if less than to skip that is hey Skip some instruction I'm going to write skip down here and then we're going to say load x with 10 so this is clamping my value Val of x to say the highest number I ever want X to be is 10 whatever it came in as so I'm going to compare with 10 if it's less than 10 then skip over the next instruction I just don't want to do that next instruction and then the rest of my program carries on here otherwise set X to be 10 and that's a very common thing you might want to do right you you you or you might have some other branch which is like well if it's not 10 then it then make it minus 10 or some other thing like this so most of these branches are just skipping over one or two instructions before they carry on and and and blend in with the rest of the the flow of instructions so what the clever Folks at arm decided to do at least with the original arm processor from like the late ' 80s early 90s is they said well what if instead of just this Branch instruction having this sort of like condition Branch if Branch if less than Branch if not equal what if we made every instruction conditional so in the case of like arm and this is where I'm going to show up the fact I haven't done this for years and years and years we might do something like compare register 10 with the number 10 and I realized this is a slightly different style I just want to do this in case people are familiar with arm and then what I might do is move if greater than register 10 with number 10 there's no Branch anymore now I've just said do the comparison and then this move instruction doesn't do anything unless the comparison was greater than it's it's essentially a KN if that's not 
true through and so now my pipeline doesn't have to be restarted I just have a bubble of one instruction that either gets executed if I need it to or it doesn't get executed and just flows throughout the other side of the pipeline so I still haven't had to rester everything I haven't discarded the instruction after that's work and so that's a really smart way of for small sequences of instructions avoiding having to restart the whole pipeline just by saying well skip that one okay now we're good again so it's kind of similar to the the delay slot you're pushing the burden on the programmer a little bit but you given them a tool to be able to do this and then the last thing is to somehow teach the fetch robot at the front of the pipeline how to be Clairvoyant and to look into the future and predict and make some guess as to where the flow of execution is going to go even though that robot has no other information at all and then I think that's a whole other topic because that is how most modern processes work now and it's called Branch prediction it's a it's it's a whole very interesting area of of uh exploration and it's all in uh Aid of trying to keep this pipeline of fetch decode and execute filled with useful work that we don't have to throw away and in modern CPUs I've described it as three steps here but modern CPUs have hundreds of steps so it's even more important that you steer the front of the pipeline in the correct direction otherwise you spend most of your time waiting for this pipeline to fill back up again I imagine this kind of fire hose of data if not the opposite redundant data coming out the end it is exactly right and it is a fire hose these days because um another trick that that CPUs have up their sleeve is that they can you maybe they have what if you had two uh abacuses imagine you had two of them right now you could actually imagine having two of these robots at the end of the pipeline with their abacuses and if you can fill this pipeline up 
fast enough and you can see that there are two instructions one after another and neither of them refer to each other so one's adding 10 to the X register the other one's subtracting one from the Y register why don't I do them at the same time and my two robots at the end of the pipeline could do two instructions at the same time and in order to make that work I have to fill this pipeline up faster and faster and faster to get enough work to be able to do more than one thing at once and that's another reason why we have to make sure the pipeline is filled with useful work at all times and and that there's these complicated thousands of steps or tens of steps uh that happen it's so cool it's so amazing and we don't most of the time we don't think about it we don't know about it one last question on on that then um is there any issue with accessing the the pigeon holes uh you know so I'm thinking you've got a fetcher you've got someone that's trying to do some adding and storing and is are those things that need to be managed as well absolutely yes so there are two parts that I sort of glossed over a little bit here uh one is that yes maybe the pigeon hole only one robot at a time can access the pigeon hole and so if the executing robot needs to write back into the pigeon hole while the fetching robot is trying to fetch more things to put into the pigeon hole then yes you might be in trouble and that's solved by um giving the uh doing solved in two ways the fetcher can pick up lots of things at once from the pigeon hole so he goes to the pigeon holes and gets 16 of them in a row kind okay that gets me he cashes some of that and then yeah he also can have a cach uh up here where he's like well I keep fetching the same things over and over again why don't I just have my own private copy of what's in those pigeon holes and then I don't have to the pigeon holes as much and similarly on the executing side there are caches here this is why we typically have an instruction 
cache separate from a data cache so that we can have the two robots like not have to fight with each other so you're absolutely right there is definitely some things that have to be thought about there the other thing is that in this very simple um process where I said you know the one thing executes after another we don't have any things uh where um if it took for example multiple ticks of the to do the work maybe a multiply takes three ticks of the clock then um if one instruction uses the result of the instruction preceding it I still have to wait for the whole thing to finish beforehand and I have to sort of stop the pipeline while the multiply finishes before I can get the result out to then maybe add one to it at the end and though those kinds of those are called hazards data hazards and uh uh that there is definitely some circuitry that has to like look out for well I can't start the next instruction until the previous one has definitely finished if it depends on it and that also comes into play with our multiple robots over here so we've actually managed to get to about early 2000s era now of of CPUs so this was the Sega Dreamcast which is my favorite game console to work on had an Hitachi sh4 processor that could do two things at once and there were these rules about which two instructions the circuitry could detect would didn't interfere with each other and thus could run together and so you spent your entire time trying to write pairs of instructions that you knew would go together so that you got twice as many instructions get through getting through the pipeline each tick of the clock and then we're going to increment the index and incrementing the index moves it along to the next cell then we're going to say well add to the accumulator using our Abacus so sometimes the way this is done well again in this we're going to go to thelast time we talked about addition didn't we was it addition last time we did some some work we I kind of slowly snuck in the 
fact that my madeup CPU was in fact a real CPU the one that I grew up with and I showed how we could use it to do addition of multiple u values to show that arbitrarily big numbers could be added up by this relatively simple system what I'd like to show today is how CPUs have sort of advanced a little bit beyond that very simplistic View and this is how one of the reasons why they're so fast these days and I think there's going to be a whole number of like little technologies that Each of which feeds into the next and if I explain one without the others it' be like why on Earth would it need to do that kind of thing so we're going to start like the basics so that where we start is uh if you remember our story so far we have a little robot in my mind that sits inside the computer and does all of the work and the work is relatively straightforward pieces like read a number from from memory add it to another number store it back into memory compare numbers those kinds of things and um the way that we realized that that we could explain to the robot what it needs to do is that we could write down numbers that correspond to the kinds of things that it might want to do so like one might be add two might be subtract three might be multiply those kinds of things and then those very numbers themselves can be put into the memory alongside the data you're manipulating the robot's kind of job is we've got our pigeon holes that are the memory and we've got like number one over here which means add so the robot goes first of all looks like what is what am I supposed to do this step of the recipe one looks up in his big chart and says oh one means add okay so now I need to do some adding together and then he goes over to his Abacus and he does the addition or whatever and then he's got the result back and he looks in the next slot and it says you know two is store he goes okay well then I guess I should take that number and I should put it somewhere else in the pigeon holes and 
that's how our whole computer program works now so there's kind of three steps that the robot goes through for every single step of the recipe the first thing is he has to go and look up what is the next step in the recipe read out the number number or numbers from the the the memory cells that correspond to the instruction that he's going to be doing then he has to make sense of them in somehow you know maybe you know he has to look up in his chart that one means add and two means whatever and maybe also if it says add he has to go and get another number which is like well what is the number I'm going to be adding to right and in the third step he actually does the addition he executes I don't know why he's a he I I was actually thinking about this I was thinking should the robot be maybe non-binary but then it seems really I onic to call it a binary robot a non-binary but so like they he she the robot it it let's be it goes and does the um uh the actual work with the with the Abacus and if you can imagine on the circuit that actually corresponds to the robot's mind those parts of the chip are very separate the bit that goes out to memory and pulls numbers out of memory the bit that has a big table that says this is what this means and the bit that has all the Abacus Parts as it were which is is actually we call the ALU the arithmetic and logic unit those are very separate parts of the chip which means that each phase there's a whole part of the chip that's not being used and that seems incredibly wasteful there's only so much we can do in every tick of the clock so behind the scenes there's a sort of like a metronome going on that says you know now it's time to do the next piece of work and you know modern computers that metronome runs at gigahertz back in the day that I'm talking about here we're talking about megaherz thousands of times a second but it still means that each of those steps is one tick of the clock so fetch decode that is work out what it is and 
Most of those steps don't overlap with each other, and one of the classic solutions to a problem where you're going to be doing a sequence of work with repetitive steps is to build a production line, much like the Model T Ford: first you assemble the chassis, then someone else puts the engine block in, then someone else puts the wheels on, then someone else polishes the car, and then the car comes off the end of the conveyor belt. We're going to do the same thing inside our chip. The way I like to think of this is that we're going to break fetching, decoding and executing into three steps along a sort of conveyor belt inside the robot, I guess. On the far left-hand side (again with the pen that's running out) here is our robot at this end of the pipeline, conveyor belt, whatever we're going to call it. He's got what I think of as a spade: he's actually going to shovel on the bytes from memory. He's just reading the next number from memory, one after another, and putting them onto this conveyor belt: our numbers, you know, one, and then 30, and then 50. He's just picking them up and putting them on, one after another, with no real logical thought behind it. Somewhere down the conveyor belt is the decoding robot. The decoding robot, in my picture, has a sort of magnifying glass (which is about the only limit of my drawing here), and he's got the book of what all these opcodes mean, what these numbers actually mean: one means an add, and the next two numbers are the numbers I'm going to add together. So he waits until he's got enough information on the conveyor belt to be able to decode an entire instruction, and then he writes it down
on a new card that just says, you know, "add 10 to address number 57" or whatever (sorry, it's too small to see, but that's what he's doing), and that card goes down the conveyor belt to the next person. The last person on the line, in this instance, is our robot who has the abacus. He picks up the card, and it says: oh, add 10 to memory location 57. So he pulls the number out of memory location 57, adds 10 to it with the abacus, taking whatever time that takes (you could imagine it might take longer for multiplication or division), and then he writes it down on one of those pieces of paper we had before, which we call registers: the A register, the X register, however many registers our CPU has. We haven't fundamentally changed anything about what was going on; we've just broken what was three pieces of work that one robot did into three pieces of work that three robots do, one after another. Any one instruction still takes three stages to get through the system: it has to go through the fetching, through the decoding, and through the execution, so it still takes three metronome ticks before that particular thing has finished. But what you'll notice is that if we can keep it fed, if the robot on the far end can shovel bytes in fast enough and the decoding robot can decode fast enough, then every single metronome tick this executing robot is doing useful work. He gets to use the abacus every single tick of the clock, so we're getting much more use out of our CPU than before; previously the abacus was only being used every third cycle, so it was mostly sitting wasted. And that means that although any one instruction takes three cycles to go through our system, every cycle a new thing rolls off the end of the production line, and that's fantastic; it makes everything go a lot faster.
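That speedup is easy to put in numbers. A minimal sketch, assuming an ideal pipeline with no stalls: the first instruction still needs one tick per stage, but every later instruction finishes one tick after its predecessor.

```python
def unpipelined_ticks(n_instructions, stages=3):
    # one robot does all three phases for each instruction in turn
    return stages * n_instructions

def pipelined_ticks(n_instructions, stages=3):
    # the first instruction drains through every stage; after that a
    # new result rolls off the conveyor belt every single tick
    return stages + (n_instructions - 1)
```

For 100 instructions that's 300 ticks unpipelined versus 102 pipelined: approaching, though never quite reaching, a three-times speedup.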
Now, there's a sort of catch, though, as there always is with these things. It seems great; we could stop the video here and say, oh brilliant, computers go three times faster if you break the work into three bits. But the eagle-eyed among you may have realised that this is all well and good provided our robot on the far left-hand side always knows what the next thing to do is. It's a reasonable guess just to keep reading the next recipe instruction out every single tick of the clock, but we know that one of the things the execute unit can do is branch: to say, actually, I've made a decision and I'm going to start doing something different now. Like: I compared the two numbers, and actually we're going to go to the top of the loop now. So if I draw out a short version of the program we had before, it looks something like this. Load from address 100, say (this was the Fibonacci program, which I realise now I was doing wrong, but we're going to stick with what I was doing before). Then add the next number; this X is just an offset to say which number of the Fibonacci sequence we're on. Then store at 102,X. Then increment X. Then compare X, to ask: have we done 100 of these numbers? Because at some point I need to stop. And if X was not equal to 100, we branch, if not equal, back to the top of the loop up here. So each of these corresponds to some number of numbers in the pigeonholes. This may be a one followed by 100; this may be a two followed by 101; this may be a three followed by 102; this may be a seven, to increment X; this may be an eight and then 100, to say compare with 100; and then a nine, and then something that says go back to the top of the loop. The first number in each pair is the instruction number, and the second one is, in this particular instance, the address of the thing it operates on in the sequence.
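Written out as a 6502-flavoured listing, the loop looks something like the sketch below. The mnemonics are real 6502-style names, but the opcode numbers in the comments (one, two, three, seven, eight, nine) are the video's invented encoding, not real machine code.

```
loop:  LDA 100,X    ; opcode 1, operand 100: load the next number
       ADC 101,X    ; opcode 2, operand 101: add the number after it
       STA 102,X    ; opcode 3, operand 102: store the sum
       INX          ; opcode 7: move X along to the next entry
       CPX #100     ; opcode 8, operand 100: have we done 100 numbers?
       BNE loop     ; opcode 9: branch back to the top if not equal
```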
Yeah, okay, exactly right, yes. And there's no obvious way to tell, just looking at that sequence of numbers, 1, 100, 2, 101, which ones are instructions, which ones are data, which ones are just numbers. That's partly what our decode robot has to do: make sense of this stream of numbers and say, well, these three things correspond to one instruction. So it's not terribly obvious; you can't just stare at it and go, oh, there's a sequence of numbers, I know exactly what that is. And that process, when a human does it, is called disassembly: you're actually disassembling the code back into the assembly, the human-readable version over here. I say human-readable; obviously it's not that human-readable, but it's more human-readable than a sequence of numbers.

A quick side question on that, though: assembler and assembly, what's the difference between those two?

It's one of those things where I think, colloquially, people use them wrongly all the time. Assembly is the name of the text format, and an assembler is the name of the program that assembles the code, but we use them somewhat interchangeably, and I'm as guilty as the next person of saying assembly when I mean machine code, or mixing up assembly, assembler and instruction; it's all a bit vague. The distinction is that the numbers are the machine code, the thing the machine understands, and these opcodes are the assembly code, which you assemble with the assembler to make machine code. But good question, and a very common confusion.

So, we've got this sequence of numbers, and they'd be laid out in memory sequentially, and so we'd be absolutely fine; our robot who's feeding the front of the pipeline would be fine, right up until we get to this branch.
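That decode and disassembly job can be sketched in a few lines of Python. The opcode table uses the video's made-up numbering (1 = load, 2 = add, 3 = store, 7 = increment X, 8 = compare X, 9 = branch), not any real instruction set.

```python
# Map each invented opcode number to a mnemonic and its operand count.
OPCODES = {1: ("LDA", 1), 2: ("ADC", 1), 3: ("STA", 1),
           7: ("INX", 0), 8: ("CPX", 1), 9: ("BNE", 1)}

def disassemble(code):
    """Turn a flat stream of numbers back into readable instructions."""
    out, i = [], 0
    while i < len(code):
        mnemonic, n_operands = OPCODES[code[i]]
        out.append((mnemonic, *code[i + 1:i + 1 + n_operands]))
        i += 1 + n_operands   # only now do we know where the next one starts
    return out

# The example loop as a raw byte stream: 1 100 2 101 3 102 7 8 100 9 0
program = [1, 100, 2, 101, 3, 102, 7, 8, 100, 9, 0]
```

The point in the transcript falls straight out of the code: you cannot tell whether a `100` is data or an instruction without decoding everything that comes before it.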
By the time the branch has made it through, there are maybe some other instructions here. What are we going to call this? Let's say the next instruction was a return, because it's the end of the routine, and then there's just rubbish afterwards. By the time this branch instruction has made it through the pipeline to the execute stage, where the executing robot goes, oh, wait a second, we need to go somewhere else now, the pipeline is already filled: the decode robot has already decoded this return, which we don't want to do, and the fetcher has probably already fetched the bytes corresponding to whatever comes immediately afterwards. So in the simplest form, which is what most of the original CPUs did, the execute robot pulls the brake horn, you know, The Flintstones thing, right? Everyone stops, and the pipeline is restarted: everything is thrown off the pipeline, everything is pushed into the rubbish, and we start again from that branch, from the top of the loop. Now, that means we have to wait three cycles before any useful work gets done again, because we've started from scratch, and that's terribly unfortunate. But if we don't branch very often, it's still a net win. The problem is, we do branch quite a lot: every third or fourth instruction in most programs is a branch of some sort. So it's really, really worth our while trying to keep this pipeline full of useful work rather than discarding it, flushing the pipeline. There are several approaches to solving this conundrum; we'll do a couple of them, and the one we use nowadays is, I think, big enough to be its own topic.
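Before the fixes, a back-of-the-envelope sketch of why flushing hurts so much. The numbers here are mine, not the video's: an idealised pipeline where each taken branch throws away the partly-done instructions behind it.

```python
def average_ticks_per_instruction(stages, branch_fraction, taken_fraction):
    # ideal pipeline: one tick per instruction once the belt is full;
    # each taken branch discards the (stages - 1) instructions already
    # on the belt, which then have to be fetched all over again
    flush_penalty = stages - 1
    return 1 + branch_fraction * taken_fraction * flush_penalty
```

With a branch every fourth instruction, all of them taken, a three-stage pipeline averages 1.5 ticks per instruction instead of 1; with many more stages the penalty dwarfs everything else, which is why branch prediction becomes essential.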
One thing we could do is to say: how about the next instruction after a branch gets executed anyway, even though it may not be on the path the branch takes? If the branch says go somewhere else, we just say, hey, let's do that work anyway: it's already fetched, it's all the way through the pipeline, we've decoded it, it's about to start executing, so what the heck, let's execute it anyway. Now, that means the programmers have to know that's happening; we have to write our code specially. So in the particular case of our loop here, you would try to find another piece of useful work that you needed to do anyway, and you would put it immediately after the branch: you say "branch if not equal to loop", and then some other useful instruction here. In our particular example there isn't a useful thing, so we'd actually end up putting a NOP there, a deliberate "don't do anything here", which is still better than flushing the pipeline. Maybe you could be clever: you could move the load from the top of the loop down into the bottom of the loop, so that we load the next number ready, and then instead of jumping all the way to the top of the loop we branch to the second address of the loop. So you can kind of move things around, and this is where compilers, which write the code on your behalf, can do these things for you. This little instruction immediately after a branch that's executed anyway is called a delay slot, and it's a very simplistic way of continuing to do useful work when the pipeline would otherwise be flushed. It just puts the burden on the poor programmer to write their code in a particular way to take advantage of it. So that's one way you can do it.
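The classic architecture with an architectural delay slot is MIPS; the video doesn't name one, so this is my choice of illustration. A sketch of the rescheduling idea: the instruction in the slot after `bne` executes whether or not the branch is taken, so the programmer or compiler tries to park something genuinely useful there.

```
# Before scheduling: the delay slot is wasted on a NOP.
loop:   lw    $t0, 0($s0)       # load the next number
        addu  $t1, $t1, $t0     # add it to the running total
        addiu $s0, $s0, 4       # step the pointer along
        bne   $s0, $s2, loop    # not at the end yet? go round again
        nop                     # delay slot: executed either way

# After scheduling: the add is moved into the slot. It is wanted on
# every iteration (including after the final branch falls through),
# so nothing breaks and the slot now does useful work.
loop2:  lw    $t0, 0($s0)
        addiu $s0, $s0, 4
        bne   $s0, $s2, loop2
        addu  $t1, $t1, $t0     # delay slot: still executed, still useful
```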
Another way you can do it, and this is what the folks at Arm decided to do, comes from looking at the majority of branches. With the exception of these loop ones, where we go back to the top, oftentimes you want to do something along the lines of: compare it with 10; if it's greater than 10, set it to 10; otherwise leave it alone. That would be like taking the maximum value of something. Your instructions look something like: compare X with 10; branch if less than to "skip" (that is, hey, skip some instructions; I'm going to write "skip" down here); and then load X with 10. So this is clamping my value of X: the highest number I ever want X to be is 10, whatever it came in as. I compare with 10; if it's less than 10, I skip over the next instruction, because I just don't want to do it, and the rest of my program carries on here; otherwise, X gets set to 10. And that's a very common thing to want to do, right? Or you might have some other branch, like: if it's not 10, then make it minus 10, or some other thing like that. Most of these branches are just skipping over one or two instructions before they carry on and blend back in with the rest of the flow of instructions. So what the clever folks at Arm decided to do, at least with the original ARM processor from the late '80s and early '90s, is they said: what if, instead of only the branch instruction having a condition (branch if less than, branch if not equal), we made every instruction conditional? So in the case of ARM, and this is where I'm going to show up the fact that I haven't done this for years and years, we might do something like: compare register 10 with the number 10 (I realise this is a slightly different style; I just want to do it this way in case people are familiar with ARM), and then: move, if greater than, the number 10 into register 10. There's no branch any more. I've just said: do the comparison, and then this move instruction doesn't do anything unless the comparison came out greater than; it's essentially a NOP if that's not true. And so now my pipeline doesn't have to be restarted. I just have a bubble of one instruction that either gets executed, if I need it to, or doesn't get executed and simply flows out the other side of the pipeline.
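In classic 32-bit ARM syntax (roughly remembered, as the speaker hedges too), the clamp becomes two instructions with no branch at all:

```
        CMP   r10, #10      @ compare register 10 with the number 10
        MOVGT r10, #10      @ "move if greater than": only writes r10
                            @ when the compare set the greater-than
                            @ condition; otherwise it drifts through
                            @ the pipeline as a NOP
```

The `GT` suffix is ARM's conditional-execution mechanism: almost any instruction can carry a condition code, which is exactly the "make every instruction conditional" idea described above.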
So I still haven't had to restart everything, and I haven't discarded the work done on the instructions after it. That's a really smart way, for small sequences of instructions, of avoiding a restart of the whole pipeline, just by saying: well, skip that one; okay, now we're good again. It's similar in spirit to the delay slot: you're pushing some burden onto the programmer, but you've given them a tool to do it with. And then the last thing is to somehow teach the fetch robot at the front of the pipeline to be clairvoyant: to look into the future and make some guess as to where the flow of execution is going to go, even though that robot has no other information at all. I think that's a whole other topic, because it's how most modern processors work now, and it's called branch prediction. It's a whole very interesting area of exploration, and it's all in aid of trying to keep this pipeline of fetch, decode and execute filled with useful work that we don't have to throw away. And in modern CPUs, I've described it as three steps here, but modern CPUs have tens of steps, so it's even more important that you steer the front of the pipeline in the correct direction; otherwise you spend most of your time waiting for the pipeline to fill back up again.

I imagine a kind of fire hose of data, if not the opposite, redundant data, coming out of the end.

That is exactly right, and it is a fire hose these days, because another trick that CPUs have up their sleeve is this: what if you had two abacuses? Imagine you had two of them. Now you could imagine having two of these robots at the end of the pipeline with their abacuses, and if you can fill this pipeline up fast enough, and you can see that there are two instructions, one after another, and neither of them refers to the other, one's adding 10 to the X register and the
other one's subtracting one from the Y register, why don't I do them at the same time? My two robots at the end of the pipeline could do two instructions at the same time, and to make that work I have to fill this pipeline up faster and faster, to get enough work to do more than one thing at once. That's another reason why we have to make sure the pipeline is filled with useful work at all times, and why there are these complicated tens of steps going on. It's so cool, it's so amazing, and most of the time we don't think about it, we don't even know about it.

One last question on that, then. Is there any issue with accessing the pigeonholes? I'm thinking: you've got a fetcher, and you've got someone that's trying to do some adding and storing. Are those things that need to be managed as well?

Absolutely, yes. There are two parts that I sort of glossed over a little here. One is that, yes, maybe only one robot at a time can access the pigeonholes, and so if the executing robot needs to write back into the pigeonholes while the fetching robot is trying to fetch more things out of them, then yes, you might be in trouble. That's solved in two ways. First, the fetcher can pick up lots of things at once: he goes to the pigeonholes and gets sixteen of them in a row, and keeps hold of them. Second, he can also have a cache up at his end, where he says: well, I keep fetching the same things over and over again, so why don't I keep my own private copy of what's in those pigeonholes, and then I don't have to go to them as much. And similarly, on the executing side, there are caches there too. This is why we typically have an instruction cache separate from a data cache, so that the two robots don't have to fight with each other. So you're absolutely right, there are definitely some
things that have to be thought about there. The other thing is that in this very simple process, where I said one thing executes after another, we haven't dealt with work that takes multiple ticks of the clock. Maybe a multiply takes three ticks; then, if one instruction uses the result of the instruction preceding it, I have to wait for the whole multiply to finish, stalling the pipeline, before I can get the result out and, say, add one to it. Those are called hazards, data hazards, and there is definitely some circuitry that has to look out for them: I can't start the next instruction until the previous one has definitely finished, if it depends on it. That also comes into play with our multiple robots over here. We've actually managed to get to about the early-2000s era of CPUs now. The Sega Dreamcast, which is my favourite games console to work on, had a Hitachi SH-4 processor that could do two things at once, and there were rules about which pairs of instructions the circuitry could detect didn't interfere with each other and thus could run together. So you spent your entire time trying to write pairs of instructions that you knew would go together, so that you got twice as many instructions through the pipeline each tick of the clock.
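That pairing rule can be sketched with invented instruction tuples rather than real SH-4 encodings: two adjacent instructions may issue in the same tick only if neither reads a register the other writes, and they don't both write the same register.

```python
def can_pair(a, b):
    """Each instruction is (writes, reads), as sets of register names.
    A hypothetical dual-issue rule: issue a and b in the same tick
    only if they are fully independent of each other."""
    return (not (a[0] & b[1])       # b doesn't read what a writes
            and not (b[0] & a[1])   # a doesn't read what b writes
            and not (a[0] & b[0]))  # they don't write the same place

add_x = ({"X"}, {"X"})   # add 10 to the X register
sub_y = ({"Y"}, {"Y"})   # subtract 1 from the Y register
use_x = ({"A"}, {"X"})   # something that reads X into A
```

Here `add_x` and `sub_y` pair up, but `add_x` and `use_x` do not, because `use_x` needs the X value that `add_x` is still producing: exactly the data hazard described above.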