Computer Speeds - Computerphile

The Art of Pipeline Optimization: Understanding CPU Architecture and Execution Speed

As we delve into the world of computer architecture, it's essential to understand how CPUs execute instructions. A key concept in this realm is the pipeline, which allows multiple instructions to be processed simultaneously. By breaking down the execution process into smaller stages, we can analyze how CPU architects design these systems to achieve optimal performance.

In a typical pipeline, instructions are fetched from memory and then decoded. Decoding involves working out the instruction's opcode, its addressing mode, and any operands it needs. The next step is to execute the instruction, which may involve arithmetic, a load or store, or some other operation. Finally, the result is written back to a register or, for store instructions, to memory.

However, this process doesn't happen in a single clock cycle. Instead, it's divided into multiple stages, each taking a specific amount of time to complete. The pipeline is designed so that every stage can be busy at once, each working on a different instruction: while one instruction executes, the next is being decoded and the one after that is being fetched. This is known as pipelining, and it allows the CPU to complete more instructions in less time.
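The cycle-count benefit of overlapping stages can be sketched with two small formulas. This is a minimal model assuming an ideal pipeline with no stalls; the function names are illustrative, not from any real tool:

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    # Without pipelining, each instruction passes through every stage
    # before the next instruction even starts.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    # With an ideal pipeline (no stalls), the first instruction takes
    # n_stages cycles to drain through; after that, one instruction
    # completes every cycle.
    return n_stages + (n_instructions - 1)

# 100 instructions through a 3-stage fetch/decode/execute pipeline:
print(cycles_unpipelined(100, 3))  # 300
print(cycles_pipelined(100, 3))    # 102
```

The gap widens with instruction count: the pipeline's fill cost is paid once, after which throughput approaches one instruction per cycle.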

The key to successful pipelining is to avoid stalls, which occur when the pipeline is unable to make progress due to a dependency on another instruction or resource. These stalls can significantly slow down the execution speed of the CPU. To mitigate this, designers use various techniques, such as out-of-order execution, where instructions are reordered to minimize dependencies and reduce stalls.
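The reordering idea can be sketched as a toy scheduler that spots a dependency between adjacent instructions and hoists a later, independent instruction into the gap. This is a deliberately simplified sketch (it ignores write-after-write and write-after-read hazards, which real out-of-order hardware handles with register renaming); the function name and instruction tuples are invented for illustration:

```python
def fill_stall_with_independent(instructions):
    """Toy out-of-order sketch. Each instruction is (name, dest, sources).
    If the next instruction reads this one's destination, it would stall,
    so look further ahead for an independent instruction to hoist forward."""
    result = list(instructions)
    for i in range(len(result) - 1):
        if result[i][1] in result[i + 1][2]:  # dependency -> would stall
            for j in range(i + 2, len(result)):
                if result[i][1] not in result[j][2]:  # doesn't need the result
                    result.insert(i + 1, result.pop(j))  # hoist into the gap
                    break
    return [name for name, _, _ in result]

prog = [
    ("MUL", "D", ("B", "C")),  # the ADD below needs D...
    ("ADD", "A", ("A", "D")),  # ...so it would stall waiting for the MUL
    ("SUB", "E", ("B", "C")),  # independent: can run during the gap
]
print(fill_stall_with_independent(prog))  # ['MUL', 'SUB', 'ADD']
```

The SUB is moved between the dependent pair, so the execute stage stays busy during the cycle the ADD would otherwise have spent stalled.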

Another crucial aspect of pipelining is the clock speed. By increasing the clock speed, the CPU can execute more instructions per second, leading to faster overall performance. However, there's a limit to how fast the clock can run: each stage is built from digital logic, and that logic has a propagation delay. If the clock ticked over before the slowest stage's logic had finished settling, the pipeline would produce wrong results, so the slowest stage sets a floor on the clock period.
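That floor can be expressed as a one-line formula: the clock period must cover the slowest stage, so the maximum frequency is its reciprocal. The delay figures below are hypothetical, chosen only to make the arithmetic visible:

```python
def max_clock_hz(stage_delays_ns):
    # The clock period cannot be shorter than the propagation delay of
    # the slowest stage, so that stage caps the clock frequency.
    slowest_ns = max(stage_delays_ns)
    return 1e9 / slowest_ns  # convert a period in ns to a frequency in Hz

# Hypothetical per-stage propagation delays for a 3-stage pipeline (ns):
print(max_clock_hz([0.8, 1.0, 0.9]))  # 1e9, i.e. 1 GHz
```

Note that shaving time off the fast stages achieves nothing; only reducing the slowest stage's delay lets the clock run faster.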

To overcome this limitation, designers make the pipeline longer, splitting the work into more, smaller stages. For example, a six-stage pipeline might be used instead of a three-stage one: each stage then contains less logic, so its propagation delay is shorter and the clock can be run faster. The pipeline takes longer to fill, but once full it still completes one instruction per (now shorter) cycle.
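The trade-off can be checked numerically. Assuming (hypothetically) that three 1 ns stages can be split cleanly into six 0.5 ns stages, the deeper pipeline pays a larger fill cost but wins over any long run of instructions:

```python
def total_time_ns(n_instructions, n_stages, stage_delay_ns):
    # Ideal pipeline: n_stages cycles to fill, then one completion per
    # cycle; the clock period equals the per-stage delay.
    cycles = n_stages + (n_instructions - 1)
    return cycles * stage_delay_ns

# Splitting three 1 ns stages into six 0.5 ns stages halves the clock
# period; over many instructions the deeper pipeline finishes sooner.
print(total_time_ns(1000, 3, 1.0))  # 1002.0 ns
print(total_time_ns(1000, 6, 0.5))  # 502.5 ns
```

In practice the split is never perfectly even and each extra stage adds latch overhead, so the speedup is less than the ideal factor of two shown here.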

However, there's another challenge that arises with pipelining: bubbles. A bubble occurs when a stage in the pipeline is waiting for an instruction that's not yet available due to a dependency or other constraint. The deeper the pipeline, the more cycles it takes to refill after a bubble, so lengthening the pipeline makes each stall more expensive even as it raises the clock speed. To mitigate this, designers use techniques such as branch prediction, which attempts to anticipate the outcome of a branch so the pipeline can keep fetching down the right path.
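The point that a faster, deeper pipeline can still lose more absolute time per bubble is worth putting in numbers. Continuing the hypothetical figures from before (a short pipeline at 1 ns per cycle versus a deeper one at 0.5 ns per cycle):

```python
def refill_penalty_ns(n_flushed_stages, stage_delay_ns):
    # When a bubble forms (e.g. a mispredicted branch), the work fetched
    # down the wrong path is discarded and the pipeline must refill:
    # roughly one cycle per flushed stage.
    return n_flushed_stages * stage_delay_ns

# The deeper pipeline runs a faster clock, yet each bubble costs more
# wall-clock time because so many more stages must refill:
print(refill_penalty_ns(2, 1.0))  # 2.0 ns on a short pipeline
print(refill_penalty_ns(5, 0.5))  # 2.5 ns on a deeper, faster one
```

This is exactly why branch prediction matters more as pipelines get deeper: the prediction rate has to improve just to keep the average penalty constant.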

Superscalar architectures are another approach used to improve pipelining. By executing multiple instructions simultaneously, these designs aim to increase overall throughput without relying on increased clock speeds. However, this approach also requires careful management of dependencies and out-of-order execution to avoid stalls.
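The throughput effect of issue width can be added to the earlier cycle-count sketch. This assumes an ideal machine where enough independent instructions are always available, which real code rarely provides; the function name is invented:

```python
import math

def cycles_superscalar(n_instructions, n_stages, issue_width):
    # An ideal issue_width-wide superscalar pipeline starts up to
    # issue_width independent instructions per cycle once full.
    return n_stages + math.ceil(n_instructions / issue_width) - 1

# 100 instructions through a 6-stage pipeline:
print(cycles_superscalar(100, 6, 1))  # 105 (scalar)
print(cycles_superscalar(100, 6, 2))  # 55  (dual-issue)
```

Dependencies between instructions keep real machines well below this ideal, which is why superscalar designs lean so heavily on out-of-order execution to find independent work.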

In conclusion, understanding CPU architecture and execution speed is crucial for designing efficient systems that can execute code quickly. The pipeline, with its various stages and techniques for managing dependencies and out-of-order execution, plays a critical role in achieving optimal performance. By employing strategies such as pipelining, superscalar architectures, and branch prediction, designers can create systems that not only run at higher clock speeds but also achieve better overall performance.

While increasing the clock speed is essential for improving performance, it's just one aspect of the equation. The design of the CPU architecture, including the pipeline, has a significant impact on how fast code runs. By carefully managing dependencies and out-of-order execution, designers can create systems that take full advantage of increased clock speeds, leading to faster overall system performance.

To illustrate this concept further, let's consider an example with two instructions: `MUL D` followed by `ADD A`, where the addition needs the result of the multiplication. Because of that dependency, the two cannot execute in the same cycle, even on a superscalar machine. What the pipeline can do is overlap every stage that doesn't need the result: fetch and decode the `ADD` while the `MUL` is still executing, then execute the `ADD` as soon as the multiplication's result is available. Squashing the schedule together like this still saves time compared with running the two instructions back to back.

For instance, in a six-stage pipeline with separate stages for fetching, decoding, reading registers, executing, and writing results back, the two instructions move through the front-end stages in overlapping cycles. The `ADD` then simply waits in the pipeline for the cycle or two until the `MUL`'s result is ready, rather than starting its journey from scratch after the `MUL` completes. Only the dependent execute stage is delayed; everything else is overlapped, which is where the saving comes from.
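A toy dual-issue scheduler makes the `MUL`/`ADD` situation concrete: two instructions can issue together only if the second doesn't read the first's destination. The data format and function name are invented for this sketch, and it only checks read-after-write dependencies:

```python
def schedule_dual_issue(instructions):
    """Greedy dual-issue sketch. Each instruction is (name, dest, sources).
    Two adjacent instructions issue in the same cycle only if the second
    does not read the first's destination register."""
    issued, i = [], 0
    while i < len(instructions):
        first = instructions[i]
        if i + 1 < len(instructions) and first[1] not in instructions[i + 1][2]:
            issued.append([first[0], instructions[i + 1][0]])  # pair them
            i += 2
        else:
            issued.append([first[0]])  # dependency: issue alone
            i += 1
    return issued

# ADD reads D, which MUL writes, so the pair cannot issue together:
prog = [("MUL", "D", ("B", "C")), ("ADD", "A", ("A", "D"))]
print(schedule_dual_issue(prog))  # [['MUL'], ['ADD']]
```

Change the `ADD` to read `B` instead of `D` and the same scheduler pairs them into a single issue slot, which is the whole payoff of finding independent work.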

Ultimately, designing efficient CPU architectures requires a deep understanding of how these techniques interact: pipelining raises the clock speed, but superscalar issue, out-of-order execution, and branch prediction are what keep the pipeline full enough for that clock speed to pay off. A raw gigahertz figure therefore tells you how fast the clock ticks, not how fast the code runs, and it cannot meaningfully compare CPUs of different designs.
