Inside the CPU - Computerphile

The Fetch, Decode, and Execute Logic of a CPU

A typical computer's central processing unit (CPU) works by repeatedly cycling through three phases: fetch, decode, and execute. The process begins with fetching an instruction from memory: the CPU places the address of the next instruction on the address bus and reads the instruction back over the data bus. Once the instruction has been fetched, it is decoded, which means working out which operation it encodes and which registers or memory locations it operates on. Finally, the instruction is executed by performing that operation.
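The loop described above can be sketched in a few lines of Python. This is a toy model, not any real CPU: the two load/add opcodes are genuine 6502 immediate-mode encodings, but the halt opcode and the machine as a whole are invented for illustration.

```python
# Toy fetch-decode-execute loop for a hypothetical 8-bit accumulator
# machine. LDA_IMM and ADC_IMM are real 6502 opcode encodings; HLT and
# the machine model are invented for this sketch.

LDA_IMM = 0xA9   # load the accumulator with the byte that follows
ADC_IMM = 0x69   # add the byte that follows to the accumulator
HLT     = 0x02   # made-up halt opcode for this sketch

def run(memory):
    pc, acc = 0, 0                   # program counter and accumulator
    while True:
        opcode = memory[pc]          # FETCH: read the instruction byte
        pc += 1
        if opcode == LDA_IMM:        # DECODE: recognise the opcode...
            acc = memory[pc]         # EXECUTE: ...and carry it out
            pc += 1
        elif opcode == ADC_IMM:
            acc = (acc + memory[pc]) & 0xFF
            pc += 1
        elif opcode == HLT:
            return acc
        else:
            raise ValueError(f"unknown opcode {opcode:#04x}")

# LDA #5; ADC #3; halt -> the accumulator ends up holding 8
print(run([0xA9, 0x05, 0x69, 0x03, 0x02]))   # 8
```

Each trip around the `while` loop is one full instruction: the program counter drives the fetch, the `if`/`elif` chain plays the role of the decode logic, and the body of each branch is the execute step.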

The Fetch Phase: The First Step in Executing an Instruction

The fetch phase is where the CPU reads the next instruction from memory: the instruction's address is placed on the address bus, and the memory returns the instruction over the data bus. A CPU is built from many separate pieces of logic, but in a simple, non-pipelined design only one of them is doing useful work at any given moment. While the CPU is fetching an instruction, the decode and execute logic sit idle, because each phase depends on the output of the one before it.

Once the fetch phase has completed, the decode phase can begin. This involves examining the instruction that was just fetched and working out which operation it encodes and which operands it needs. For a load instruction, for example, the decode logic determines the address the value should be read from, drives the address bus accordingly, and arranges for the value to be written into the right register once it arrives in the CPU.
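In hardware the decode logic is a mesh of gates, but conceptually it behaves like a table lookup from opcode byte to operation, instruction length, and the sequence of steps the rest of the CPU must perform. A sketch using three real 6502 encodings (the per-step descriptions are informal, not actual micro-operations):

```python
# Decode as a table lookup: opcode byte -> (mnemonic, instruction length
# in bytes, sequence of steps for the rest of the CPU to carry out).
# The opcodes are real 6502 encodings; the step strings are illustrative.

DECODE_TABLE = {
    0xA9: ("LDA #imm", 2, ("fetch operand byte", "write it into A")),
    0x8D: ("STA abs",  3, ("fetch address low byte", "fetch address high byte",
                           "put address on the bus", "write A to memory")),
    0x60: ("RTS",      1, ("pull return address from the stack",
                           "load it into the program counter")),
}

def decode(opcode):
    if opcode not in DECODE_TABLE:
        raise ValueError(f"unknown opcode {opcode:#04x}")
    return DECODE_TABLE[opcode]

mnemonic, length, steps = decode(0xA9)
print(mnemonic, length)   # LDA #imm 2
```

The length column hints at why decode can be hard on some architectures: on a fixed-length ISA like ARM the answer is always the same, while on x86 the decoder cannot know how many bytes to fetch until it has examined the first few.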

The Fetch, Decode, and Execute Process: A Pipelined Model

However, this traditional fetch-decode-execute model can be inefficient: each instruction takes several clock cycles to complete because it must pass through each phase in turn, while the stages not currently in use sit idle. To improve efficiency, a pipelined model can be used. The CPU still has the same three phases, fetch, decode, and execute, but instead of one instruction occupying the whole CPU at a time, the different stages work on different instructions simultaneously.

For example, while instruction one is being decoded, the fetch logic would otherwise be idle, so it can begin fetching instruction two. A cycle later, instruction one moves to execute, instruction two moves to decode, and instruction three is fetched. In this way the pipelined model keeps every stage busy and has several instructions in flight at once.
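The overlap is easiest to see on a timeline. The small sketch below prints which stage each instruction occupies on each cycle, for an idealized three-stage pipe with no stalls (the rendering and function name are invented for illustration):

```python
# Print a cycle-by-cycle timeline for a stall-free three-stage pipeline.
# Instruction i enters fetch on cycle i+1, so once the pipe is full,
# one instruction finishes executing on every clock cycle.

def timeline(n, stages=("F", "D", "E")):
    width = n - 1 + len(stages)          # total number of cycles shown
    rows = []
    for i in range(n):
        cells = ["."] * i + list(stages)
        cells += ["."] * (width - len(cells))
        rows.append(f"I{i + 1}: " + " ".join(cells))
    return rows

for row in timeline(4):
    print(row)
# I1: F D E . . .
# I2: . F D E . .
# I3: . . F D E .
# I4: . . . F D E
```

Reading the columns vertically shows the key property: from cycle 3 onward, every column contains one F, one D, and one E, so all three stages are busy and one instruction completes per cycle.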

The Pipelining Process: How it Works

In a pipelined CPU, the stages work in parallel, each holding a different instruction. On every clock cycle each instruction advances one stage: the instruction that has just been decoded moves into execute, the instruction that has just been fetched moves into decode, and the fetch logic reads the next instruction from memory.

In any one cycle, then, one part of the CPU is fetching an instruction from memory, another part is decoding the previous instruction, and a third is executing the instruction before that.

The key advantage of the pipelined model is that, once the pipeline is full, one instruction completes every cycle, which greatly reduces the average time per instruction. However, there are drawbacks: two stages can need the same resource in the same cycle, for example when an executing store instruction and the fetch logic both need the single memory bus.

Pipeline Hazards: The Drawback of Pipelining

One of the main challenges with pipelining is dealing with pipeline hazards. A hazard occurs when an instruction cannot proceed through its next stage because a resource it needs, such as the shared memory bus, is already in use. The CPU must then stall, inserting an idle cycle, a "bubble", into the pipeline until the resource becomes free.

In the example above, if one instruction is executing a store to memory while the next instruction is due to be fetched, the two compete for the single address and data bus: the CPU cannot write to memory and fetch an instruction in the same cycle. The fetch must wait a cycle, the bubble travels down the pipeline, and overall throughput drops slightly below one instruction per cycle.
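This structural hazard can be modelled directly. The sketch below simulates the three-stage pipe with one rule added: a store's execute stage and an instruction fetch both need the single memory bus, so they cannot happen in the same cycle. The model and its "alu"/"store" instruction kinds are invented for illustration.

```python
# Simulate a 3-stage F/D/E pipeline where fetch and a store's execute
# stage share one memory bus. Returns the cycle on which each
# instruction's execute stage completes; a store delays the fetch
# behind it by one cycle, creating a bubble.

def simulate(kinds):
    n = len(kinds)
    done = [None] * n            # completion cycle per instruction
    F = D = E = None             # instruction index held by each stage
    next_fetch, cycle = 0, 0
    while any(c is None for c in done):
        cycle += 1
        E, D, F = D, F, None     # every instruction advances one stage
        if E is not None:
            done[E] = cycle
        # the bus is busy this cycle if the executing instruction stores
        bus_busy = E is not None and kinds[E] == "store"
        if not bus_busy and next_fetch < n:
            F = next_fetch       # otherwise, fetch the next instruction
            next_fetch += 1
    return done

print(simulate(["alu", "alu", "alu", "alu"]))    # [3, 4, 5, 6]
print(simulate(["alu", "store", "alu", "alu"]))  # [3, 4, 5, 7]
```

In the second run, the store blocks the fetch of the fourth instruction for one cycle, so it completes on cycle 7 instead of 6: that one-cycle gap is the bubble.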

The Benefits of Pipelining

Despite these drawbacks, pipelining offers substantial benefits. By dividing the CPU into stages that work on different instructions simultaneously, pipelining exploits instruction-level parallelism: each instruction still takes several cycles to pass through the pipe, but instructions complete at a rate approaching one per clock cycle, far better than the sequential fetch-decode-execute model.

Pipelining also shapes how efficient CPUs are designed. Because each stage performs only a fraction of the work of a whole instruction, each stage can be kept simple and fast, and designers can optimize the stages individually.

The Future of Pipelining: How Far Can We Go?

While pipelining has come a long way since it first appeared in 1960s mainframes and became standard in microprocessors during the 1980s, challenges remain. One major challenge is dealing with instructions that contend for the same resources. CPU designers have developed architectures and techniques that mitigate these issues.

One well-established technique is out-of-order execution. This allows the CPU to execute instructions in whatever order their operands become available, rather than strictly in program order, as long as the final results are unchanged. By slipping independent instructions past a stalled one, the CPU can fill pipeline bubbles and improve overall performance.
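A bare-bones sketch of the idea: each cycle, issue the oldest instruction whose inputs are ready, rather than the next one in program order. The dependency-and-latency representation here is invented for illustration; real out-of-order hardware uses mechanisms such as register renaming and reservation stations.

```python
# Minimal out-of-order issue model: one instruction issues per cycle,
# and an instruction is ready once every instruction it depends on has
# completed. instrs[i] = (set of dependency indices, result latency).

def issue_order(instrs):
    finish = {}                  # instruction -> cycle its result is ready
    order, cycle = [], 0
    while len(order) < len(instrs):
        cycle += 1
        for i, (deps, latency) in enumerate(instrs):
            if i in finish:
                continue         # already issued
            if all(d in finish and finish[d] <= cycle for d in deps):
                finish[i] = cycle + latency
                order.append(i)
                break            # oldest ready instruction issues
    return order

# I0: slow load; I1: add that needs I0's result; I2, I3: independent ops.
program = [(set(), 3), ({0}, 1), (set(), 1), (set(), 1)]
print(issue_order(program))      # [0, 2, 3, 1]
```

While the add waits for the load's result, the two independent instructions slip ahead of it, filling cycles that a strictly in-order pipeline would have wasted as bubbles.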

In conclusion, the fetch-decode-execute cycle is the foundation of how every CPU operates. The simple sequential model leaves most of the CPU idle, while pipelining keeps every stage busy and brings throughput close to one instruction per cycle, at the cost of having to manage pipeline hazards. These ideas, together with extensions such as out-of-order execution, underpin the performance of modern processors.
