Crashes, Cosmic Rays and Kernel Panic - Computerphile

**The Importance of Exception Handling in Operating Systems**

When writing code, developers tend to focus on the logic and functionality of their programs and give less thought to what happens when something goes wrong. However, errors are an inherent part of programming, and operating systems are no exception. In this article, we will explore the concept of exceptions in operating systems, how they are handled, and what happens when a program, or the operating system itself, crashes.

**Memory Access Errors**

One common class of error is the invalid memory access. A program may try to read from an address where no memory exists, or write to locations that belong to another part of the system, such as the operating system itself. On a machine without memory protection, a stray write of that kind can overwrite the operating system's own data and freeze the whole machine. Modern computers have a memory management unit (MMU) that partitions memory between programs and catches such accesses. In effect, the MMU tells the operating system, "This program is trying to do something it is not allowed to do." The operating system then stops the offending program before it can cause further damage. This is one example of how exceptions are handled in operating systems.
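
The partitioning idea can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: `ToyMMU`, `MemoryFault`, and `run_program` are hypothetical names, each program owns a single contiguous region, and the "operating system" is just a function that catches the fault. A real MMU works in hardware, typically on pages rather than whole regions.

```python
class MemoryFault(Exception):
    """Raised by the toy MMU when a program touches memory it does not own."""


class ToyMMU:
    """Hypothetical model: each program owns one contiguous [base, base + size) region."""

    def __init__(self, total_size):
        self.memory = bytearray(total_size)
        self.regions = {}  # program name -> (base, size)

    def map_region(self, program, base, size):
        self.regions[program] = (base, size)

    def _check(self, program, address):
        base, size = self.regions[program]
        if not (base <= address < base + size):
            raise MemoryFault(f"{program} accessed {address:#x} outside its region")

    def write(self, program, address, value):
        self._check(program, address)
        self.memory[address] = value

    def read(self, program, address):
        self._check(program, address)
        return self.memory[address]


def run_program(mmu, program, address, value):
    """Play the role of the OS: attempt one write and stop the program on a fault."""
    try:
        mmu.write(program, address, value)
        return "ok"
    except MemoryFault:
        return "terminated"


mmu = ToyMMU(total_size=256)
mmu.map_region("os", base=0, size=64)
mmu.map_region("app", base=64, size=64)

print(run_program(mmu, "app", 70, 0xFF))  # ok: inside the app's own region
print(run_program(mmu, "app", 10, 0x00))  # terminated: address belongs to the OS
```

Because the check happens before the write, the operating system's region is left untouched when the second access is rejected.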

**Divide by Zero and Other Errors**

Memory access is not the only source of exceptions; division by zero is the classic example. When a program attempts to divide by zero there is no meaningful answer (one might loosely call it infinity), so the computer cannot calculate a result. Instead, the CPU raises an exception, which signals to the operating system that something has gone wrong. The operating system is written to catch that exception and act on it, typically by stopping the program and reporting the error.
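
The same raise-and-catch pattern is visible at the language level. The following is a minimal sketch in Python, where the language runtime (not the CPU or the operating system) detects the division by zero and raises `ZeroDivisionError`; `safe_divide` is a hypothetical helper name used only for illustration.

```python
def safe_divide(numerator, denominator):
    """Return (result, error); error is None when the division succeeded."""
    try:
        return numerator / denominator, None
    except ZeroDivisionError as exc:
        # The exception was raised because no result exists; we catch it
        # here instead of letting it terminate the whole program.
        return None, str(exc)


print(safe_divide(10, 2))
print(safe_divide(10, 0))
```

If the `try`/`except` were removed, the exception would propagate upwards and end the program, which mirrors what the operating system does to a program that triggers an exception it does not handle.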

**Crashing and Rebooting**

When an ordinary program triggers an exception, the operating system can catch it and, with help from the hardware, clean up so that nothing else is affected. But if the exception is raised by the operating system's own code, the operating system cannot recover, because it no longer knows what state it is meant to be in. It panics instead. On Linux and macOS this is called a kernel panic; on Windows it is the infamous Blue Screen of Death. The screen usually shows information that is useful to a developer, and the machine has to be reset before it can be used again.
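
The distinction between a program crash and a panic can be sketched as a single decision. This is a toy model, not how any particular kernel is written: `handle_fault`, `KernelPanic`, and the `"kernel"` label are hypothetical names chosen for the example.

```python
class KernelPanic(Exception):
    """Raised in this toy model when the fault happened inside the kernel itself."""


def handle_fault(faulting_context, processes):
    """Decide what to do with a fault, given who caused it.

    faulting_context: "kernel" or the name of a user process (hypothetical labels).
    processes: set of running user process names; modified in place.
    """
    if faulting_context == "kernel":
        # The kernel can no longer trust its own state, so it stops everything.
        raise KernelPanic("fault in kernel code: halting")
    # A fault in a user process: stop only that process and carry on.
    processes.discard(faulting_context)
    return f"terminated {faulting_context}"


running = {"editor", "browser"}
print(handle_fault("editor", running))  # terminated editor
try:
    handle_fault("kernel", running)
except KernelPanic as panic:
    print("PANIC:", panic)
```

In the user-process branch the rest of the system keeps running; in the kernel branch nothing is left to do the cleaning up, which is why a reset is needed.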

**What Causes Kernel Panics?**

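Kernel panics can be caused by bugs in the code, a misconfigured kernel, incompatible hardware or drivers, failing memory, and even cosmic rays. A cosmic ray is a high-energy particle; if one strikes a memory cell it can flip a bit from 0 to 1 or from 1 to 0, even when the memory itself is working correctly. This is more likely at higher altitudes and is a particular concern for satellites. Shielding offers some protection, and error-correcting (ECC) memory can detect that an error has occurred, but if the operating system's data still ends up corrupted it has no choice but to panic.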
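
To make the idea of error-correcting memory concrete, here is a minimal sketch of the textbook Hamming(7,4) code, which stores 4 data bits alongside 3 parity bits and can locate and repair any single flipped bit. This is an illustration of the principle only; real ECC memory uses wider codes over larger words, and the function names below are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword (parity at positions 1, 2 and 4)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # 1-based index of the flipped bit
    if error_position:
        c[error_position - 1] ^= 1  # flip the damaged bit back
    return [c[2], c[4], c[5], c[6]], error_position


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a cosmic ray flipping one stored bit
recovered, position = hamming74_decode(stored)
print(recovered, position)  # [1, 0, 1, 1] 5
```

This code only handles a single flipped bit per codeword. Two flips in the same word would be miscorrected, which is one way corruption can still get through.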

**Shutting Down Safely**

Developers reduce the likelihood of crashes through error checking, input validation, and code that handles unexpected conditions instead of assuming they cannot happen. When a program does crash due to an exception, the operating system can shut that program down safely, releasing the resources it held and taking any other cleanup steps needed so that the rest of the system keeps running.

**Debugging and Error Reporting**

When a kernel panic occurs, it is often difficult to determine the cause. However, the panic screen generally includes information that helps a developer diagnose the issue. It also helps to ask what changed recently: a newly compiled or reconfigured kernel, a new piece of hardware, or a new driver are common suspects. If nothing has changed, failing memory is a likely candidate.

**Exception Handling in Modern Operating Systems**

Modern operating systems have improved their exception handling, and with hardware support such as the MMU they can contain most program errors without affecting the rest of the system. Some systems also use techniques such as checkpointing, which periodically saves state so that it can be restored after a crash and reboot.
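
Checkpointing can be sketched at application scale. The example below is a minimal illustration, not how any operating system implements it: `save_checkpoint` and `load_checkpoint` are hypothetical helpers, and the state is assumed to be small and JSON-serialisable. Writing to a temporary file and then renaming it aims to ensure that a crash in the middle of a save leaves the previous checkpoint in place.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "state.json")
    state = load_checkpoint(checkpoint, {"step": 0})
    for _ in range(3):
        state["step"] += 1
        save_checkpoint(state, checkpoint)
    # After a simulated restart, work resumes from the saved step.
    print(load_checkpoint(checkpoint, {"step": 0}))  # {'step': 3}
```

How often to checkpoint is a trade-off: saving more frequently loses less work after a crash but costs more time during normal operation.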

In conclusion, exceptions are an inherent part of programming and operating systems. By understanding how exceptions arise and how they are handled, developers can write more robust code that is less prone to errors, and can make better sense of what has happened when a program, or the whole machine, goes down.

**Transcript**

This is the story of what happens when things go wrong in your computer. When we considered the Heartbleed bug, we looked at what happens when a computer program doesn't do exactly what we expected it to. In that case the program had been written incorrectly, so rather than returning the right amount of data, it copied a whole load more data from the program to the other end of the system.

The Heartbleed bug is not the only thing that can happen when the computer goes wrong. That one was quite unique in that the program appeared to be working correctly; it carried on running. But we've all had the situation where we're running a piece of software, we do something, and then suddenly it crashes and the program quits. In those cases, what's happened is that the program has done something which the computer can't process. It's done something which we class as an exceptional thing, and the computer has thrown an exception saying, "Hang on, I can't process this." There are various things that can cause that.

So I'm going to move to one of my older computers and we'll just demonstrate what happens. We're using here the Atari Falcon, which was created in the early 90s and is roughly equivalent to the sort of power of machine that we'd get at that time. The reason I want to use this rather than a more modern system is that this system, in its base state as it came, could only really run one program at a time. So we can make it crash and see what happens a lot more easily than if we used a more modern system, and we can then talk about how that differs.

What I've done is I've written a program in machine code which will basically crash the system on command. It's very, very simple. We have a message that's printed out; it's a bit like the classic "hello world" program. It then waits for a key press to happen: it calls the operating system and says, "Get me the next key, and wait until that happens." And then we have this line here, which is what crashes the system. All this is trying to do is read from a memory location for which there is absolutely no memory in the system. The CPU in this system can access up to four gigabytes' worth of memory, but this machine's only got four megabytes, and about half a meg of ROM chip in there as well. So the memory address I'm going to access here does not exist in terms of the computer's memory. When I run this program, the computer can't satisfy the request we've asked of it, because there is no memory there.

It says "press any key to crash the system", so I'm going to press the spacebar, and you'll see that very briefly two bombs appeared on screen. This was the operating system telling us that the program had done something that it couldn't do, and so the operating system would stop the program running and return control to the operating system so we can carry on. Now, in that case the operating system was capable of recovering from the error. It was something where it said, "Hang on, I can recover from this."

I've also written another program, which is called "bad crash", which is a lot more destructive. It's very similar: we print out the same message and wait for the key press again. But here, rather than just reading from one memory location that we couldn't, it takes a whole lot of the computer's memory and fills it with zeros, and I've chosen a point where the operating system keeps its temporary variables and so on. So when I run this, it's going to totally trash the system, and we'll see what happens. I can assemble this, and if I now run this one, pressing the key to crash the system, I press the spacebar again and the system's frozen. This time it hasn't even been able to print out the error message to say what's gone wrong. That's because we've destroyed the operating system, or parts of the operating system's data as well, and it can't get out of it. I can't even hit Control-Delete to reset it; I've got to physically hit the hardware reset button to bring it back up.

One of the problems there is that we'd written to memory locations that belonged to something else; in this case it was the operating system. Now, on a modern computer (and in fact on the CPU in this machine as well, but it wasn't used by this operating system) you have something called a memory management unit, which can partition the memory between the different programs and can catch those sorts of errors and say to the operating system, "Hang on, this program is trying to do this, which you said it can't do." And so the operating system can stop it.

It's not just memory access, though. There are all sorts of things that the program might try to do which the computer can't necessarily satisfy. The classic example would be to divide by zero. There isn't really an answer to that; possibly you might class it as infinity. The computer can't calculate it, so it raises what's called an exception, and operating systems are written so they'll do something to catch that exception. On this operating system it would print some bombs, as we saw, on the side of the screen. On an old Amiga you'd get a Guru Meditation, on a Mac you'll get a segmentation fault, and you'll get various errors on Windows. In fact, one was so common in one version of Windows that, to make sure you never saw it again, they renamed the error message that came up.

So that's fine: when your program crashes, the operating system can catch it and, with a bit of help from the hardware, clean it up and make sure it doesn't affect anything else. But who watches the watchers? What happens if there's an exception thrown by the CPU by the operating system? Well, in that case the operating system can't recover, because it doesn't know the correct state it's meant to be in, and so it panics. So we get on our screen, if we're using Linux or a Mac, what's called a kernel panic, because the operating system can't continue. On Windows it's the infamous Blue Screen of Death. Generally it'll give you some information that's useful to a developer to see what's going wrong with your operating system. You have to reset the machine, and hopefully it'll start up again. Sometimes you're a little more unfortunate and you have to replace the whole machine.

It's worth bearing in mind that not all the exceptions that an operating system has to deal with are necessarily bad. When you hit a key on the keyboard, move your mouse, or get a network packet on your network, that's something exceptional which the operating system has to deal with, but it's able to process that and then carry on doing whatever it was doing beforehand. Occasionally you'll get something happening which just completely destroys the operating system's ability to continue.

Now, it's easy to understand why that might happen if you've got a bug in your code, like we saw with Heartbleed. But sometimes you'll be sitting using the machine, and it's been fine for years, weeks, months, whatever it is, and suddenly you'll get a kernel panic. It's helpful then to think about what might have changed. Perhaps, if you're a Linux or BSD freak and you're regularly compiling your kernel, you may have misconfigured it, and so you put in a new kernel and suddenly you start it up and it kernel panics. If you've just installed a new piece of hardware, it's possible the drivers for that, or the hardware itself, are incompatible, or are interacting with the system and causing two parts of your operating system to not work well together.

But occasionally, just for no reason at all, your computer will start kernel panicking and you won't necessarily know what causes it. A common cause is perhaps the memory, if the memory starts to fail. But occasionally, even if the memory is working correctly, there may still be a mistake in the memory. If a cosmic ray hits the actual memory cell, it can flip the bit from a zero to a one, or a one to a zero. This is more likely to happen if you're higher up, so people in Colorado, for example, are more likely to suffer from this on their computer systems, or a satellite in space is likely to have problems like this. There's not much you can do to prevent it: a cosmic ray is going to hit it if it's going to hit it. You can shield it, perhaps with a bit of metal, but you could also use memory that can actually detect that an error has happened. So one of the things you see is error-correcting memory, or ECC memory. But even so, you can still get to the point where the operating system gets corrupted and can't continue as you'd expect, and then it's just going to have to kernel panic.