Cleaning Up Incorrectly Labelled Data (C3W2L02)

Error Analysis in Machine Learning: A Comprehensive Guide

In machine learning, error analysis is the process of manually examining the mistakes an algorithm makes, typically on the development (dev) set, to understand where its errors come from. One source of error that often turns up during this examination is the labels themselves: some examples may simply be labeled incorrectly. The goal of error analysis is to quantify how much each category of error contributes to overall performance so that you can decide which problems are worth fixing first.
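To make this concrete, the lecture describes tallying a set of misclassified dev-set examples in a simple table, with one column per error category, including a column for incorrect labels. The sketch below shows one way such a tally could be kept in code; the example IDs and category names are hypothetical, and only the counting logic matters.

```python
from collections import Counter

# Hypothetical reviewer notes: for each misclassified dev-set example,
# record every error category that applies (an example can hit several).
review_notes = {
    12: ["dog misrecognized as cat"],
    47: ["blurry image"],
    98: ["incorrectly labeled"],        # the dev-set label itself was wrong
    101: ["great cat", "blurry image"],
    113: ["incorrectly labeled"],
}

category_counts = Counter()
for categories in review_notes.values():
    category_counts.update(categories)

total_reviewed = len(review_notes)
for category, count in category_counts.most_common():
    share = 100 * count / total_reviewed
    print(f"{category:28s} {count:3d} example(s) ({share:.0f}% of reviewed errors)")
```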

Ideally, the dev set and the test set should come from the same distribution. The dev set defines the target you are aiming at, and hitting that target should generalize to the test set. This is why any label corrections you apply to the dev set should also be applied to the test set: if the two sets drift apart, you can no longer trust that a classifier that looks better on the dev set will also be better on the test set.

One key principle, then, is to apply the same label-correction process to both the dev and test sets at the same time. If you hire someone to re-examine labels carefully, have them do it for both sets. Keeping the two sets consistent preserves a reliable evaluation environment for your algorithm.
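As one way of keeping the two sets in sync, the sketch below applies a single map of reviewer-verified corrections to both splits in the same pass. The data layout, example IDs, and file names are hypothetical; the lecture describes the principle, not a particular API.

```python
def apply_corrections(split, corrections):
    """Return a copy of the split with any corrected labels substituted in."""
    return [(ex_id, x, corrections.get(ex_id, y)) for ex_id, x, y in split]

# Tiny placeholder splits: (example_id, input, label).
dev_set = [(98, "img_098.jpg", 1), (99, "img_099.jpg", 0)]
test_set = [(205, "img_205.jpg", 1)]

corrections = {98: 0, 205: 0}   # reviewer-verified labels, keyed by example id

dev_set = apply_corrections(dev_set, corrections)
test_set = apply_corrections(test_set, corrections)   # same pass, same process
# The much larger training set is often left as-is: deep learning algorithms
# tend to be robust to random label errors there.
```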

Another important practice is to examine examples the algorithm got right, not only the ones it got wrong. It is tempting to re-check labels only on the misclassified examples, but doing so gives the algorithm an unfair advantage and yields an optimistically biased estimate of its error: the algorithm may have been "right" on some examples only because their labels were wrong. Double-checking a sample of the correctly classified examples as well keeps the error estimate honest, although with a very accurate classifier this is more work, since there are far more correct predictions than incorrect ones.
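One way to reduce that bias is to queue for review not only the disagreements but also a random sample of the examples the classifier agreed with. The sketch below is a minimal illustration using randomly generated placeholder labels; in practice `predictions` would come from your classifier and your dev set.

```python
import random

random.seed(0)

# Placeholder data: example_id -> (predicted label, current dev-set label).
predictions = {i: (random.randint(0, 1), random.randint(0, 1)) for i in range(500)}

wrong = [i for i, (pred, label) in predictions.items() if pred != label]
right = [i for i, (pred, label) in predictions.items() if pred == label]

# Review every disagreement plus an equally sized random sample of agreements,
# so label fixes are not applied only where they could flatter the classifier.
to_review = wrong + random.sample(right, k=min(len(wrong), len(right)))
print(f"queued {len(to_review)} of {len(predictions)} examples for label review")
```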

Not every part of the data set needs the same treatment, however. The dev and test sets are usually much smaller than the training set, so correcting their labels is relatively cheap and pays off directly in more trustworthy evaluation. Deep learning algorithms, on the other hand, are fairly robust to random label errors in the training data, so it is often reasonable to correct only the dev and test sets and not invest the much larger effort of re-labeling the training set. This does mean the training set may end up with a slightly different distribution from the dev and test sets, which is usually acceptable and is discussed further later in the course.

In building practical machine learning systems, manual analysis and human insight often play a crucial role. While deep learning algorithms can learn from large amounts of data, they still benefit from careful oversight. A common narrative is that building such systems is simply a matter of feeding data to an algorithm and training it, with little additional effort. In practice, that narrative understates how much manual analysis and human insight go into making these systems accurate and reliable.

To gain a deeper understanding of your algorithm's performance, consider setting aside time to manually examine examples where it made mistakes. This can help you identify areas for improvement and prioritize corrections accordingly. By taking the time to carefully review data, you can make more informed decisions about how to optimize your algorithm and improve its overall accuracy.

In conclusion, error analysis is a critical component of machine learning, and it requires careful consideration and manual oversight. By applying the same correction process to both the dev and test sets, examining examples the algorithm got right as well as those it got wrong, and being mindful of differences between dataset distributions, you can build a more robust and accurate machine learning system.

"WEBVTTKind: captionsLanguage: enthe data for your supervised learning problem comprises input X and output labels Y what have you going through your data you find that some of these upper labels Y are incorrect the up data which is incorrectly labeled is it worth your while to go in to fix up some of these labels let's take a look in the classification problem y equals 1 for cats and 0 for non cats so let's say you're looking through some data and that's a cat that smelly cat does the cat there's a cat that's not a cat that's a cat you know wait that's actually not a cat so this is an example with an incorrect label so I've used the term mislabeled examples to refer to if your learning algorithm I'll put the wrong value of Y but I'm going to say incorrectly labeled examples when to refer to if in the data set you have in a training set or the death set or the test set the label for Y whenever a human label assigned to this piece of data is actually incorrect and that's actually a dog so that Y really should have been 0 but maybe the laborer got that one wrong so if you find that your data has some incorrectly labeled examples what should you do well first let's consider the training set it turns out that deep learning algorithms are quite robust to random errors in the training set so so long as your errors or your incorrectly labeled examples so once those errors are not too far from random you know maybe sometimes on the labourer just wasn't paying attention or they accidentally randomly hit the wrong key on keyboard if the errors are reasonably random then it's probably ok to just leave the errors as they are and not spend too much time fixing them there's certainly no harm to going into your training set and be examining the labels and fixing them sometimes that is worth doing but your alpha might be okay even if you don't so long as the total data set size is big enough and actual percentage of errors is you know maybe not too high so I see a lot of machine learning algorithms that are trained even when we know that there are a few it's mistakes in the training side labels and usually works ok there is one caveat to this which is that people only albums are robust random errors they are less robust to systematic errors so for example if your labels consistently labeled white dogs as we had then that is a problem because your classifier will learn to classify on white colored dogs as cats for random errors or near random errors are usually not too bad for most deep learning algorithms now this discussion has focus on what to do about incorrectly labeled examples in your training set how about incorrectly labeled examples in your def set or test set if you're worried about the impact of incorrectly labeled examples on your deaf set or test set what I recommend you do is drink error analysis to add one extra column so that you can also count up the number of examples where the label Y was in your eggs so for example maybe when you count up the impact on a hundreds label deficit examples so you're going to find 100 examples where your concise output disagrees with the label in your def set and sometimes for a few of those examples your classifier disagrees with the label because the label was wrong rather than because you caused by wrong so maybe in this example you find that the label amidst a cat in the background so um put a check mark there to signify that example 98 had an incorrect label and maybe for this one the picture is actually a picture of a drawing of a cat garden in a row 
cat and maybe you want the neighborhoods of label that y equals zero while rather than y equals one and so you put another check mark there and just as you count up the percent of errors due to other categories like we saw in a previous video you'd also count up percentage of errors due to incorrect labels whether Y value in the DEF set was wrong and that accounted for while your learning algorithm made the prediction that differed from what the label on your data says so the question now is is it worthwhile going in to try to fix up you know this six percent of the incorrectly labeled examples my advice is if it makes a significant difference to your ability to evaluate algorithms on your def set then go ahead and spend the time to fix the incorrect labels but if it doesn't make a significant difference to your ability to use the DEF set to value classifiers then it might not be the best use of your time let me show you an example that illustrates what I mean by this so three numbers I recommend you look at to try to decide if it's worth going in and reducing the number of miss labor examples are the following recommend you look at the overall def set error and so in the example we had from the previous video we said that maybe a system has 90% overall accuracy so 10 percent error then you should look at the number of errors or the percentage of errors they're due to incorrect labels so it looks like in this case 6 percent of the errors are due to incorrect labels so six percent of 10 percent is 0.6 percent and then you should look at errors due to all other causes so if you made 10 percent error on your def set and a point 6 percent of those are because the labels are wrong then the remainder 9 point 4 percent of them are due to other causes such as miss recognizing dark spring attacks great calves and very images so in this case I would say there's nine point four percent worth of error that you could focus on fixing whereas you know the errors due to incorrect labels is a relatively small fraction of the overall set of errors so by all means go in and fix these incorrect labels if you want but this may be not the most important thing to do right now now let's take another example suppose you've made a lot more progress on your learning problem so instead of 10 era let's say you brought the arrows down to two percent but still zero point six percent of your overall errors are due to incorrect labels so now if you were to examine a set of mislabel def set images so set that comes from this two percent of deficit data you're misleading then a lot a very large fraction of them 0.6 divided by two percent so there's actually 30 percent rather than 6 percent of your labels of your incorrect examples are actually due to incorrectly labeled examples and so errors due to other causes are now 1.4 percent when such a higher fraction of your mistakes at least as measured on your def set are due to incorrect labels then it maybe seems much more worthwhile to fix up the incorrect labels in your depth set and if you remember the goal of the death set the main purpose of the death set is you want to really use it to help you select between two qualifiers a and B so they're trying out two confines a and B and one has 2.1 percent error and the other has 1.9 percent error on your death sentence but if you don't trust your deathbed anymore to be correctly telling you whether this classifier is actually better than this because you're open 6 percent of these mistakes I'll do two incorrect labels then that's 
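The two scenarios above come down to a small amount of arithmetic: multiply the overall dev-set error by the fraction of reviewed errors that turned out to be mislabeled, and compare that to what remains. A minimal sketch of that calculation follows; the function name and the fractions-as-decimals convention are my own, not from the lecture.

```python
def split_dev_error(overall_error, frac_errors_mislabeled):
    """Split overall dev-set error into label noise vs. everything else."""
    due_to_labels = overall_error * frac_errors_mislabeled
    other_causes = overall_error - due_to_labels
    return due_to_labels, other_causes

# First scenario: 10% dev error, 6% of reviewed errors were mislabeled.
print(split_dev_error(0.10, 0.06))   # ~(0.006, 0.094): 0.6% vs 9.4%

# Second scenario: 2% dev error, the same 0.6% absolute label noise now
# makes up 30% of the reviewed errors (0.006 / 0.02).
print(split_dev_error(0.02, 0.30))   # ~(0.006, 0.014): 0.6% vs 1.4%
```

In the first case label noise is a small slice of the problem; in the second it accounts for a large share of the remaining error, which is exactly when fixing dev-set labels becomes worth the effort.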
Now, if you decide to go into your dev set and manually re-examine and fix up some of the labels, here are a few additional guidelines or principles to consider.

First, I would encourage you to apply whatever process you use to both your dev and test sets at the same time. We talked previously about why you want your dev and test sets to come from the same distribution: the dev set tells you where to aim the target, and when you hit it you want that to generalize to the test set, and your team works more efficiently when the dev and test sets come from the same distribution. So if you go in to fix up your dev set, apply the same process to your test set to make sure they continue to come from the same distribution. If you hire someone to examine your labels more carefully, have them do it for both your dev and test sets.

Second, I would urge you to consider examining examples your algorithm got right as well as ones it got wrong. It's easy to look at the examples your algorithm got wrong and check whether any of those need to be fixed, but it's possible that there are some examples it got right that should also be fixed. If you fix only the ones your algorithm got wrong, you end up with a more biased estimate of its error; it gives your algorithm a little bit of an unfair advantage if you double-check only what it got wrong and not what it got right, because it might have gotten something right just by luck, and fixing the label would cause it to go from being right to being wrong on that example. This second point isn't always easy to follow, so it isn't always done. The reason is that if your classifier is very accurate, it's getting a lot fewer things wrong than right: with 98 percent accuracy it gets 2 percent of things wrong and 98 percent right, so it's much quicker to examine and validate the labels on the 2 percent of the data it got wrong, and it takes much longer to validate the labels on the other 98 percent. This isn't always done, but it's something to consider.

Finally, if you go into your dev and test data to correct some of the labels, you may or may not decide to apply the same process to the training set. Remember, we said earlier in this video that it's actually less important to correct the labels in your training set, and it's quite possible you'll decide to correct only the labels in your dev and test sets, which are usually much smaller than the training set, rather than invest all the extra effort needed to correct the labels in a much larger training set. That is actually okay. We'll talk later this week about some processes for handling the case where your training data has a different distribution than your dev and test data; learning algorithms are quite robust to that. It's super important that your dev and test sets come from the same distribution, but if your training set comes from a slightly different distribution, that's often a pretty reasonable thing to accept, and we'll talk more about how to handle it later this week.

I'd like to wrap up with a couple of pieces of advice. First, deep learning researchers sometimes like to say things like, "I just fed the data to the algorithm, trained it, and it worked." There is a lot of truth to that; in the deep learning era there is a trend toward feeding data to an algorithm and just training it, with less hand-engineering and less reliance on human insight. But in building practical systems, there is often more manual analysis and more human insight involved than deep learning researchers sometimes like to acknowledge. Second, I've seen some engineers and researchers be reluctant to look at examples manually. Maybe it's not the most interesting thing to do, to sit down and look at a hundred or a couple hundred examples and count up the errors, but this is something I still do myself. When I'm leading a machine learning team and I want to understand what mistakes it is making, I will actually go in and look at the data myself and try to count the fraction of errors, and I think these minutes, or maybe a small number of hours, of counting data can really help you prioritize where to go next. I find it a very good use of time, and I'd urge you to consider doing it if you're building a machine learning system and trying to decide which ideas or directions to prioritize.

So that's it for the error analysis process. In the next video, I'd like to share with you some thoughts on how error analysis fits into how you might go about starting on a new machine learning project.