#31 Machine Learning Engineering for Production (MLOps) Specialization [Course 1, Week 3, Lesson 7]

The Importance of Human-Level Performance (HLP) in Machine Learning

Human-level performance (HLP) is an important baseline in machine learning, particularly for applications where how well a human performs is a meaningful reference. However, when the labels themselves come from humans, it's common to encounter issues with consistency and accuracy. In this article, we'll explore why HLP matters, how to interpret it, and what a gap between HLP and 100% accuracy can tell you.

Raising Human-Level Performance

When the ground truth label y comes from a human, an HLP well below 100% may simply indicate that the labeling instructions or labeling convention is ambiguous. In that situation, HLP measures how well one labeler predicts another labeler's judgment, not how well anyone predicts an externally defined ground truth (such as the result of a medical biopsy). Ambiguity shows up in many ways: in visual inspection, two inspectors may label the same scratch differently until they agree on a convention such as "a scratch longer than 0.3 mm is a defect" (in the lesson's example, adopting that convention raised HLP from 66.7% to 100% on six examples), and in speech recognition, transcribing a filler word as "um," versus "um…" is an equally arbitrary choice. In these cases, improving labeling consistency raises human-level performance, which ultimately benefits the actual application.
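To make this concrete, here is a minimal sketch of the visual inspection example. The scratch lengths and initial labels are hypothetical, but the pattern follows the lesson: agreeing on a 0.3 mm defect threshold raises measured HLP from 66.7% to 100%.

```python
# Minimal sketch (hypothetical data): how agreeing on a labeling convention
# raises HLP, in the spirit of the lesson's 0.3 mm scratch example.
lengths_mm = [0.4, 0.2, 0.5, 0.6, 0.2, 0.1]  # scratch lengths on six parts

# Before the convention exists, two inspectors label borderline scratches
# differently. Inspector 2's labels play the role of "ground truth" here.
inspector_1 = [1, 1, 1, 1, 0, 1]
inspector_2 = [1, 0, 1, 1, 0, 0]

def agreement(a, b):
    """Fraction of examples where two labelers agree -- a simple HLP estimate."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(f"HLP before convention: {agreement(inspector_1, inspector_2):.1%}")  # 66.7%

# After both inspectors adopt "a scratch longer than 0.3 mm is a defect",
# the label becomes a deterministic function of the measured length, so
# both inspectors now produce identical labels.
THRESHOLD_MM = 0.3
consistent = [1 if length > THRESHOLD_MM else 0 for length in lengths_mm]
print(f"HLP after convention:  {agreement(consistent, consistent):.1%}")  # 100.0%
```

Note what happened: raising HLP to 100% makes it essentially impossible for a learning algorithm to "beat HLP", but the data is now cleaner and more consistent.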

The Benefits of Improved Labeling Consistency

By improving labeling consistency, and thereby raising HLP, we create cleaner and more consistent data, which is essential for training effective learning algorithms. This may seem counterintuitive at first, since it makes it harder for a machine learning algorithm to beat human-level performance. But the goal is a system that makes accurate predictions, not a result that merely beats HLP. With improved labeling consistency (one way to measure it is sketched after this list), we can:

* Develop more accurate and reliable machine learning models

* Reduce errors and inconsistencies in the data

* Improve overall system performance and reliability
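Before rewriting labeling instructions, it helps to quantify how much labelers currently disagree. A minimal sketch, assuming scikit-learn is available and using hypothetical annotator labels:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same ten examples.
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Raw agreement: fraction of identical labels (the same quantity used
# as the simple HLP estimate above).
raw = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa corrects for agreement expected by chance; a value well
# below 1.0 is a hint that the labeling instructions are ambiguous.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"raw agreement: {raw:.0%}, Cohen's kappa: {kappa:.2f}")
```

Raw agreement and kappa together give a quick read on whether low HLP reflects genuine task difficulty or merely inconsistent instructions.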

The Gap Between HLP and 100% Accuracy

While HLP is a useful metric, it's essential to recognize that a gap between HLP and 100% accuracy often reflects problems with the labels themselves rather than a hard limit on performance. This can happen when:

* Labeling conventions are unclear or ambiguous

* Human labels are subjective and influenced by individual biases

* The data has quality issues, such as missing or corrupted values

In these cases, improving labeling consistency is crucial not only because it raises HLP, but because it yields cleaner and more accurate data for machine learning algorithms. By addressing these issues at the source, we create a more reliable and effective machine learning pipeline.
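When the root cause is an ambiguous convention, one practical remedy is to encode the agreed convention in code, so every label passes through the same normalization. Here is a minimal sketch for the speech transcription ambiguity mentioned earlier; the specific rule (always transcribe the filler as "um,") is an assumption chosen purely for illustration:

```python
import re

def normalize_transcript(text: str) -> str:
    """Apply a hypothetical agreed labeling convention to a transcript."""
    # Assumed convention: fillers are always written as "um," (comma),
    # never "um..." or bare "um" -- one of the ambiguities the lesson mentions.
    text = re.sub(r"\bum\b(?:\.\.\.|…)?", "um,", text)
    # Clean up doubled commas when the filler already had one.
    return re.sub(r",\s*,", ",", text)

print(normalize_transcript("um... I think um the model, um, is ready"))
# -> "um, I think um, the model, um, is ready"
```

Running every new and existing transcript through the same normalizer keeps labels consistent even as the labeling team changes.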

Labeling Consistency for Structured Data

Inconsistent labeling instructions are usually discussed in the context of unstructured data, and structured data problems are less likely to involve human labelers at all, so HLP is used less often there. Still, there are exceptions where human labels (and therefore HLP) come into play:

* User ID merging: a human may label whether two account records belong to the same person, or whether activity on an account looks suspicious.

* GPS-based predictions: humans can label the mode of transportation (on foot, bike, car, or bus) from a GPS trace. Buses stop at bus stops, so the trace carries a usable signal; a toy version of this is sketched below.

* Fraud detection: deciding whether a transaction is fraudulent, or whether an account is spam or bot-generated, often requires human judgment.

In these situations, it's quite reasonable to ask a human to label the data on a first pass, and then train a learning algorithm on those labels to make the same predictions at scale. The usual questions about HLP still apply: it remains a useful baseline for what's possible, and a low HLP often points to inconsistent labels that a clearer labeling standard can fix.
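As a toy illustration of the bus example, the sketch below flags a GPS trace as bus-like when its stationary points repeatedly fall near known bus-stop locations. The coordinates, radius, and dwell threshold are all invented for illustration; the point is only that the trace carries enough signal for a human (or, eventually, a model trained on human labels) to tell bus from car.

```python
from math import hypot

# Hypothetical bus-stop coordinates (meters on a local grid) and radius.
BUS_STOPS = [(0.0, 0.0), (400.0, 0.0), (800.0, 0.0)]
DWELL_RADIUS_M = 30.0

def near_bus_stop(point):
    """True if a stationary point lies within the dwell radius of any stop."""
    x, y = point
    return any(hypot(x - sx, y - sy) <= DWELL_RADIUS_M for sx, sy in BUS_STOPS)

def looks_like_bus(stationary_points, min_stop_dwells=2):
    """Crude first-pass heuristic: buses stop at bus stops, cars usually don't."""
    dwells = sum(near_bus_stop(p) for p in stationary_points)
    return dwells >= min_stop_dwells

# Points where a hypothetical vehicle paused during its trip.
trace = [(2.0, 1.0), (398.0, 3.0), (650.0, 5.0)]
print(looks_like_bus(trace))  # True: the vehicle dwelled near two known stops
```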

Conclusion

Human-level performance (HLP) is an essential reference in machine learning for problems where how well humans perform tells us something about what's possible. Measuring HLP helps estimate achievable performance and drives error analysis and prioritization. When HLP comes out well below 100%, though, ask whether part of the gap is due to inconsistent labeling instructions or conventions. If it is, improving labeling consistency will raise HLP and, more importantly, produce cleaner and more consistent labels for your learning algorithm. By acknowledging these challenges and addressing them at the root, we can build a more reliable and effective machine learning pipeline that benefits both humans and machines.