#16 Machine Learning Engineering for Production (MLOps) Specialization [Course 1, Week 2, Lesson 8]

**Evaluating Learning Algorithms: A Comprehensive Approach**

Evaluating a learning algorithm's performance is essential for judging how well it actually works, especially on skewed datasets where raw accuracy can be misleading. In this section, we discuss how to evaluate learning algorithms on such problems using metrics such as precision, recall, and the F1 score, and we close with performance auditing.

**Precision and Recall: Essential Metrics for Binary Classification**

In binary classification problems with skewed datasets, precision and recall are two essential metrics for evaluating a learning algorithm. Precision measures the proportion of true positives (correctly predicted positive examples) among all predicted positive examples. Recall, on the other hand, measures the proportion of true positives among all actual positive examples. When the dataset is skewed, with far more negative examples than positive ones, raw accuracy is not very useful: a trivial model that always predicts the negative class achieves high accuracy while detecting nothing, so precision and recall give a much clearer picture of the algorithm's performance.
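As a minimal sketch of these two definitions, the snippet below computes precision and recall directly from confusion-matrix counts. The counts are taken from the lesson's worked example (905 true negatives, 68 true positives, 18 false negatives, 9 false positives), which also appears in the transcript at the end of these notes.

```python
# Precision and recall from confusion-matrix counts.
# Counts taken from the lesson's worked example:
tp = 68   # true positives:  predicted 1, actually 1
fp = 9    # false positives: predicted 1, actually 0
fn = 18   # false negatives: predicted 0, actually 1
tn = 905  # true negatives:  predicted 0, actually 0

precision = tp / (tp + fp)  # of all predicted positives, how many were right
recall = tp / (tp + fn)     # of all actual positives, how many were caught

print(f"precision = {precision:.1%}")  # ~88.3%
print(f"recall    = {recall:.1%}")     # ~79.1%
```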

Let us consider an example where we have 914 negative examples and 86 positive examples. If the learning algorithm outputs zero all the time, the confusion matrix looks like this: 914 times it outputs zero with a ground truth of zero (true negatives), and 86 times it outputs zero with a ground truth of one (false negatives). Precision, which is true positives divided by true positives plus false positives, turns out to be 0 / (0 + 0), which is not defined. This degenerate case only arises when the algorithm outputs no positive labels at all; as soon as it predicts at least one positive, the denominator is nonzero and precision is a well-defined number.

**Recall: A Critical Metric for Skewed Datasets**

Recall, on the other hand, is true positives divided by true positives plus false negatives. In this case it is 0 / (0 + 86), which is zero percent. The always-predict-zero algorithm therefore achieves 0% recall, an easy way to flag that it is not detecting any useful positive examples. And when recall is this low, even a seemingly high precision value is not very useful.
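To make the degenerate case concrete, here is a small sketch of the always-predict-zero model on the 914-negative / 86-positive dataset. Guarding the precision computation against division by zero is one reasonable convention (scikit-learn, for instance, exposes a `zero_division` argument for this); returning `None` here is just an illustration.

```python
# Always-predict-zero model on a skewed dev set:
# 914 negatives, 86 positives, and no positive predictions at all.
tp, fp = 0, 0    # the model never predicts the positive class
fn, tn = 86, 914

# Precision would be 0 / (0 + 0): undefined, so report None instead of dividing.
precision = tp / (tp + fp) if (tp + fp) > 0 else None
recall = tp / (tp + fn)  # 0 / 86 = 0.0

print("precision:", precision)   # None: undefined for this degenerate model
print(f"recall: {recall:.0%}")   # 0%: flags that no positives are detected
```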

**Combining Precision and Recall: F1 Score**

Sometimes one model has better recall and another has better precision, so how do you compare them? A common way is to combine precision and recall into the F1 score. The intuition behind the F1 score is that you want an algorithm to do well on both precision and recall; doing badly on either one is a problem. The F1 score therefore combines precision and recall in a way that emphasizes whichever of the two is worse. Technically, it is the harmonic mean of precision and recall, which is like an average that places more weight on the lower number.

The formula for the F1 score is: F1 = 2 \* (Precision \* Recall) / (Precision + Recall). This combines precision and recall into a single number, which makes it easier to compare models on a binary classification problem. Because the F1 score summarizes how well your model does on both metrics at once, it also helps you identify where the model needs improvement.
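A small sketch of the harmonic-mean behavior: the function below applies the F1 formula to two models. Model 1 uses the precision and recall from the lesson's worked example; Model 2's numbers are hypothetical, chosen only to illustrate how very low recall drags the F1 score down.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; dominated by the lower value."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Model 1: the lesson's worked example (precision 88.3%, recall 79.1%).
print(f"Model 1 F1: {f1_score(0.883, 0.791):.1%}")  # ~83.4%

# Model 2: hypothetical high precision but very low recall.
print(f"Model 2 F1: {f1_score(0.970, 0.020):.1%}")  # ~3.9%, dragged down by recall
```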

**Multi-Class Classification Problems**

In addition to binary classification problems with skewed datasets, the concepts of precision, recall, and F1 score also apply to multi-class classification problems. When every class of interest is rare, overall accuracy can stay high even while the algorithm misses many examples, so it is more informative to compute precision, recall, and an F1 score for each class separately.

For example, in defect detection applications, where there are multiple rare classes (e.g., scratches, dents, pit marks, discoloration), it is essential to evaluate the algorithm's performance on each class individually using per-class precision and recall, and then combine them into a per-class F1 score. This gives you a single-number evaluation metric for each defect type, which helps you benchmark against human-level performance and prioritize which defect type to work on next. In manufacturing, many factories emphasize high recall, because letting a defective phone ship is worse than having a human re-examine one that was flagged incorrectly.
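As a sketch of per-class evaluation, assuming scikit-learn is available and using the four defect names from the lesson as labels, `precision_recall_fscore_support` with `average=None` returns precision, recall, and F1 for each class separately. The label and prediction arrays below are tiny made-up placeholders, just to show the shape of the computation.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

classes = ["scratch", "dent", "pit mark", "discoloration"]

# Tiny made-up dev-set labels and predictions, only to illustrate the API.
y_true = np.array(["scratch", "scratch", "dent", "pit mark", "discoloration", "dent"])
y_pred = np.array(["scratch", "dent",    "dent", "pit mark", "discoloration", "dent"])

# average=None -> one precision / recall / F1 value per class, in `classes` order.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=classes, average=None, zero_division=0
)

for cls, p, r, f in zip(classes, precision, recall, f1):
    print(f"{cls:>14}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```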

**Performance Auditing: A Key Step in Model Development**

Finally, we will discuss the importance of performance auditing as a crucial step in model development. Performance auditing involves evaluating the model's performance on a dataset after training and before deployment. This step is essential to ensure that the model is working well enough before pushing it out to production.
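One simple way to operationalize such an audit is a pre-deployment gate that compares evaluation metrics against minimum thresholds. The sketch below is hypothetical: the threshold values, the metric numbers (apart from the scratch figures quoted in the lesson), and the `audit` helper are illustrative placeholders, not anything prescribed by the course.

```python
# Hypothetical pre-deployment audit gate: block the release if any
# per-class metric on the evaluation set falls below its threshold.

# Per-class metrics computed on a held-out evaluation set.
# Scratch numbers come from the lesson; the rest are placeholders.
eval_metrics = {
    "scratch":       {"precision": 0.821, "recall": 0.992},
    "dent":          {"precision": 0.900, "recall": 0.970},
    "pit mark":      {"precision": 0.780, "recall": 0.950},
    "discoloration": {"precision": 0.850, "recall": 0.960},
}

# Placeholder thresholds; factories often prioritize recall, so it is gated higher.
MIN_RECALL = 0.95
MIN_PRECISION = 0.75

def audit(metrics):
    """Return a list of human-readable failures; an empty list means the audit passed."""
    failures = []
    for cls, m in metrics.items():
        if m["recall"] < MIN_RECALL:
            failures.append(f"{cls}: recall {m['recall']:.2f} < {MIN_RECALL}")
        if m["precision"] < MIN_PRECISION:
            failures.append(f"{cls}: precision {m['precision']:.2f} < {MIN_PRECISION}")
    return failures

problems = audit(eval_metrics)
if problems:
    print("Audit failed:")
    for p in problems:
        print(" -", p)
else:
    print("Audit passed; model can be promoted to production.")
```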

By performing performance audits, you can identify areas for improvement, such as tuning hyperparameters or retraining the model with new data, before the model reaches users. In conclusion, evaluating learning algorithms requires a comprehensive approach: use precision, recall, and the F1 score to measure performance on skewed and multi-class problems, and audit the resulting model before deployment. Together, these steps help ensure that your algorithm is reliable, accurate, and effective in real-world applications.

"WEBVTTKind: captionsLanguage: endata sets where the ratio of positive to negative examples is very far from 50 50 are called skewed data sets let's look at some special techniques for handling them let me start with a manufacturing example if a manufacturing company makes smartphones hopefully the vast majority of them are not defective so if 99.7 have no defect and are labeled y equals zero and only a small fraction is labeled y equals one then print zero which is not a very impressive learning algorithm will achieve 99.7 accuracy or medical diagnosis which was the example we went through in an earlier video if 99 of patients don't have a disease then an algorithm that predicts no one ever has a disease will have 99 accuracy or speech recognition if you're building a system for wake word detection sometimes also called trigger word detection these are systems that listen and see if you say a special word like alexa or ok google or hey siri most of the time that special wake word or trigger word is not being spoken by anyone at that moment in time so when i had built wake word detection systems the data sets were actually quite skewed one of the datasets i use had 96.7 negative examples and 3.3 positive examples when you have a very skewed data set like this raw accuracy is not that useful a metric to look at because prints zero can get very high accuracy instead it's more useful to build something called the confusion matrix a confusion matrix is a matrix where one axis is labeled with the actual label so it's the ground truth label y equals zero or y equals one and whose other axis is labeled with the prediction so was your learning algorithms prediction y equals zero or y equals one so if you're building a confusion matrix you throw in with each of these four cells the total number of examples say the number of examples in your dev set in your development set that fell into each of these four buckets let's say that 905 examples in your development set had a ground truth label of y equals zero and your algorithm got it right then you might write 905 there these examples are called true negatives because they were actually negative and your algorithm predicted it was negative next let's throw in the true positives which are the examples where the actual ground truth label is one and the prediction is one maybe there are 68 of them true positives the false negatives are the examples where your algorithm thought it was negative but it was wrong the actual label is positive so these are false negatives some of the 18 of that and lastly false positives are the ones where your algorithm thought it was positive but that turned out to be false so 9 false positives the precision of the learning algorithm if i sum up over the columns 905 plus 9 is 940 and 18 68 is 86. 
so this is indeed a pretty skewed data set where out of a thousand examples there were 914 negative examples in just 86 positive examples so 8.6 positive 91.4 percent negative the precision of your learning algorithm is defined as follows it asks of all the examples that the average thought were positive examples what fraction did it get right so precision as is defined as true positives divided by true positives plus false positives in other words it looks at this row so of all the examples that your algorithm thought had a label of 1 which is 68 plus 9 of them 68 of them were actually right so the position is 68 over 68 plus 9 which is a 88.3 in contrast the recall asks of all the examples that were actually positive what fraction did your algorithm get right so recall is defined as true positives divided by true positives plus false negatives which in this case is 68 over 68 plus 18 which is 79.1 and the metrics of precision and recall are more useful than raw accuracy when it comes to evaluating the performance of learning algorithms on very skewed data sets let's see what happens if your learning algorithm outputs zero all the time it turns out it won't do very well on recall taking this example of where we had 914 negative examples and 86 positive examples if the algorithm outputs 0 all the time this is what the confusion matrix would look like 914 times it outputs zero with a ground truth of zero and 86 times it output zero with a ground truth of one so precision is true positives divided by true positives plus false positives which in this case turns out to be zero over zero plus zero which is not defined and unless your algebra actually output no positive labels at all you get some of the number that hopefully isn't zero over zero but more importantly if you look at recall which is true positives over true positives plus false negatives this turns out to be zero over zero plus 86 which is zero percent and so the print zero algorithm achieves zero percent recall which gives you an easy way to flag that this is not detecting any useful positive examples and the learning algorithm with some precision evens the high value precision is not that useful usually if this recall is so low so the standard metrics when i look at when comparing different models on skewed data sets are precision and recall where looking at these numbers helps you figure out and of all the examples that are truly positive examples what fraction did the algorithm manage to catch sometimes you have one model with a better recall and a different model with a better precision so how do you compare two different models there's a common way of combining precision and recall using this formula which is called the f1 score one intuition behind the f1 score is that you want an algorithm to do well on both precision and recall and if it does worse on either precision or recall that's pretty bad and so f1 is a way of combining precision and recall that emphasizes whichever of p or r positional recall is worse in mathematics this is technically called a harmonic mean between precision and recall which is like taking an average but placing more emphasis on whichever is the lower number so if you compute the f1 score of these two models it turns out it to be 83.4 using the formula below here and model 2 has a very bad recall so its f1 score is actually quite low as well and this lets us tell maybe more clearly that model 1 appears to be a superior model than model 2. 
for your application you may have a different weighting between precision and recall and so f1 isn't the only way to combine precision and recall it's just one metric that's commonly used for many applications let me step through one more example where precision and recall is useful so far we've talked about the binary classification problem with skewed datasets it turns out to also frequently be useful for multi-class classification problems if you're detecting defects in smartphones you may want to detect scratches on them or dents or pit marks this is what it looks like if someone took a screwdriver and poked a cell phone or discoloration of the cell phone's lcd screen or other material maybe all four of these defects are actually quite rare but you might want to develop an algorithm that can detect all four of them one way to evaluate how your album is doing on all four of these defects each of which can be quite rare would be to look at precision and recall of each of these four types of defects individually in this example the learning algorithm has 82.1 precision on finding scratches and 99.2 recall you find in manufacturing that many factories will want high recall because you really don't want to let the phone go out that is defective but if an algorithm has slightly lower precision that's okay because through a human re-examining the phone they will hopefully figure out that the phone is actually okay so many factories will emphasize high recall and by combining precision recall using f1 as follows this gives you a single number evaluation metric for how well your lram is doing on the four different types of defects and can also help you benchmark to human level performance and also prioritize what to work on next so instead of accuracy on scratches dense pit marks and discolorations using f1 score can help you to prioritize the most fruitful type of defect to try to work on and the reason we use f1 is because maybe all four defects are very rare and so accuracy would be very high even if the algorithm was missing a lot of these defects so i hope that these tools will help you both evaluate your algorithm as well as prioritize what to work on both in problems with skewed data sets and for problems with multiple rare classes now to wrap up this section on error analysis there's one final concept i hope to go over with you which is performance auditing i found for many projects this is a key step to make sure that your learning algorithm is working well enough before you push it out to a production deployment let's take a look at performance auditingdata sets where the ratio of positive to negative examples is very far from 50 50 are called skewed data sets let's look at some special techniques for handling them let me start with a manufacturing example if a manufacturing company makes smartphones hopefully the vast majority of them are not defective so if 99.7 have no defect and are labeled y equals zero and only a small fraction is labeled y equals one then print zero which is not a very impressive learning algorithm will achieve 99.7 accuracy or medical diagnosis which was the example we went through in an earlier video if 99 of patients don't have a disease then an algorithm that predicts no one ever has a disease will have 99 accuracy or speech recognition if you're building a system for wake word detection sometimes also called trigger word detection these are systems that listen and see if you say a special word like alexa or ok google or hey siri most of the time that 
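As a quick check of the arithmetic in the transcript above, the following sketch, assuming scikit-learn is available, rebuilds the 905 / 9 / 18 / 68 confusion matrix from synthetic label and prediction arrays and reproduces the quoted precision, recall, and F1 figures.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Synthetic dev set reproducing the transcript's cell counts:
# 905 true negatives, 9 false positives, 18 false negatives, 68 true positives.
y_true = np.array([0] * 905 + [0] * 9 + [1] * 18 + [1] * 68)
y_pred = np.array([0] * 905 + [1] * 9 + [0] * 18 + [1] * 68)

# scikit-learn's binary confusion matrix is laid out [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))
print(f"precision = {precision_score(y_true, y_pred):.1%}")  # ~88.3%
print(f"recall    = {recall_score(y_true, y_pred):.1%}")     # ~79.1%
print(f"F1        = {f1_score(y_true, y_pred):.1%}")         # ~83.4%
```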