When to Change Dev/Test Sets (C3W1L07)

**Defining Evaluation Metrics and Testing Sets: A Guide to Machine Learning Performance Optimization**

In machine learning, defining an evaluation metric is crucial for determining how well a classifier or algorithm performs on a specific task. The metric should capture what you actually want to achieve, whether that is screening out pornographic content, predicting medical diagnoses, or recognizing images. Getting this right can be complex and nuanced, and there is often more than one reasonable choice.

One key principle is to break the machine learning problem into distinct steps. Using a target analogy: one step is to place the target, and a separate step is to aim and shoot at it. Defining an evaluation metric places the target, since it specifies exactly what the algorithm should do well on, while figuring out how to do well on that metric is a completely separate problem.

**The Importance of Defining Evaluation Metrics**

Defining an evaluation metric is essential for judging the performance of a classifier or algorithm: without a clear statement of what you want to achieve, you cannot tell whether the algorithm is working effectively. In machine learning, the optimization goal is typically to minimize a loss or cost function that measures the difference between the predicted output and the actual output.

Returning to the target analogy, the first knob is to define a metric that captures what you want to achieve, that is, to specify what success looks like for your algorithm on your specific task. Only once the metric is defined does it make sense to turn the second knob: optimizing the algorithm to perform better on that metric.
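As a concrete illustration, the simplest such metric is plain classification error on the dev set: the fraction of dev-set examples the classifier gets wrong. Here is a minimal sketch in Python (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def classification_error(y_pred, y_true):
    """Plain dev-set classification error:
    (1 / m_dev) * sum over i of 1{y_pred[i] != y_true[i]}.
    """
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    m_dev = len(y_true)                      # number of dev-set examples
    return np.sum(y_pred != y_true) / m_dev  # fraction of misclassified examples

# Hypothetical usage: compare two classifiers on the same dev set.
# error_a = classification_error(preds_a, dev_labels)   # e.g. 0.03 -> 3% error
# error_b = classification_error(preds_b, dev_labels)   # e.g. 0.05 -> 5% error
```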

**Real-World Example: Screening Out Pornographic Images**

Consider a cat classifier whose job is to surface cat pictures for users. Classifier A achieves 3% classification error and classifier B achieves 5%, so A looks better on the metric. But suppose A, for some reason, lets pornographic images through, while B does not. From the company's and the users' point of view, B is clearly the better algorithm, yet the metric ranks A first. When an evaluation metric no longer rank-orders algorithms the way you and your users would, that is the signal to change the metric. One common fix is to weight the errors: count a misclassified pornographic image 10 or even 100 times more heavily than an ordinary mistake, so that any algorithm that lets porn through pays a large penalty.
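A sketch of that weighted metric follows. It assumes weight 1 for non-pornographic images and a configurable `porn_weight` (10 or 100) for pornographic ones, normalized by the sum of the weights so the error stays between 0 and 1; the function name and signature are illustrative:

```python
import numpy as np

def weighted_classification_error(y_pred, y_true, is_porn, porn_weight=10.0):
    """Weighted dev-set error: each example gets weight w_i = 1 if it is not
    pornographic and porn_weight (e.g. 10 or 100) if it is, and the sum of
    weighted mistakes is normalized by sum(w_i) so the error stays in [0, 1].
    """
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    weights = np.where(np.asarray(is_porn), porn_weight, 1.0)  # w_i per example
    mistakes = (y_pred != y_true).astype(float)                # 1{y_hat_i != y_i}
    return np.sum(weights * mistakes) / np.sum(weights)
```

Note that applying this weighting requires going through the dev and test sets and labeling which images are pornographic; only then can the weights be assigned.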

A second scenario involves the dev/test set rather than the metric itself. Suppose cat classifiers A and B have 3% and 5% error respectively on a dev/test set built from high-quality images downloaded from the web. On that benchmark A looks better. But when the product ships as a mobile app, users upload blurrier, less well-framed pictures, and on those images classifier B actually performs better, because that kind of low-quality image is barely represented in the original dev/test set.

**The Problem with Evaluating on Different Test Sets**

In this scenario, the problem is that the dev/test set no longer represents the data the application actually sees. The metric was computed on high-quality, well-framed web images, but the deployed app receives lower-quality user uploads, and on that distribution classifier B comes out ahead. Doing well on the dev/test set has stopped being predictive of doing well on the application.
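A minimal sketch of that kind of check, assuming you have collected a second dev set sampled from actual user uploads (the model names, dataset names, and the `classification_error` helper reused from the earlier sketch are all illustrative):

```python
def compare_on_dev_sets(model_a, model_b, dev_sets):
    """Report each model's error on several dev sets. `dev_sets` maps a name
    to a (features, labels) pair; `classification_error` is the helper from
    the earlier sketch."""
    for name, (features, labels) in dev_sets.items():
        err_a = classification_error(model_a.predict(features), labels)
        err_b = classification_error(model_b.predict(features), labels)
        print(f"{name}: A = {err_a:.1%}, B = {err_b:.1%}")

# Hypothetical usage (dataset and model names are placeholders):
# compare_on_dev_sets(model_a, model_b, {
#     "web-image dev set":   (web_images,    web_labels),
#     "user-upload dev set": (upload_images, upload_labels),
# })
```

If A wins on the web-image dev set but B wins on the user-upload dev set, the original dev set is no longer predictive of real performance.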

This highlights the importance of evaluating on data that reflects the actual deployment scenario, rather than only on the original dev/test set. The underlying issue is a distribution mismatch: the dev/test set is drawn from a different distribution than the data the application must handle, so performance on it is no longer a reliable guide to performance in production.

**Guidelines for Defining Evaluation Metrics and Testing Sets**

So, when should you change your evaluation metric or dev/test set? The guideline is simple: change them when doing well on them no longer corresponds to doing well on what you actually care about. If your current metric and dev/test set rank algorithms by something that isn't relevant to the real-world problem you're trying to solve, it's time to adjust them.

One mistake machine learning teams should avoid is running for too long without a well-defined evaluation metric and dev set, because that slows down how quickly the team can iterate and improve the algorithm. A better approach is to set up a metric and dev set quickly, even an imperfect one, use it to drive iteration, and change it later if you discover it is pointing at the wrong target.

Having an evaluation metric and dev/test set in place lets a team decide much more quickly whether algorithm A or algorithm B is better, which speeds up iteration. The workflow is: decide what success looks like, optimize the algorithm against that target, and move the target whenever it stops reflecting what you actually care about.
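The same weighting idea can be carried into the optimization step, actually doing well on the metric, by incorporating the per-example weights into the cost function J that the model is trained to minimize. A minimal sketch, assuming a binary cross-entropy loss and the same weighting scheme as the metric above (all names are illustrative):

```python
import numpy as np

def weighted_cost(y_prob, y_true, is_porn, porn_weight=10.0, eps=1e-12):
    """Training cost J = (1 / sum(w_i)) * sum_i w_i * L(y_hat_i, y_i),
    where L is taken here to be binary cross-entropy and w_i mirrors the
    evaluation weights: 1 for non-porn examples, porn_weight for porn ones.
    """
    y_prob = np.asarray(y_prob, dtype=float)   # predicted probabilities in (0, 1)
    y_true = np.asarray(y_true, dtype=float)   # ground-truth labels, 0 or 1
    weights = np.where(np.asarray(is_porn), porn_weight, 1.0)
    losses = -(y_true * np.log(y_prob + eps)
               + (1.0 - y_true) * np.log(1.0 - y_prob + eps))
    return np.sum(weights * losses) / np.sum(weights)
```

Changing the cost function in this way is the "aim and shoot" step, kept deliberately separate from the "place the target" step of choosing the metric itself.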

"WEBVTTKind: captionsLanguage: enyou've seen how such ever death sets and evaluation metric is like placing a target somewhere for your team to aim at but sometimes partway through a project you might realize you put your target in the wrong place in that case you should move your target let's take a look at an example let's say you built a CAD classifier to try to find lots of pictures of cats to show to your can't loving users and the message if you decide to use is classification error so Avram's a and B have respectively three percent error and five percent error so it seems like Gotham a is doing better but let's say you try out these algorithms you look at these algorithms and album a for some reason is letting through a lot of pornographic images so if you ship algorithm a the users would see more cat images because you know see percent error is identifying cats but it also shows you users some holographic images which is totally unacceptable but for your company as well as for your users and in contrast album B as five percent error so it lists classifies fewer images but it doesn't have pornographic images so from your company's point of view as well as from a user acceptance point of view album B is actually much better algorithm because you know it's not learning through any common graphics images so what's happened in this example is that album a is doing better on the evaluation metric is giving three percent error but is actually worse algorithm so in this case the evaluation metric plus the DEF set it prefers album a because the same look around a has lower error which is the metric is using when you're you and your users prefer a ver and B because it's not learning through Paulo graphics images so when this happens when your evaluation metric is no longer correctly rank ordering preferences between algorithms in this case is miss predicting that album a is a better algorithm then there's a sign that you should change your evaluation metric or perhaps you're on development sets or test set so in this case Dimas classification error metric that you're using can be written as follows as 1 over M a number of examples in your development set of sum from I equals 1 to M and depth number of examples of development set of indicator whether or not the prediction example I a development set is not equal to the actual label I wear this is useless notation to denote about their predictive value right so either 0 and this indicator function notation counts up the number of examples on which you know this thing in sign is true so this formula just counts up the number of misclassifying examples so the problem with this evaluation metric is that they trees pornographic and non pornographic images equally but you really want to conspire to not miss label pornographic images like maybe recognize a pornographic images academician therefore show it to a unsuspecting users therefore very unhappy with you know unexpectedly seeing porn so one way to change this evaluation metric would be if you had a weight term here and we call this WI where WI is going to be equal to 1 in front X I is non porn and maybe 10 or maybe even larger number like 100 if X I is porn so this way you're giving a much larger weight to examples that are pornographic so that the error term goes up much more on the algorithm this makes mistakes on classifying appalling graphics images a cat image and in this example you're giving a 10 times bigger wings to classifying on the graphic images correctly and if you want this 
normalization constant it will be right technically this becomes sum over I of WI so then this error would still be between 0 on 1 so the details of this weighting aren't important and actually influences waiting you need to actually go through your def in 10 sets to label the poly graphic images in your def and test sets you can implement this weighting function but a high level takeaway is if you find that you valuation metric is not giving the correct rank order preference for what is actually better algorithm then that's a time to think about defining a new evaluation metric that you know and this is just one possible way that you could define an evaluation metric and the goal of the evaluation metric is to accurately tell you give in to class size which one is better for your application for the purpose of this video don't worry too much about the details of how we define a new error metric the point is that if you're not satisfied with your own error metric then don't keep coasting with an error metric you're unsatisfied with is that try to define a new one that you think better captures your preferences in terms of what's actually a better algorithm one thing you might notice is that so far we've only talked about how to define a metric to evaluate classifiers that is what defines an evaluation metric that helps us better rank all their classifiers when they are performing at varying levels in terms of streaming out porn and this is actually an example of a full realization where I think you should take a machine learning problem and break it into distinct steps so one knob or one step is to figure out how to define a metric they captures what you want to do and I will is separately about how to actually do well on this metric so I think of the machine learning tasks as two distinct steps to use the target analogy the first step is to place the target so define where you want to aim and then it's a completely separate step so this is one knife you could tune which is how do you place the target and it's completely separate problem think of it as a separate knob to tune in terms of how to do well at a target or how to you know aim accurately or how to shoot at the target and so defining a metric is step one and you do something else for step two so in terms of shooting the target maybe your learning algorithm is optimizing some cost function that looks like this are you minimizing this you know some losses on your training set so one thing you could do is to also modify this in order to incorporate these weights and maybe you end up changing this normalization constant as well suggest one over sum of WI between the details of how you define J is important but the point was with the philosophy for formalization think of placing the target that's one step and aiming and shooting the target as a distinct step which you do separately so so in other words I encourage you to think of defining the metrics as one step and then you can and only after you've defined in metric figure out how to do well on that metric which might be changing the cost function J that your network is optimizing but for going on let's look at just one more example let's say that you're two cat classifiers a and B have respectively 3 percent error and 5 percent error as evaluated on your death set or maybe even on your test set which are images downloaded off the internet so high quality well frame images but maybe we need to tour your album a product you find that averin B actually looks like it's performing better 
even though it's doing better or your death set and you find that you've been training off very nice high quality images download without the internet but when you deploy this on a mobile app users are uploading all sorts of pictures much less trained you're going public the Catholic has twenty fish expressions maybe images are much blurrier and when you test out your algorithms you find that a from B is actually doing better so this would be another example of your metrics and depth test set falling down and the problem is that you're evaluating on a different test set a very nice high resolution well fairing images but what your users really care about is the album doing well on images they're uploading which are maybe less professional shots and blurrier and less wall frames so the guideline is they're doing well on your metrics and your current deficit or different test sets distribution if that does not correspond to doing well on the application you actually care about then change your metrics and/or your depth test set so in other words you can discover that your dev test set has these very high quality images but evaluating on this dev test set is not predictive of how well your app actually performs because you asked me to view of lower quality images then that's a good time to change your dev test set so that your data better reflects the type of data you actually need to do well on but your vote guideline is really if your current metrics and data and your validating on doesn't correspond to do well and what you actually care about and change your metric and/or your depth test set submit to capture what you need the album to actually do well on having the evaluation metrics and the death set allows you to much more quickly make decisions about this album a or ever be better so it really speeds up how quickly you or your team can iterate so my recommendation is even if you can't define the perfect evaluation metric and DEF set just set something up quickly and use that to drive the speed of your team iterating and then later down the line you find out that it wasn't a good one you're a better idea change it at that time it's perfectly okay but what I recommend it so most teams is to run for too long without an evaluation metric and depth set because that can slow down the efficiency or who should team can iterate and improve your algorithm so that's it on when to change your evaluation measures and/or depth and test sets hope that these guidelines help you serve your whole team to have a well-defined target you can iterate efficiently towards improving performance onyou've seen how such ever death sets and evaluation metric is like placing a target somewhere for your team to aim at but sometimes partway through a project you might realize you put your target in the wrong place in that case you should move your target let's take a look at an example let's say you built a CAD classifier to try to find lots of pictures of cats to show to your can't loving users and the message if you decide to use is classification error so Avram's a and B have respectively three percent error and five percent error so it seems like Gotham a is doing better but let's say you try out these algorithms you look at these algorithms and album a for some reason is letting through a lot of pornographic images so if you ship algorithm a the users would see more cat images because you know see percent error is identifying cats but it also shows you users some holographic images which is totally unacceptable 
but for your company as well as for your users and in contrast album B as five percent error so it lists classifies fewer images but it doesn't have pornographic images so from your company's point of view as well as from a user acceptance point of view album B is actually much better algorithm because you know it's not learning through any common graphics images so what's happened in this example is that album a is doing better on the evaluation metric is giving three percent error but is actually worse algorithm so in this case the evaluation metric plus the DEF set it prefers album a because the same look around a has lower error which is the metric is using when you're you and your users prefer a ver and B because it's not learning through Paulo graphics images so when this happens when your evaluation metric is no longer correctly rank ordering preferences between algorithms in this case is miss predicting that album a is a better algorithm then there's a sign that you should change your evaluation metric or perhaps you're on development sets or test set so in this case Dimas classification error metric that you're using can be written as follows as 1 over M a number of examples in your development set of sum from I equals 1 to M and depth number of examples of development set of indicator whether or not the prediction example I a development set is not equal to the actual label I wear this is useless notation to denote about their predictive value right so either 0 and this indicator function notation counts up the number of examples on which you know this thing in sign is true so this formula just counts up the number of misclassifying examples so the problem with this evaluation metric is that they trees pornographic and non pornographic images equally but you really want to conspire to not miss label pornographic images like maybe recognize a pornographic images academician therefore show it to a unsuspecting users therefore very unhappy with you know unexpectedly seeing porn so one way to change this evaluation metric would be if you had a weight term here and we call this WI where WI is going to be equal to 1 in front X I is non porn and maybe 10 or maybe even larger number like 100 if X I is porn so this way you're giving a much larger weight to examples that are pornographic so that the error term goes up much more on the algorithm this makes mistakes on classifying appalling graphics images a cat image and in this example you're giving a 10 times bigger wings to classifying on the graphic images correctly and if you want this normalization constant it will be right technically this becomes sum over I of WI so then this error would still be between 0 on 1 so the details of this weighting aren't important and actually influences waiting you need to actually go through your def in 10 sets to label the poly graphic images in your def and test sets you can implement this weighting function but a high level takeaway is if you find that you valuation metric is not giving the correct rank order preference for what is actually better algorithm then that's a time to think about defining a new evaluation metric that you know and this is just one possible way that you could define an evaluation metric and the goal of the evaluation metric is to accurately tell you give in to class size which one is better for your application for the purpose of this video don't worry too much about the details of how we define a new error metric the point is that if you're not satisfied with your own 
error metric then don't keep coasting with an error metric you're unsatisfied with is that try to define a new one that you think better captures your preferences in terms of what's actually a better algorithm one thing you might notice is that so far we've only talked about how to define a metric to evaluate classifiers that is what defines an evaluation metric that helps us better rank all their classifiers when they are performing at varying levels in terms of streaming out porn and this is actually an example of a full realization where I think you should take a machine learning problem and break it into distinct steps so one knob or one step is to figure out how to define a metric they captures what you want to do and I will is separately about how to actually do well on this metric so I think of the machine learning tasks as two distinct steps to use the target analogy the first step is to place the target so define where you want to aim and then it's a completely separate step so this is one knife you could tune which is how do you place the target and it's completely separate problem think of it as a separate knob to tune in terms of how to do well at a target or how to you know aim accurately or how to shoot at the target and so defining a metric is step one and you do something else for step two so in terms of shooting the target maybe your learning algorithm is optimizing some cost function that looks like this are you minimizing this you know some losses on your training set so one thing you could do is to also modify this in order to incorporate these weights and maybe you end up changing this normalization constant as well suggest one over sum of WI between the details of how you define J is important but the point was with the philosophy for formalization think of placing the target that's one step and aiming and shooting the target as a distinct step which you do separately so so in other words I encourage you to think of defining the metrics as one step and then you can and only after you've defined in metric figure out how to do well on that metric which might be changing the cost function J that your network is optimizing but for going on let's look at just one more example let's say that you're two cat classifiers a and B have respectively 3 percent error and 5 percent error as evaluated on your death set or maybe even on your test set which are images downloaded off the internet so high quality well frame images but maybe we need to tour your album a product you find that averin B actually looks like it's performing better even though it's doing better or your death set and you find that you've been training off very nice high quality images download without the internet but when you deploy this on a mobile app users are uploading all sorts of pictures much less trained you're going public the Catholic has twenty fish expressions maybe images are much blurrier and when you test out your algorithms you find that a from B is actually doing better so this would be another example of your metrics and depth test set falling down and the problem is that you're evaluating on a different test set a very nice high resolution well fairing images but what your users really care about is the album doing well on images they're uploading which are maybe less professional shots and blurrier and less wall frames so the guideline is they're doing well on your metrics and your current deficit or different test sets distribution if that does not correspond to doing well on the 
application you actually care about then change your metrics and/or your depth test set so in other words you can discover that your dev test set has these very high quality images but evaluating on this dev test set is not predictive of how well your app actually performs because you asked me to view of lower quality images then that's a good time to change your dev test set so that your data better reflects the type of data you actually need to do well on but your vote guideline is really if your current metrics and data and your validating on doesn't correspond to do well and what you actually care about and change your metric and/or your depth test set submit to capture what you need the album to actually do well on having the evaluation metrics and the death set allows you to much more quickly make decisions about this album a or ever be better so it really speeds up how quickly you or your team can iterate so my recommendation is even if you can't define the perfect evaluation metric and DEF set just set something up quickly and use that to drive the speed of your team iterating and then later down the line you find out that it wasn't a good one you're a better idea change it at that time it's perfectly okay but what I recommend it so most teams is to run for too long without an evaluation metric and depth set because that can slow down the efficiency or who should team can iterate and improve your algorithm so that's it on when to change your evaluation measures and/or depth and test sets hope that these guidelines help you serve your whole team to have a well-defined target you can iterate efficiently towards improving performance on\n"