#15 Machine Learning Engineering for Production (MLOps) Specialization [Course 1, Week 2, Lesson 7]

The Importance of Prioritizing Data Tags in Machine Learning

In machine learning, prioritizing which data tags to work on is crucial for improving an algorithm's performance. In the example discussed previously, the data was divided into four tags, and for each tag we examined the algorithm's accuracy, human-level performance (HLP), and the gap between the two. Rather than simply working on the tag with the largest gap to HLP, another useful factor to consider is the percentage of data carrying each tag.
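Computing the percentage of data with each tag is straightforward once examples carry tag labels. A minimal sketch, assuming a per-example list of tag names (the tag names here are taken from the speech example; the data itself is made up):

```python
from collections import Counter

# Hypothetical tag labels for a speech dataset, matching the proportions
# in the example below: 60% clean, 4% car noise, 30% people noise, 6% low bandwidth.
tags = (["clean"] * 60 + ["car_noise"] * 4
        + ["people_noise"] * 30 + ["low_bandwidth"] * 6)

counts = Counter(tags)
total = len(tags)
percentages = {tag: 100 * n / total for tag, n in counts.items()}
# clean: 60.0, car_noise: 4.0, people_noise: 30.0, low_bandwidth: 6.0
```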

For instance, suppose sixty percent of your data is clean speech, four percent is data with car noise, thirty percent has people noise, and six percent is low-bandwidth audio. If we could raise accuracy on clean speech from 94 to 95 percent, then multiplying that 1 percent gain by the 60 percent of the data it applies to shows that reaching human-level performance on clean speech would make the overall speech system 0.6 percent more accurate: we do 1 percent better on 60 percent of the data. For car noise, by contrast, we could improve performance by four percent, but only on four percent of the data.

Multiplying those out gives just a 0.16 percent improvement for car noise, while people noise yields 0.6 percent and low-bandwidth audio essentially zero percent, because accuracy there already matches human-level performance. This slightly richer analysis shows that because people noise accounts for such a large fraction of the data, it may be more worthwhile to work on people noise or clean speech: both offer larger potential improvements than speech with car noise, even though car noise has the biggest gap to HLP.
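The arithmetic above, gap to human-level performance times fraction of data, can be sketched as follows. The percentages follow the lecture's example; the exact accuracy and HLP figures for the noisy categories are assumed so that the gaps match the stated results (4 points for car noise, 2 points for people noise, 0 for low bandwidth):

```python
# tag: (model accuracy %, human-level performance %, % of data)
rows = {
    "clean_speech":  (94.0, 95.0, 60.0),
    "car_noise":     (89.0, 93.0,  4.0),   # 4-point gap, assumed from the text
    "people_noise":  (87.0, 89.0, 30.0),   # 2-point gap, so 2 x 30% = 0.6%
    "low_bandwidth": (70.0, 70.0,  6.0),   # HLP equals model accuracy: no room
}

def potential_gain(acc, hlp, pct_data):
    """Overall accuracy gained if this slice reached human-level performance."""
    return (hlp - acc) * pct_data / 100

gains = {tag: potential_gain(*row) for tag, row in rows.items()}
# clean_speech: 0.6, car_noise: 0.16, people_noise: 0.6, low_bandwidth: 0.0
```

Under these numbers, clean speech and people noise tie at 0.6 percent of potential overall gain, which is why they may deserve attention before car noise.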

When prioritizing what to work on, you might choose the most important categories based on how much room for improvement there is, compared either to human-level performance or to some baseline, and on how frequently each category appears. You can also take into account how easy it would be to improve accuracy in that category. For example, if you have concrete ideas for improving accuracy on speech with car noise, perhaps through data augmentation, that might lead you to prioritize it over some other category where you just don't have as many ideas for how to improve the system.

Finally, consider how important it is to improve performance on that category. For instance, you might decide that improving performance with car noise matters especially because drivers want to search for and find addresses hands-free while their hands are on the steering wheel. There is no mathematical formula that will tell you what to work on, but by weighing these factors you can make more fruitful decisions about which category of data, say data with car noise, to focus on.
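As the lecture stresses, there is no formula for this decision. Still, one informal way to structure a team discussion is to rate each category on the three factors just listed and compare totals. Everything below, the ratings and the scoring scheme, is a hypothetical illustration, not a method from the course:

```python
# tag: (room for improvement, ease of improvement, importance), each rated 1-5.
# These ratings are invented for illustration only.
categories = {
    "clean_speech":  (3, 4, 3),
    "car_noise":     (4, 4, 5),  # hands-free driving makes this important
    "people_noise":  (4, 2, 3),
    "low_bandwidth": (1, 1, 2),
}

ranked = sorted(categories, key=lambda t: sum(categories[t]), reverse=True)
# Under these made-up ratings, car_noise ranks first.
```

A ranking like this is only a conversation starter; the judgment calls behind each rating are where the real prioritization happens.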

By focusing your data-improvement effort on the tags you have determined are most fruitful, you can be much more efficient in improving your learning algorithm's performance. This type of error analysis procedure is useful in many projects and will help you build production-ready machine learning systems.

Managing Skewed Data Sets

Another challenge in machine learning is managing skewed data sets. A skewed data set can yield a model that looks accurate overall while performing poorly on the rare class, and hence on unseen data drawn from it. To manage skew, we need to understand its underlying causes and take steps to mitigate it.

For example, suppose you have 1,000 instances of one class but only 10 instances of another. The minority class is then badly underrepresented relative to the majority class, which can lead to biased models.
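One simple way to quantify and mitigate this kind of imbalance is to measure the class ratio and then randomly oversample the minority class. This is a minimal sketch using the 1,000-versus-10 numbers from the text; the label names are made up, and oversampling is only one of several possible mitigations (others include undersampling the majority class or class-weighted losses):

```python
import random

random.seed(0)

# Hypothetical imbalanced labels: 1,000 majority vs 10 minority, as in the text.
labels = ["majority"] * 1000 + ["minority"] * 10

n_maj = labels.count("majority")
n_min = labels.count("minority")
ratio = n_maj / n_min  # 100.0, i.e. a 100:1 imbalance

# Random oversampling: duplicate minority examples until the classes balance.
minority = [l for l in labels if l == "minority"]
balanced = labels + random.choices(minority, k=n_maj - n_min)
# balanced now has 1,000 examples of each class.
```

Note that naive oversampling duplicates examples rather than adding information, so it is best paired with collecting or augmenting genuinely new minority-class data.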
