The Importance of Prioritizing Data Tags in Machine Learning
In machine learning, prioritizing which data tags to work on is crucial for improving an algorithm's performance. In the example discussed previously, the data was split into four tags, and for each tag we looked at the accuracy of the algorithm, human-level performance, and the gap between the two. To prioritize where to focus attention, we also need to analyze the percentage of data carrying each tag.
For instance, let's say that sixty percent of your data is clean speech, four percent is speech with car noise, thirty percent has people noise, and six percent is low-bandwidth audio. Suppose we could raise our accuracy on clean speech from 94 to 95 percent, reaching human-level performance there. Multiplying that 1 percent gain by the 60 percent of the data it covers tells us our overall speech system would become 0.6 percent more accurate, because we would do 1 percent better on 60 percent of the data.
For car noise, improving performance by 4 percent on 4 percent of the data works out to a 0.16 percent improvement in overall accuracy, and doing the same multiplication for low-bandwidth audio gives essentially zero percent, because we're already at human-level performance there and can't make it any better. This slightly richer analysis shows that because people noise accounts for such a large fraction of the data, it may be more worthwhile to work on either people noise or maybe on clean speech, because there's actually larger potential for improvement in both of those than for speech with car noise.
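The arithmetic above can be sketched in a few lines of Python. The fractions and accuracy gaps for clean speech, car noise, and low-bandwidth audio come straight from the example; the 2 percent gap for people noise is an assumed illustrative value, since the section doesn't state it.

```python
# Potential overall accuracy gain per tag = fraction of data * accuracy gap.
# Clean speech, car noise, and low-bandwidth numbers are from the example;
# the people-noise gap of 2% is an ASSUMED value for illustration only.
tags = {
    "clean speech":  {"fraction": 0.60, "gap_pct": 1.0},  # 94% -> 95%
    "car noise":     {"fraction": 0.04, "gap_pct": 4.0},
    "people noise":  {"fraction": 0.30, "gap_pct": 2.0},  # assumed gap
    "low bandwidth": {"fraction": 0.06, "gap_pct": 0.0},  # already at human level
}

potential = {tag: t["fraction"] * t["gap_pct"] for tag, t in tags.items()}
for tag, gain in sorted(potential.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag}: +{gain:.2f}% overall accuracy")
```

Running this reproduces the 0.6 percent figure for clean speech and 0.16 percent for car noise, making it easy to see at a glance where the potential lies.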
When prioritizing what to work on, you might decide on the most important categories based on how much room for improvement there is, either compared to human-level performance or to some baseline. You can also take into account how easy it is to improve accuracy in a category. For example, if you have some ideas for improving the accuracy of speech with car noise, say through data augmentation, that might cause you to prioritize that category more highly than some other category where you just don't have as many ideas for how to improve the system.
Additionally, consider how important it is to improve performance on each category. For instance, you might decide that improving performance on speech with car noise is especially important because, when driving, users want to search for and find addresses hands-free, with their hands on the steering wheel. There is no mathematical formula that will tell you what to work on, but by looking at these factors you can hope to make more fruitful decisions once you've decided that you want to work on one category of data, say data with car noise.
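As the text says, there is no formula for this decision, but one rough way to organize the judgment call is to jot down the three factors per tag and combine them. In this sketch the ease and importance weights are entirely hypothetical placeholders, as is the people-noise gap; in practice you would set them from your own assessment.

```python
# Rough prioritization sketch: room for improvement (fraction * gap) weighted
# by ease of improvement and importance. The ease/importance weights and the
# people-noise gap are HYPOTHETICAL -- judgment calls, not measured data.
tags = {
    # tag: (fraction of data, accuracy gap %, ease 0-1, importance 0-1)
    "clean speech":  (0.60, 1.0, 0.4, 0.8),
    "car noise":     (0.04, 4.0, 0.7, 1.0),  # e.g. data augmentation ideas exist
    "people noise":  (0.30, 2.0, 0.5, 0.8),
    "low bandwidth": (0.06, 0.0, 0.1, 0.5),
}

scores = {
    tag: fraction * gap * ease * importance
    for tag, (fraction, gap, ease, importance) in tags.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A tabulation like this doesn't replace judgment, but it forces each factor to be made explicit, which often makes team discussions about priorities more concrete.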
By focusing your data-improvement efforts on the tags you have determined are most fruitful, you can be much more efficient in improving your learning algorithm's performance. This type of error analysis procedure is very useful for many projects and will help you in building production-ready machine learning systems.
Managing Skewed Data Sets
Another challenge in machine learning is managing skewed data sets. Skewed data sets can lead to biased models that don't perform well on unseen data. To manage skewed data sets, we need to understand the underlying causes of skewness and take steps to mitigate it.
Suppose you have a dataset with an imbalanced number of instances for each class: say, 1000 instances of one class and only 10 instances of another. The minority class has far fewer instances than the majority class, which can lead to biased models.
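One common mitigation, sketched minimally here using only the standard library, is to oversample the minority class so that each class contributes equally during training; undersampling the majority class or weighting the loss per class are alternatives. The toy dataset below mirrors the 1000-versus-10 example.

```python
import random
from collections import Counter

random.seed(0)

# Toy imbalanced dataset matching the example: 1000 majority vs 10 minority instances.
dataset = [("features", 0)] * 1000 + [("features", 1)] * 10

majority = [ex for ex in dataset if ex[1] == 0]
minority = [ex for ex in dataset if ex[1] == 1]

# Randomly resample the minority class (with replacement) up to the majority size.
minority_oversampled = random.choices(minority, k=len(majority))
balanced = majority + minority_oversampled
random.shuffle(balanced)

print(Counter(label for _, label in balanced))
```

Note that oversampling only duplicates existing minority examples, so for real systems it is often paired with data augmentation or class-weighted losses rather than used alone.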
As with the tag-based analysis above, when deciding how to handle a skewed data set you should weigh how much room for improvement there is, how easy that improvement would be to achieve, and how important the affected class or category is to your application.