#20 Machine Learning Engineering for Production (MLOps) Specialization [Course 1, Week 2, Lesson 12]
**Using Data Augmentation to Improve Machine Learning Performance**
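One common approach to improving machine learning performance is data augmentation, which can be a very efficient way to get more data for unstructured problems such as images, audio, and possibly text. What is inefficient is tuning it by trial and error: if you generate an augmented dataset, train on it, adjust the augmentation parameters, and train again, every parameter change costs a full retraining run that can take hours or sometimes days. A systematic checklist lets you sanity-check generated data before committing to that training.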
**Verifying Data Augmentation Criteria**
When generating new data with data augmentation, a short checklist helps verify that the data is likely to improve performance before any retraining. The first criterion is whether the new data is realistic. For audio applications, the generated clip should sound like the audio you want the algorithm to perform well on, not artificial or unnatural. The second criterion is whether the x-to-y mapping is clear: a human should still be able to recognize what was said. An example so noisy that nobody can make out the words fails this check, because neither a human nor any algorithm could be expected to do well on it.
**Ensuring Data Quality**
To verify these criteria, perform sanity checks on the generated data. For audio, play back the new clips and confirm that the speech is still intelligible. For images, inspect the augmented images visually and confirm that a person can still see what the label describes, such as a scratch on a phone.
**Identifying Areas Where the Algorithm Struggles**
The third criterion is whether the algorithm currently does poorly on the new data. If it already does well on these examples, there is less for it to learn from them. Error analysis on the original data points to the kinds of input where the algorithm struggles, such as speech with cafe noise, and evaluating the current model on the generated examples confirms that they are hard for the algorithm while still easy for a human.
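The three checklist items can be applied as a simple filter over candidate examples. The sketch below is illustrative only: the `Candidate` fields, `passes_checklist`, and `select_useful` are hypothetical names, and it assumes a person has already listened to each clip and recorded whether it sounds realistic and what they heard.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """One augmented clip plus the outcome of a quick human review (hypothetical fields)."""
    clip_id: str
    transcript: str         # the label y the clip is supposed to carry
    sounds_realistic: bool  # reviewer's judgment after listening
    human_transcript: str   # what the reviewer actually heard


def _norm(text: str) -> str:
    # Compare transcripts case-insensitively and ignore extra whitespace.
    return " ".join(text.lower().split())


def passes_checklist(c: Candidate, model_predict: Callable[[str], str]) -> bool:
    """Return True only if all three checklist items hold for this candidate."""
    realistic = c.sounds_realistic                                        # 1. sounds realistic
    mapping_clear = _norm(c.human_transcript) == _norm(c.transcript)      # 2. a human gets it right
    model_struggles = _norm(model_predict(c.clip_id)) != _norm(c.transcript)  # 3. the model gets it wrong
    return realistic and mapping_clear and model_struggles


def select_useful(cands: List[Candidate], model_predict: Callable[[str], str]) -> List[Candidate]:
    """Keep only the candidates worth adding to the training set."""
    return [c for c in cands if passes_checklist(c, model_predict)]
```

Here `model_predict` stands in for whatever inference call the current model exposes. The point of the design is that this filter needs only inference and a quick human review, not a retraining run.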
**Using Data Augmentation Effectively**
If the generated data meets all three criteria, there is a higher chance that adding it to the training set and retraining will improve performance. For images, simple techniques such as flipping an image horizontally or changing its contrast or brightness often produce realistic examples. Darkening an image so much that a person can no longer tell whether a scratch is present fails the checklist, so the augmentation scheme should be chosen to produce more examples of the first kind and fewer of the second.
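A minimal NumPy sketch of the flip and brightness changes follows. It assumes images are `uint8` arrays shaped height × width (optionally with a channel axis). The function names are hypothetical, and the mean-brightness threshold is an arbitrary stand-in for the human check the lecture actually recommends.

```python
import numpy as np


def flip_horizontal(img: np.ndarray) -> np.ndarray:
    """Mirror the image left to right (works for HxW and HxWxC arrays)."""
    return img[:, ::-1].copy()


def adjust_brightness(img: np.ndarray, factor: float) -> np.ndarray:
    """Scale pixel intensities; factor > 1 brightens, factor < 1 darkens."""
    scaled = img.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)


def is_probably_too_dark(img: np.ndarray, min_mean: float = 30.0) -> bool:
    """Crude proxy for 'a person can no longer see the scratch'.

    The threshold is an assumption for illustration; looking at the
    image is still the real test.
    """
    return float(img.mean()) < min_mean
```

A cheap screen like `is_probably_too_dark` can discard obviously unusable outputs before anyone inspects the rest by eye.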
**Examples of Data Augmentation Techniques**
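For speech recognition, one technique is to add background noise: take a speech clip and a clip of cafe noise or background music, literally sum the two waveforms, and the result sounds like someone speaking in a noisy cafe or with the radio on. The choices to make are which types of background noise to use and how loud the noise should be relative to the speech. For images, a more sophisticated technique is to take a picture of a phone with no scratches and use Photoshop to draw a scratch artificially. If the result looks realistic, a person can recognize the scratch, and the algorithm does not yet detect it, it is a good example to add to the training set.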
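For audio, the lecture's example is to sum a speech waveform with a background-noise waveform, with the noise level relative to the speech as the main parameter to choose. One way to make that parameter explicit is a target signal-to-noise ratio in decibels, as in the sketch below. Expressing loudness as SNR and the name `mix_at_snr` are assumptions of this sketch rather than something the lecture prescribes, and both inputs are assumed to be mono float arrays at the same sample rate.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add background noise to speech at a chosen signal-to-noise ratio (dB)."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Loop or trim the noise so it covers the whole speech clip.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")

    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    # The sum may exceed [-1, 1]; rescale or clip before writing to a file.
    return speech + scale * noise
```

Lower `snr_db` values give louder noise. Listening to a few outputs across a range of values is a quick way to find where speech stops being intelligible to a human, which marks the point beyond which examples are no longer useful.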
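**Using GANs and the Data Iteration Loop**
More advanced techniques such as generative adversarial networks (GANs) can synthesize scratches automatically. These can be overkill, though: simpler techniques are much faster to implement and often work just fine without the complexity of building a GAN. Model iteration refers to repeatedly training a model, doing error analysis, and deciding how to improve the model. In a data-centric approach it is sometimes more useful to run a data iteration loop instead: train the algorithm, do error analysis, and focus each pass on adding data or improving its quality. A robust hyperparameter search still matters, but for many practical applications the data iteration loop results in faster improvement, depending on the problem.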
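The data iteration loop can be outlined in a few lines. This is a schematic sketch, not a prescribed implementation: `train_fn`, `error_by_slice_fn`, and `augment_fn` are hypothetical callables that stand in for your own training code, error analysis, and augmentation step, and the stopping rule is an assumption.

```python
from typing import Any, Callable, Dict, List, Tuple


def data_iteration_loop(
    train_set: List[Any],
    train_fn: Callable[[List[Any]], Any],
    error_by_slice_fn: Callable[[Any], Dict[str, float]],
    augment_fn: Callable[[str], List[Any]],
    rounds: int = 3,
    target_error: float = 0.1,
) -> Tuple[List[Any], List[Dict[str, float]]]:
    """Train, analyze errors per slice, then add data for the worst slice."""
    history: List[Dict[str, float]] = []
    for _ in range(rounds):
        model = train_fn(train_set)           # train on the current data
        errors = error_by_slice_fn(model)     # error analysis, e.g. {"cafe noise": 0.4}
        history.append(errors)

        worst_slice, worst_error = max(errors.items(), key=lambda kv: kv[1])
        if worst_error <= target_error:
            break                             # every slice is good enough

        # Keep the model code fixed; improve the data for the weakest slice.
        train_set = train_set + augment_fn(worst_slice)
    return train_set, history
```

The loop changes only the data between rounds, which is what distinguishes it from model iteration. In practice, the new examples would first be checked for realism and human recognizability before being added.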
**The Impact of Adding New Data**
A common question is whether adding data can hurt the learning algorithm's performance. For unstructured data problems, the answer is usually no, with some caveats that the next video in the course covers in more detail.
**Conclusion**
Data augmentation can be an efficient way to improve machine learning performance on unstructured data when paired with careful verification. Generated data that is realistic, that humans can still label correctly, and that the algorithm currently struggles with lets you improve performance in a targeted way without repeated trial-and-error retraining. Adding data usually does not hurt performance on unstructured data problems, though there are caveats to keep in mind.
**Transcript**
Data augmentation can be a very efficient way to get more data, especially for unstructured data problems such as images, audio, and maybe text. But when carrying out data augmentation, there are a lot of choices you have to make. What are the parameters? How do you design the data augmentation setup? Let's dive into this to look at some best practices.

Take speech recognition. Given an audio clip like this: "AI is the new electricity." If you take background cafe noise, it sounds like this. And if you add these two audio clips together, literally take the two waveforms and sum them up, then you can create a synthetic example that sounds like this: "AI is the new electricity." So it sounds like someone's saying "AI is the new electricity" in a noisy cafe. This is one form of data augmentation that lets you efficiently create a lot of data that sounds like data collected in the cafe. Or if you take the same audio, "AI is the new electricity," and add it to background music, then it sounds like someone's saying it with maybe the radio on in the background: "AI is the new electricity."

Now, when carrying out data augmentation, there are a few decisions you need to make. What types of background noise should you use, and how loud should the background noise be relative to the speech? Let's take a look at some ways of making these decisions systematically.

The goal of data augmentation is to create examples that your learning algorithm can learn from. As a framework for doing that, I encourage you to think of how you can create realistic examples that the algorithm does poorly on, because if the algorithm already does well on those examples, then there's less for it to learn from. But you want the examples to still be ones that a human, or maybe some other baseline, can do well on. Otherwise, one way to generate examples that the algorithm does poorly on would be to just create examples that are so noisy that no one can hear what anyone said, but that's not helpful. You want examples that are hard enough to challenge the algorithm, but not so hard that they're impossible for any human or any algorithm to ever do well on. That's why, when I am generating new examples using data augmentation, I try to generate examples that meet both of these criteria.

Now, one way that some people do data augmentation is to generate an augmented dataset, train the learning algorithm, see if the algorithm does better on the dataset, then fiddle around with the parameters for data augmentation, train the learning algorithm again, and so on. This turns out to be quite inefficient, because every time you change your data augmentation parameters, you need to train your neural network or your learning algorithm all over, and this can take a long time. Instead, I found that using these principles allows you to sanity-check that your new data generated using data augmentation is useful, without actually having to spend maybe hours or sometimes days training a learning algorithm on that data to verify that it will result in a performance improvement.

So specifically, here's a checklist you might go through when you are generating new data. One: does it sound realistic? You want your audio to actually sound like realistic audio of the sort that you want your algorithm to perform on. Two: is the x-to-y mapping clear? In other words, can humans still recognize what was said? This is to verify point two here. And three: is the algorithm currently doing poorly on this new data? That helps you verify point one. If you can generate data that meets all of these criteria, then that will give you a higher chance that when you put this data into your training set and retrain the algorithm, that will result in you successfully pulling up part of this rubber sheet.

Let's look at one more example, using images this time. Let's say that you have a very small set of images of smartphones with scratches. Here's how you may be able to use data augmentation. You can take the image and flip it horizontally. This results in a pretty realistic image. The phone buttons are now on the other side, but this could be a useful example to add to your training set. Or you could implement contrast changes. I've actually brightened up the image here so the scratch is a little bit more visible. Or you could try darkening the image, but in this example the image is now so dark that even I as a person can't really tell if there's a scratch there or not. And so, whereas these two examples on top would pass the checklist we had earlier, that a human can still detect the scratch well, this example is too dark. It would fail that checklist. So I would try to choose a data augmentation scheme that generates more examples that look like the ones on top and fewer of the ones that look like the ones here at the bottom.

In fact, going off the principle that we want images that look realistic, that humans can do well on, and that hopefully the algorithm does poorly on, you can also use more sophisticated techniques, such as taking a picture of a phone with no scratches and using Photoshop to artificially draw a scratch. This technique, literally using Photoshop, can also be an effective way to generate more examples. With this example with a scratch here, you may or may not be able to see it depending on the video compression and image contrast where you're watching this video, but this looks like a pretty realistic scratch. It was actually generated in Photoshop, and I as a person can recognize the scratch. So if the learning algorithm isn't detecting this right now, this would be a great example to add. I've also used more advanced techniques like GANs, generative adversarial networks, to synthesize scratches like these automatically, although I've found that techniques like that can also be overkill, meaning that there are simpler techniques that are much faster to implement and work just fine without the complexity of building a GAN to synthesize scratches.

You may have heard of the term model iteration, which refers to iteratively training a model, doing error analysis, and then trying to decide how to improve the model. Taking a data-centric approach to AI development, sometimes it's useful to instead use a data iteration loop, where you repeatedly take the data and the model, train your learning algorithm, do error analysis, and as you go through this loop, focus on how to add data or improve the quality of the data. For many practical applications, taking this data iteration loop approach, with a robust hyperparameter search (that's important too), results in faster improvement to your learning algorithm's performance, depending on your problem.

So when you're working on an unstructured data problem, data augmentation, if you can create new data that seems realistic, that humans can do quite well on, but that the algorithm struggles on, can be an efficient way to improve your learning algorithm's performance. And so if you found through error analysis that your learning algorithm does poorly on speech with cafe noise, data augmentation to generate more data with cafe noise could be an efficient way to improve your learning algorithm's performance.

Now, when you add data to your system, a question I've often been asked is: can adding data hurt your learning algorithm's performance? Usually, for unstructured data problems, the answer is no, with some caveats. But let's dive more deeply into this in the next video.