Training Data Split Strategies for Machine Learning
One of the most important steps in any machine learning project is preparing your data for training. Getting started can be daunting, especially when deciding how to split your data into training, development (dev), and test sets. In this article, we explore some common strategies for splitting your data, including the advantages and disadvantages of each approach.
One popular strategy for splitting data is an 80/10/10 split, where 80% of the data is used for training, 10% for development, and 10% for testing. This means that if you have 1,000 images in total, your training set would consist of 800 images, your dev set 100 images, and your test set 100 images. This works well when all of your data comes from a single distribution. However, if you pool data from several sources and shuffle everything together before splitting, the dev and test sets end up dominated by whichever source is largest, which may not be the distribution you actually care about.
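The plain shuffle-and-split strategy can be sketched in a few lines. This is a minimal illustration, not a library API; the function name `split_80_10_10` is invented for this example.

```python
import random

def split_80_10_10(examples, seed=0):
    """Shuffle a dataset and split it 80/10/10 into train/dev/test."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # deterministic shuffle for reproducibility
    n = len(examples)
    n_train = int(n * 0.8)
    n_dev = int(n * 0.1)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_80_10_10(range(1000))
print(len(train), len(dev), len(test))  # 800 100 100
```

With 1,000 examples this reproduces the 800/100/100 split described above.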
A better approach, when your data comes from multiple distributions, is to split your data based on its source. For example, if you have a large dataset crawled from the web but ultimately want your model to work on images uploaded from a mobile app, you can split your data as follows: all 200,000 web images plus 5,000 mobile-app images for training, then 2,500 mobile-app images for the dev set and 2,500 for the test set. This approach aims your dev and test sets squarely at the specific distribution of images you care about.
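The source-aware split above can be sketched as follows. This is a minimal sketch, assuming the counts from the example (200,000 web images, 10,000 app images, 2,500-example dev and test sets); the function name `distribution_aware_split` is invented for illustration.

```python
import random

def distribution_aware_split(web_data, app_data, dev_size=2500, test_size=2500, seed=0):
    """Reserve target-distribution (app) data for dev/test; put all web data,
    plus any surplus app data, into the training set."""
    app_data = list(app_data)
    random.Random(seed).shuffle(app_data)
    dev = app_data[:dev_size]
    test = app_data[dev_size:dev_size + test_size]
    train = list(web_data) + app_data[dev_size + test_size:]  # web + leftover app
    return train, dev, test

# 200,000 web images, 10,000 mobile-app images
train, dev, test = distribution_aware_split(range(200_000), range(10_000))
print(len(train), len(dev), len(test))  # 205000 2500 2500
```

Note that dev and test come entirely from the app data, so they measure performance on the distribution you care about, while the training set absorbs everything else.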
For instance, imagine building a speech-activated rearview mirror for a car. You can collect data from various sources, such as other speech recognition applications, voice-activated speakers and keyboards, or audio clips and transcripts purchased from vendors. To train a model for this specific product, you would split your dataset by source: for example, half a million utterances from other speech recognition tasks for training, and 10,000 utterances recorded from the speech-activated rearview mirror for each of the dev and test sets. Alternatively, if you have 20,000 rearview-mirror utterances, you could move 10,000 of them into the training set and keep 5,000 each for dev and test.
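The two alternatives for the speech example reduce to simple arithmetic. The counts below are the ones used in the example; the variable names are invented for illustration.

```python
# Hypothetical counts for the speech-activated rearview-mirror example
other_sources = 500_000   # utterances from other speech recognition tasks
mirror = 20_000           # utterances from the actual product

# Option A: keep all rearview-mirror data for dev/test
train_a = other_sources
dev_a = test_a = mirror // 2

# Option B: move half the rearview-mirror data into training
train_b = other_sources + mirror // 2
dev_b = test_b = mirror // 4

print(train_a, dev_a, test_a)  # 500000 10000 10000
print(train_b, dev_b, test_b)  # 510000 5000 5000
```

Option B trades a smaller dev/test set for a training set that includes some in-distribution examples.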
The advantage of this approach is that you can aim your model at the specific distribution of data you care about. By keeping the dev and test sets drawn entirely from the relevant source, such as the speech-activated rearview mirror, you direct your team to optimize for that specific task, while the large training set from other sources still gives the model plenty of data to learn from. Because the dev and test sets are representative of the real-world scenario, the improvements you measure translate into the performance your users will actually see.
Another advantage of this approach is that it avoids a subtle targeting problem. The dev set tells your team where to aim: if it is dominated by data from a source you don't care about, your team will spend most of its time optimizing for the wrong distribution. By reserving the dev and test sets for data from the target distribution, you ensure that measured progress reflects progress on the problem you actually want to solve.
However, this approach has a drawback: your training set now comes from a different distribution than your dev and test sets. This mismatch can make error analysis harder, and you may need additional techniques to diagnose and address the gap between the two distributions.
Ultimately, the best strategy for splitting your data will depend on the specific problem you are trying to solve and the type of data you have available. While there is no one-size-fits-all approach, using a combination of different strategies can help ensure that your model is robust and performs well on unseen data.
Video transcript:

Deep learning algorithms have a huge hunger for training data; they often work best when you can find enough labeled training data to put into the training set. This has resulted in many teams taking whatever data they can find and just shoving it into the training set to get more training data, even if some, or even a lot, of this data doesn't come from the same distribution as the dev and test data. So in the deep learning era, more and more teams are training on data that comes from a different distribution than their dev and test sets, and there are some subtleties and best practices for dealing with training and test distributions that differ from each other. Let's take a look.

Say you're building a mobile app where users upload pictures taken from their cell phones, and you want to recognize whether the pictures your users upload are of cats or not. You now have two sources of data. One is the distribution you really care about: data from the mobile app, which tends to be less professionally shot, less well framed, and maybe blurrier, since the photos are taken by amateur users. The other source is the web: you can crawl it and download a lot of high-resolution, professionally taken images of cats. Let's say you don't have many users yet, so you've gotten 10,000 pictures uploaded from the mobile app, but by crawling the web you have downloaded 200,000 cat pictures. What you really care about is that your final system does well on the mobile-app distribution of images, because those are the pictures your users will upload, and your classifier needs to do well on them.

You now have a bit of a dilemma: you have a relatively small dataset, just 10,000 examples, drawn from the distribution you care about, and a much bigger dataset drawn from a different distribution. Using only the 10,000 images leaves you with a small training set, and the 200,000 web images seem helpful, but they aren't from exactly the distribution you want. So what can you do?

Here's one option: put both datasets together, so you have 210,000 images, and randomly shuffle them into train, dev, and test sets. Say, for the sake of argument, that the dev and test sets will be 2,500 examples each, so the training set will be 205,000 examples. This has an advantage: the training, dev, and test sets all come from the same distribution, which makes them easier to manage. But it has a huge disadvantage: if you look at the 2,500 dev examples, most of them will come from the web-page distribution rather than the mobile-app distribution you actually care about. Of your total data, 200k out of 210k comes from web pages, so of those 2,500 examples, on expectation about 2,381 will come from web pages and only about 119 from the mobile app (the exact numbers will vary with the random shuffle). Remember that the dev set tells your team where to aim the target, and with this split you're telling them to spend most of their time optimizing for the web-page distribution of images, which is really not what you want. So I recommend against this option, because it sets up the dev set to make your team optimize for a different distribution of data than the one you actually care about.

Instead, I recommend a second option. Keep the training set at 205,000 images: all 200,000 images from the web plus, if you want, 5,000 images from the mobile app. Then make the dev and test sets all mobile-app images: 2,500 for the dev set and 2,500 for the test set. The advantage of this split is that you're now aiming the target where you want it to be: you're telling your team that the dev set contains data uploaded from the mobile app, which is the distribution of images you really care about, so build a machine learning system that does well on it. The disadvantage, of course, is that your training distribution now differs from your dev and test distributions, but it turns out this split will get you better performance over the long term, and we'll discuss later some specific techniques for dealing with a training set that comes from a different distribution than the dev and test sets.

Let's look at another example. Say you're building a brand-new product: a speech-activated rearview mirror for a car. (This is a real product in China, and it's making its way into other countries.) You build a rearview mirror that replaces the existing one, so that you can talk to it and say something like, "Dear rearview mirror, please help me find navigational directions to the nearest gas station," and it will deal with it. So this is a real problem, and say you're trying to build this for your own country. How can you get data to train a speech recognition system for this product? Maybe you've worked on speech recognition for a long time, so you have a lot of data from other speech recognition applications, just not from a speech-activated rearview mirror. Here's how you could split your training, dev, and test sets. For training, take all the speech data you have accumulated from other speech problems, such as data purchased over the years from various speech recognition data vendors (today you can actually buy data in the form of (x, y) pairs, where x is an audio clip and y is a transcript), plus data from work on smart speakers or voice-activated keyboards. For the sake of argument, say you have half a million utterances from all these sources. For your dev and test sets, you have a much smaller dataset that actually came from a speech-activated rearview mirror. Because users are asking navigational queries, this dataset contains a lot more street addresses: "please help me navigate to this street address," "please help me navigate to this gas station." This distribution of data is very different from the other sources, but it's really the data you care about, because it's what your product needs to do well on, so this is what you set your dev and test sets to be.

In this example, you set your training set to be the 500,000 utterances from the other sources, and your dev and test sets, which I'll abbreviate D and T, could be maybe 10,000 utterances each, drawn from the actual speech-activated rearview mirror. Alternatively, if you think you don't need to put all 20,000 rearview-mirror examples into the dev and test sets, you can move half of them into the training set: the training set becomes 510,000 utterances (all 500,000 from the other sources plus 10,000 from the rearview mirror), and the dev and test sets get 5,000 utterances each. So of the 20,000 rearview-mirror utterances, 10k goes into the training set, 5k into the dev set, and 5k into the test set. This is another reasonable way of splitting your data into train, dev, and test, and it gives you a much bigger training set, over 500,000 utterances, than if you used only the speech-activated rearview-mirror data.

In this video you've seen a couple of examples where allowing your training set to come from a different distribution than your dev and test sets lets you have much more training data, and in these examples it will make your learning algorithm perform better. One question you might now ask is: should you always use all the data you have? The answer is subtle; it's not always yes. We'll look at a counterexample in the next video.
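The expected composition of the dev set under the shuffle-everything split can be checked with a few lines of arithmetic, using the 200,000/10,000/2,500 counts from the cat-image example:

```python
# Expected composition of a 2,500-example dev set after randomly shuffling
# 200,000 web images together with 10,000 mobile-app images.
web, app = 200_000, 10_000
dev_size = 2_500

expected_web = dev_size * web / (web + app)  # fraction of dev set from the web
expected_app = dev_size * app / (web + app)  # fraction of dev set from the app
print(round(expected_web), round(expected_app))  # 2381 119
```

This reproduces the transcript's figures: on expectation about 2,381 dev examples come from web pages and only about 119 from the mobile app, which is why the shuffled split aims the team at the wrong distribution.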