R tutorial - Cross-validation

The Importance of Cross-Validation in Predictive Modeling

In the previous video, we manually split our data into a single test set and evaluated out-of-sample error once. However, this process is fragile: the presence or absence of a single outlier can vastly change our out-of-sample RMSE. A better approach than a single train-test split is to use multiple test sets and average the out-of-sample error, which gives us a more precise estimate of the true out-of-sample error.

One of the most common approaches for multiple test sets is known as cross-validation, in which we split our data into 10 folds, or train-test splits. We create these folds in such a way that each point in our data set occurs in exactly one test set. This gives us 10 test sets and, better yet, means that every single point in our data set is tested exactly once. In other words, we get a test set that is the same size as our training data set but is composed entirely of out-of-sample predictions.
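The fold assignment described above can be sketched in a few lines of base R. This is only an illustration, not the code caret runs internally: the seed value and the names fold_id and test_sets are assumptions made for this example.

```r
set.seed(42)          # arbitrary seed (assumption), only for reproducibility
n <- nrow(mtcars)     # built-in mtcars data: 32 rows
k <- 10               # number of folds

# Build the labels 1..k repeated to length n, then shuffle them so that
# each row receives exactly one randomly chosen fold label
fold_id <- sample(rep(seq_len(k), length.out = n))

# Row indices that make up each of the k test sets
test_sets <- split(seq_len(n), fold_id)

# Check: there are k test sets and every row lands in exactly one of them
stopifnot(length(test_sets) == k)
stopifnot(identical(sort(unlist(test_sets, use.names = FALSE)), seq_len(n)))

table(fold_id)        # fold sizes (3 or 4 rows each for 32 rows)
```

Because 32 rows do not divide evenly by 10, two folds hold 4 rows and the remaining eight hold 3.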

We assign each row to its test set randomly to avoid any kind of systematic bias in our data. This is one of the best ways to estimate out-of-sample error for predictive models. One important note: after doing cross-validation, you throw away all the resampled models and start over. Cross-validation is only used to estimate the out-of-sample error for your model. Once you know this, you refit your model on the full training data set so as to fully exploit the information in that data set.
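To make the procedure concrete, here is a minimal hand-rolled sketch of 10-fold cross-validation in base R. The formula mpg ~ hp, the seed, and the variable names are assumptions chosen for illustration; the point is only to show the estimate-then-refit pattern.

```r
set.seed(42)                          # arbitrary seed (assumption)
k <- 10
n <- nrow(mtcars)
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fit one model per fold and record its out-of-sample RMSE
fold_rmse <- numeric(k)
for (i in seq_len(k)) {
  test_rows <- which(fold_id == i)
  # Train on every row except the current test fold
  fit_i <- lm(mpg ~ hp, data = mtcars[-test_rows, ])
  # Predict only the held-out rows
  pred_i <- predict(fit_i, newdata = mtcars[test_rows, ])
  fold_rmse[i] <- sqrt(mean((mtcars$mpg[test_rows] - pred_i)^2))
}

# Average the per-fold errors to estimate out-of-sample RMSE
cv_rmse <- mean(fold_rmse)

# The 10 resampled models are discarded; the final model uses all rows
final_model <- lm(mpg ~ hp, data = mtcars)
```

The loop fits 10 models and the last line fits one more, so the whole procedure costs 11 model fits in total.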

This by definition makes cross-validation expensive: it inherently takes 11 times as long as fitting a single model (10 cross-validation models plus the final model). The train function in caret does a different kind of resampling by default, known as bootstrap validation, but it is also capable of doing cross-validation, and in practice the two methods yield similar results.
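As a small illustration of the two resampling schemes, the sketch below builds two control objects with caret's trainControl function, which is the helper normally used to configure resampling for train. It assumes the caret package is installed; the object names are made up for this example.

```r
library(caret)

# With no arguments, trainControl() describes bootstrap resampling
default_ctrl <- trainControl()

# Requesting 10-fold cross-validation explicitly
cv_ctrl <- trainControl(method = "cv", number = 10)

default_ctrl$method   # "boot"
cv_ctrl$method        # "cv"
cv_ctrl$number        # 10
```

Either object can be passed to train; only the resampling scheme used to estimate the error changes.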

Cross-Validation with the caret Package

Let's fit a cross-validated model to the mtcars data set. First, we set the random seed, since cross-validation randomly assigns rows to each fold and we want to be able to reproduce our model exactly. The train function has a formula interface that is identical to the formula interface for the lm function in base R; however, it supports fitting hundreds of different models, which are easily specified with the method argument.

In this case, we fit a linear regression model, but we could just as easily specify method = "rf" and fit a random forest model without changing the rest of our code. This is the second most useful feature of the caret package, behind the cross-validation of models: it provides a common interface to hundreds of different predictive models. The trControl argument controls the parameters caret uses for cross-validation. In this course we will mostly use 10-fold cross-validation, but this flexible function supports many other cross-validation schemes.
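A call along these lines might look like the sketch below. The transcript does not show the exact code, so the formula mpg ~ hp and the seed value 42 are assumptions for illustration; it also assumes the caret package is installed.

```r
library(caret)

# Arbitrary seed (assumption) so the random fold assignment is reproducible
set.seed(42)

model <- train(
  mpg ~ hp,                 # formula interface, as with lm(); formula is an assumption
  data = mtcars,
  method = "lm",            # "rf" would fit a random forest (needs the randomForest package)
  trControl = trainControl(
    method = "cv",          # cross-validation instead of the default bootstrap
    number = 10,            # 10 folds
    verboseIter = TRUE      # print a progress log while the folds are fit
  )
)

# Printing the model shows the cross-validated error estimates
print(model)
```

The object returned by train holds the final model refit on all rows, and model$resample holds one row of error metrics per fold.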

Additionally, we provide the verboseIter = TRUE argument, which gives us a progress log as the model is being fit and lets us know if we have time to get coffee while the models run. Let's practice cross-validating some models.
