R Tutorial - Cross-validation

Using Multiple Test Sets to Improve Predictive Model Accuracy

In the previous video, we manually split our data into a single test set and evaluated out-of-sample error once. However, this process is a little fragile: the presence or absence of a single outlier can vastly change the out-of-sample RMSE. A more reliable approach than a simple train-test split is to use multiple test sets and average the out-of-sample error across them, which gives a more precise estimate of the true out-of-sample error.

One of the most common approaches for multiple test sets is known as cross-validation. Here we split our data into ten folds, or train-test splits, creating the folds so that each point in our data set occurs in exactly one test set. This gives us ten test sets and ensures that every point is predicted out-of-sample exactly once. Taken together, the ten test sets are the same size as our original training set but are composed entirely of out-of-sample predictions. To avoid systematic biases in our data, we assign each row to its test set randomly.
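The fold assignment described above can be sketched in base R. This is only an illustration, not how caret builds its folds internally; the seed value and the variable names (fold_id, test_sets) are assumptions chosen for the example.

```r
# Illustrative sketch: randomly assign each row of mtcars to one of 10 folds.
set.seed(42)   # arbitrary seed so the random assignment is reproducible
k <- 10
n <- nrow(mtcars)

# Repeat the labels 1..k until there is one per row, then shuffle them.
# Fold sizes therefore differ by at most one row.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Row indices of each test set; the matching training set is every other row.
test_sets <- split(seq_len(n), fold_id)

# With 32 rows and 10 folds, each test set holds 3 or 4 rows.
table(fold_id)
```

Because every row receives exactly one fold label, each row appears in exactly one test set.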

Cross-validation is one of the best ways to estimate out-of-sample error for predictive models. One important note: after doing cross-validation, you throw away all the resampled models and start over. Cross-validation is only used to estimate the out-of-sample error; once you know it, you refit your model on the full training data set to fully exploit the information in that data set. This makes cross-validation expensive, as it takes roughly 11 times as long as fitting a single model: ten cross-validation fits plus the final fit. By default, the train function in caret does a different kind of resampling known as bootstrap validation, but it also supports cross-validation, and in practice the two methods yield similar results.
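To make the "ten fits plus one" cost concrete, here is a minimal base-R sketch of the whole procedure. The formula mpg ~ hp, the seed, and the variable names are assumptions made for illustration; caret automates these steps, and its internals may differ in detail.

```r
# Illustrative sketch: 10-fold cross-validation by hand, then a final refit.
set.seed(42)   # arbitrary seed for reproducible fold assignment
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

fold_rmse <- numeric(k)
for (i in seq_len(k)) {
  is_test <- fold_id == i
  # Resampled model: fit on the other nine folds, discarded after scoring
  fit_i <- lm(mpg ~ hp, data = mtcars[!is_test, ])
  pred  <- predict(fit_i, newdata = mtcars[is_test, ])
  fold_rmse[i] <- sqrt(mean((mtcars$mpg[is_test] - pred)^2))
}

# Cross-validated estimate: the average out-of-sample RMSE over the ten folds
cv_rmse <- mean(fold_rmse)

# Eleventh fit: the final model uses every row of the training data
final_model <- lm(mpg ~ hp, data = mtcars)
```

The ten models inside the loop exist only to produce cv_rmse; final_model is the one you would keep and use for prediction.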

The train function in caret provides a common interface to hundreds of different predictive models, each selected with the method argument. In this case, we fit a linear regression model, but we could just as easily specify method = "rf" and fit a random forest model without changing any of the rest of our code. This is the second most useful feature of the caret package, behind cross-validation of models.

The trControl argument of the train function controls the parameters caret uses for cross-validation. In this course, we will mostly use 10-fold cross-validation, but this flexible function supports many other cross-validation schemes. Additionally, we set verboseIter = TRUE, which prints a progress log as the model is being fit and lets us know if we have time to get coffee while the models run.
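A call along these lines would fit the cross-validated linear regression described here. It is a sketch that assumes the caret package is installed; the formula mpg ~ hp, the mtcars data, and the seed value are illustrative choices rather than details confirmed by this article.

```r
library(caret)

# Fix the seed: fold assignment is random, and we want a reproducible fit
set.seed(42)

model <- train(
  mpg ~ hp,                 # same formula interface as lm()
  data = mtcars,
  method = "lm",            # "rf" would request a random forest instead
                            # (that needs the randomForest package installed)
  trControl = trainControl(
    method = "cv",          # cross-validation rather than the default bootstrap
    number = 10,            # ten folds
    verboseIter = TRUE      # print a progress log while fitting
  )
)

# Resampled performance estimates (RMSE, R-squared, MAE)
print(model)
```

The fitted object stores the model refit on the full data set in model$finalModel and the cross-validated error estimates in model$results.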

Practicing Cross-Validation with caret

In this article, we will explore how to use caret's train function for cross-validation. We'll start by setting the random seed: cross-validation randomly assigns rows to each fold, so a fixed seed lets us reproduce the model exactly. The train function has a formula interface identical to that of the lm function in base R, but it supports fitting hundreds of different models.

We will fit a linear regression model using the train function, which shows how easy it is to switch between models. For instance, we could just as easily specify method = "rf" and fit a random forest model without changing any of the rest of our code. This is what makes caret's common interface to hundreds of predictive models so versatile.

We will also explore the trControl argument, which controls the parameters caret uses for cross-validation. In this course we will mostly use 10-fold cross-validation, but the scheme can be changed through the options this argument accepts. Furthermore, we will set verboseIter = TRUE to display a progress log as the model is being fit, which lets us know if we have time to get coffee while the models run.
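As a rough illustration of the other resampling schemes, the control objects below could each be passed to train through trControl. The object names and the particular fold and repeat counts are assumptions picked for the example; only the trainControl arguments themselves come from caret.

```r
library(caret)

# 10-fold cross-validation, the scheme used most often in this course
cv_10 <- trainControl(method = "cv", number = 10)

# 5-fold cross-validation repeated 5 times with different random folds
cv_5x5 <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

# Bootstrap resampling with 25 resamples
boot_25 <- trainControl(method = "boot", number = 25)

# Any of these can be swapped in without touching the rest of the call
set.seed(42)   # arbitrary seed for reproducibility
model_repeated <- train(mpg ~ hp, data = mtcars, method = "lm",
                        trControl = cv_5x5)
```

Repeated cross-validation fits more models (25 here, plus the final one), trading extra run time for a less noisy error estimate.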

In conclusion, using multiple test sets through cross-validation is an excellent way to get a reliable estimate of a predictive model's accuracy. The train function in caret provides a convenient way to perform cross-validation and offers various options for customizing the process. By mastering this technique, practitioners can estimate out-of-sample error more precisely and use that estimate to improve their predictive models.

In the last video, we manually split our data into a single test set and evaluated out-of-sample error once. However, this process is a little fragile: the presence or absence of a single outlier can vastly change our out-of-sample RMSE. A better approach than a simple train-test split is using multiple test sets and averaging out-of-sample error, which gives us a more precise estimate of the true out-of-sample error.

One of the most common approaches for multiple test sets is known as cross-validation, in which we split our data into ten folds, or train-test splits. We create these folds in such a way that each point in our data set occurs in exactly one test set. This gives us ten test sets and, better yet, means that every single point in our data set occurs exactly once. In other words, we get a test set that is the same size as our training set but is composed of out-of-sample predictions. We assign each row to its test set randomly to avoid any kind of systematic bias in our data. This is one of the best ways to estimate out-of-sample error for predictive models.

One important note: after doing cross-validation, you throw away all the resampled models and start over. Cross-validation is only used to estimate the out-of-sample error for your model. Once you know this, you refit your model on the full training data set so as to fully exploit the information in that data set. This, by definition, makes cross-validation very expensive: it inherently takes 11 times as long as fitting a single model (10 cross-validation models plus the final model). The train function in caret does a different kind of resampling known as bootstrap validation, but is also capable of doing cross-validation, and the two methods in practice yield similar results.

Let's fit a cross-validated model to the mtcars data set. First, we set the random seed, since cross-validation randomly assigns rows to each fold and we want to be able to reproduce our model exactly. The train function has a formula interface which is identical to the formula interface for the lm function in base R. However, it supports fitting hundreds of different models, which are easily specified with the method argument. In this case we fit a linear regression model, but we could just as easily specify method = "rf" and fit a random forest model without changing any of our code. This is the second most useful feature of the caret package, behind the cross-validation of models: it provides a common interface to hundreds of different predictive models.

The trControl argument controls the parameters caret uses for cross-validation. In this course we will mostly use 10-fold cross-validation, but this flexible function supports many other cross-validation schemes. Additionally, we provide the verboseIter = TRUE argument, which gives us a progress log as the model is being fit and lets us know if we have time to get coffee while the models run. Let's practice cross-validating some models.