The Importance of Evaluating Classifier Performance: A Guide to Computing Accuracy and Understanding Model Complexity
Now that we know how to fit a classifier and use it to predict the labels of previously unseen data, we need to figure out how to measure its performance. In classification problems, accuracy is a commonly used metric. The accuracy of a classifier is defined as the number of correct predictions divided by the total number of data points. This raises the question: which data do we use to compute accuracy? What we really care about is how well our model will perform on new data that the algorithm has never seen before.
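As a quick illustration of the definition, here is a minimal sketch that computes accuracy by hand; the label arrays are made up purely for the example and are not taken from any dataset used in this section.

```python
import numpy as np

# Hypothetical true labels and predictions, just to illustrate the formula
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0])

# Accuracy = number of correct predictions / total number of data points
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 6 correct out of 8 -> 0.75
```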
Using the Training Data to Compute Accuracy
-------------------------------------------
One approach is to compute the accuracy on the data used to fit the classifier. However, because the classifier has already seen this data, its performance on it will not be indicative of how well it can generalize to unseen data. For this reason, it's common practice to split your data into two sets: a training set and a test set. You train, or fit, the classifier on the training set, then make predictions on the labeled test set and compare these predictions with the known labels. Finally, you compute the accuracy of your predictions.
Splitting Data into Training and Test Sets
-----------------------------------------
To split your data into two sets, we use the `train_test_split` function from scikit-learn's `model_selection` module. We first import this function. Then, we call `train_test_split` to randomly split our data. The first argument is the feature data (X), and the second argument is the targets or labels (Y). The `test_size` keyword argument specifies what proportion of the original data is used for the test set. We also specify a random state, which sets a seed for the random number generator that splits the data into train and test sets.
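Here is a minimal sketch of the call. Since the section doesn't name a dataset, it uses scikit-learn's built-in iris data as a stand-in, and the `test_size` and `random_state` values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris  # stand-in dataset for the example
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Randomly split features (X) and labels (y) into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21
)
```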
Setting the Seed for Reproducibility
-------------------------------------
Setting the same seed with the `random_state` argument later allows us to reproduce the exact split and, therefore, our downstream results. The `train_test_split` function returns four arrays: the training data (X_train), the test data (X_test), the training labels (Y_train), and the test labels (Y_test). We unpack these into four variables for easier reference.
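To see the reproducibility in action, the sketch below (again using the iris data as a placeholder) performs the same split twice with the same seed and confirms the results are identical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two calls with the same random_state yield exactly the same split
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, random_state=21)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, random_state=21)

print(np.array_equal(X_te1, X_te2))  # True: the split is reproducible
```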
Specifying the Proportion of Data to Use for Testing
------------------------------------------------
By default, `train_test_split` uses 75% of the data for training and 25% for testing. This is a good rule of thumb, but we can control the size of the test set with the `test_size` keyword argument; in this case, we set it to 30% (a quick check is sketched below). It's also best practice to perform your split so that it reflects the distribution of labels in your data: we want the labels to be distributed across the train and test sets as they are in the original data.
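As a quick sanity check, assuming the same placeholder iris data, a 30% test size leaves roughly 30% of the rows in the test set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Ask for a 30% test set instead of the default 25%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21
)

print(X_test.shape[0] / X.shape[0])  # 0.3 -> 45 of the 150 rows
```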
Ensuring a Balanced Split
-------------------------
To ensure a balanced split, we use the `stratify` keyword argument, passing it the list or array of labels (Y). This ensures that the proportions of each label are maintained in both the training and test sets, so that a skewed split does not misrepresent the classes in an imbalanced dataset.
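A minimal sketch, again assuming the iris stand-in data, shows the class proportions matching across both splits when `stratify` is used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y preserves the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# Each class accounts for the same fraction of train and test labels
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))
```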
Instantiating and Fitting the Classifier
-----------------------------------------
Once we have split our data into training and test sets, we can instantiate our classifier and fit it to the training data. We use the classifier's `fit` method to train it on the training data. Then, we make predictions on the test data with the `predict` method and store the results in a variable called `Y_pred`.
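A sketch of these steps, assuming the stratified iris split from above and an arbitrary choice of six neighbors:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# Instantiate a k-nearest neighbors classifier and fit it on the training set only
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

# Predict labels for the unseen test set
y_pred = knn.predict(X_test)
print(y_pred)
```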
Evaluating Classifier Performance
---------------------------------
To evaluate the performance of our classifier, we can use the `score` method of the model. We pass in the test data (X_test) and labels (Y_test) to compute the accuracy of our classifier. In this case, the accuracy of our K-nearest neighbors model is approximately 95%. This is a good result for an out-of-the-box model.
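The call itself looks like the sketch below, continuing with the placeholder iris data, so the printed number will not necessarily match the roughly 95% reported above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)

# score() predicts on X_test internally and returns the fraction of correct predictions
print(knn.score(X_test, y_test))
```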
Model Complexity: Understanding the Trade-Off
---------------------------------------------
As we increase the value of K in the K-nearest neighbors model, the decision boundary gets smoother and less curvy. We consider this to be a less complex model than one with a lower value of K. However, if we increase K even further and make the model even simpler, it will perform less well on both the training and test sets; this is known as underfitting. Conversely, very small values of K produce a highly complex model that follows the noise in the training data, which is known as overfitting.
Model Complexity Curves
-------------------------
To understand the trade-off between model complexity and performance, we can visualize a model complexity curve. The curve shows how the training and test accuracy of our classifier change as we increase or decrease the value of K. In this case, there is a sweet spot in the middle that gives us the best performance on the test set.
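One way to produce such a curve, assuming the same placeholder setup and an arbitrary range of K values, is sketched below.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

neighbors = np.arange(1, 26)
train_accuracies, test_accuracies = [], []

# Fit one model per value of K and record accuracy on both sets
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_accuracies.append(knn.score(X_train, y_train))
    test_accuracies.append(knn.score(X_test, y_test))

plt.plot(neighbors, train_accuracies, label="Training accuracy")
plt.plot(neighbors, test_accuracies, label="Test accuracy")
plt.xlabel("Number of neighbors (K)")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```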
Practicing Splitting Data and Evaluating Classifier Performance
----------------------------------------------------------------
Now it's your turn to practice splitting your data into training and test sets, computing accuracy on your test set, and plotting model complexity curves. Don't be afraid to experiment with different values of K and see how they affect your results. Remember to use the `train_test_split` function and the `score` method to evaluate your classifier's performance.