**Splitting a Data Set: A Crucial Step in Data Science**
When it comes to data science, one of the most critical steps is splitting a data set into training and testing sets. Holding out a testing set lets us evaluate a model on data it has not seen during training, which is how we judge whether it generalizes. In this article, we walk through an 80/20 split in the Weka data mining software, using the HCV classification ARFF file (578 instances) as the example.
**Randomizing the Data Set**
Before splitting the data set, we shuffle it so that the rows removed later do not simply reflect the original ordering of the file. In Weka's Explorer, we click Choose under Filter, drill down to filters, then unsupervised, then instance, and select the Randomize filter. Its seed number defaults to 42. Keeping a fixed seed means the data set is shuffled in the same way, and therefore split in the same way, every time we run the filter with that seed. In Python, `random.seed()` serves the same purpose. We keep the seed at 42 and click Apply.
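The effect of a fixed seed can be sketched in a few lines of Python. This is only an illustration of seeded shuffling with the standard `random` module, not a reproduction of Weka's Randomize filter; the helper name `shuffled_copy` and the ten-row stand-in data are assumptions, and the resulting order will not match Weka's.

```python
import random

def shuffled_copy(rows, seed=42):
    """Return a shuffled copy of rows; the same seed gives the same order."""
    rng = random.Random(seed)   # private generator, global random state is untouched
    out = list(rows)
    rng.shuffle(out)
    return out

rows = list(range(10))          # hypothetical stand-in for the data set rows
first = shuffled_copy(rows, seed=42)
second = shuffled_copy(rows, seed=42)

print(first == second)          # True: same seed, same shuffle
print(sorted(first) == rows)    # True: nothing lost or duplicated
```

Using a private `random.Random` instance rather than the module-level `random.seed()` keeps the shuffle reproducible even if other code draws random numbers in between.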
**Removing a Percentage of the Data Set**
Once the data set has been shuffled, we remove a percentage of it to create the training and testing sets. In Weka, we go back to Choose and pick the RemovePercentage filter from the same instance group. Its percentage defaults to 50, which would remove half of the instances, so it needs to be changed for an 80/20 split. In Python, scikit-learn's `train_test_split()` function produces a comparable split in a single call.
**Choosing the Split Ratio**
We can remove any percentage from the data set, but 80% for training and 20% for testing is a popular choice, and other ratios such as 70/30 work as well. Because the filter's default of 50% would give a 50/50 split, we change it to 20% rather than keep the default.
**Applying the Split**
Once we have chosen the split ratio, we apply it with the RemovePercentage filter. We double-click the filter's text box, set the percentage to 20, click OK, and then click Apply. Weka removes 20% of the instances, and the count drops from 578 to 462. The 462 instances that remain are the 80% training set; the removed 20% will become the testing set.
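The row counts behind a percentage-based removal are simple arithmetic. The sketch below reproduces the counts reported in the Weka walkthrough for a 578-instance file; the helper name `split_counts` is hypothetical, and rounding to the nearest whole row is an assumption chosen because it matches those counts.

```python
def split_counts(n_instances, remove_percent):
    """Return (kept, removed) row counts for a percentage-based removal."""
    # Rounding to the nearest whole row is an assumption, not a documented rule.
    removed = round(n_instances * remove_percent / 100)
    return n_instances - removed, removed

kept, removed = split_counts(578, 20)
print(kept, removed)             # 462 116

# The same arithmetic for a 70/30 split.
print(*split_counts(578, 30))    # 405 173
```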
**Saving the Training and Testing Sets**
After applying the filter, we click Save and store the remaining instances in a new file named "training". The testing set is saved the same way, in a file named "testing", once it has been extracted with the inverted selection described next.
**Undoing and Reverting the Split**
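To obtain the testing set, we first click Undo to restore the full data set. We then double-click the RemovePercentage filter again, set invertSelection to True, and click Apply. With the selection inverted, the filter removes the other 80% and keeps the 20% it removed before, so the 116 remaining instances form the testing set. The training and testing sets therefore do not share any instances.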
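The relationship between the normal and the inverted selection can be sketched as two complementary slices of the same shuffled list. This is a minimal illustration, not Weka's implementation: the function name `remove_percentage` is hypothetical, and the sketch assumes that the removed rows are taken from the start of the list and that the count is rounded to the nearest row.

```python
def remove_percentage(rows, percent, invert=False):
    """Drop the first `percent` of rows, or keep only those rows if invert is True."""
    cut = round(len(rows) * percent / 100)   # assumed rounding rule
    return rows[:cut] if invert else rows[cut:]

rows = list(range(578))                              # stand-in for the shuffled instances
training = remove_percentage(rows, 20)               # 20% removed, 80% kept
testing = remove_percentage(rows, 20, invert=True)   # exactly the 20% removed above

print(len(training), len(testing))                   # 462 116
print(set(training).isdisjoint(testing))             # True: no shared rows
```

Because both calls use the same cut point on the same ordering, every row lands in exactly one of the two sets, which is why the shuffle has to be reproducible.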
**Using the Training Set for Model Building**
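Once we have created the training set, we load it into Weka and use it as the input for model building. Training on the full data set would leave no unseen data for evaluation. In the classifier list, under trees, we choose RandomForest and build a random forest model; the first result reported is the performance on the training set itself. The scikit-learn counterpart is `RandomForestClassifier()`.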
**Evaluating the Model on the Testing Set**
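After the model has been built on the training set, we evaluate it on the testing set. Under the test options, we select Supplied test set, click Set, open the "testing" file, confirm that the class attribute is class, click Close, and then click Start. Weka reports the performance on the 116 instances of the testing set.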
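The idea of evaluating on a held-out set can be sketched without any machine learning library. The example below is an assumption-heavy illustration: a majority-class predictor stands in for the random forest, the labels are made up, and the helper names `fit_majority` and `accuracy` are hypothetical. The point is only that the model is fitted on training labels and scored on testing labels it never saw.

```python
from collections import Counter

def fit_majority(train_labels):
    """'Train' a stand-in model: remember the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

train_labels = ["active", "active", "inactive", "active"]   # made-up labels
test_labels = ["active", "inactive"]                        # made-up labels

model = fit_majority(train_labels)            # fitted on the training data only
predictions = [model] * len(test_labels)      # one prediction per unseen row
print(accuracy(predictions, test_labels))     # 0.5
```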
**Cross-Validation**
Cross-validation evaluates a model by dividing the data into several subsets (folds), holding out each fold in turn and training on the remaining ones. Here we run 10-fold cross-validation on the training set (the 80% subset), not on the testing set, so the model is trained and evaluated 10 times, each time with a different fold held out. In scikit-learn, `cross_val_score()` performs the same kind of evaluation.
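How 10 folds can be formed from 462 training instances is sketched below. This is a plain, unstratified partition written for illustration; the generator name `k_fold_indices` is hypothetical, and the way Weka assigns instances to folds may differ.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train_idx, held_out_idx) pairs; each row is held out exactly once."""
    indices = list(range(n_rows))
    base, extra = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)   # first `extra` folds get one more row
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size

folds = list(k_fold_indices(462, k=10))
print(len(folds))                                  # 10
print([len(held_out) for _, held_out in folds])    # [47, 47, 46, 46, 46, 46, 46, 46, 46, 46]
```

In each round the model would be fitted on `train` and scored on `held_out`, and the 10 scores are then averaged.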
**Conclusion**
Splitting a data set is an essential step in data science, as it allows us to evaluate our models on unseen data. In this article, we have shown how to create an 80/20 split in Weka: shuffle the data with the Randomize filter and a fixed seed, remove 20% with RemovePercentage to obtain the training set, and invert the selection to obtain the testing set. We then built a random forest on the training set, evaluated it on the supplied testing set, and ran 10-fold cross-validation on the training set.
In this video, I'm going to be showing you how to use the Weka data mining software to split the data set using the 80/20 split ratio. And so, without further ado, we're starting right now.
So the first thing that you want to do is head over to the GitHub of the Data Professor, and you want to click on the data. Today we're going to be using the HCV ARFF file, and here we're going to use the classification one. In the previous video, I've shown you how you could prepare this file, and also this file, from the descriptor values that we have computed from the PaDEL-Descriptor software. But nevertheless, if you haven't followed that video, check out the links in the description of the video, and you could also follow along by clicking on the HCV classification ARFF.
So we're going to take this ARFF file, which comprises the full data set, and then we're going to be splitting it into two segments using a split ratio of 80/20. The first segment will be used for the model building, and it's going to be the training set, and then the remainder, 20%, will be used as the testing set. Okay, so let's download this file: right-click on Download and then Save Link As, save it, and you could save it onto the desktop. But because I already have it, I'll just use the one I have.
All right, and then we're going to fire up the Weka software and head over to the Explorer. Click on Open file, and because the file is on the desktop, we're going to go to the desktop here and then open the ARFF file. Notice here that there are a total of 578 rows, or instances, or the number of compounds.
Click on Choose under Filter, drill down the filters, and then drill down to unsupervised. You have two options here, attribute and instance. Because we want to be removing a percentage of the data set to be used as the training and the testing set, we're going to go with instance, and it will give you a lot of other options for what you could do with the instances of the data set. Here we're going to be removing a percentage of it. Before doing that, you could also randomize your data set, so why don't we do that. Let's randomize the data set. The seed number is set by default to 42, and the advantage of setting a seed number is that it will allow us to arrive at the same solution every time we run it using the same seed number. So here we're going to have the seed at 42, which is the default, and then click on Apply, and it's going to be reshuffling the data set.
Now we're going to go back to Choose and click on RemovePercentage. The default will be 50, so double-click on this area here, and we're going to set it to 20. Click on OK. So 20 here means that we're going to be removing 20% of the data set. Let's take a look here after we click on Apply: 578 should be 20% less. All right, and now it's reduced to 462, so now we have the training set. Okay, so 462 is the 80% subset. We're going to save this file, so click on the Save button here, and we're going to add to the name; we're going to create a new file and call it training. Save it.
Then we're going to undo, okay, and then we're going to double-click on it again, and now we're going to set invertSelection to True and then apply it. Now we have 80% removed, so it's the inverted selection, and so now we have the testing set. We're going to save this, so click on the Save button, and we're going to call it testing. Okay, and there you have it, the training set and the testing set. The training set will comprise 80% and the testing set will comprise 20%.
All right, so let's open up the training set, and we could use this as the training set for the model building. In prior videos I've been using the full data set as training, so actually, theoretically, that's not correct, because I'm just using the entire data set. What you need to do is split the data as shown in this video, and then we're going to load up the training data set and use the 80% subset as the training set. Okay, so let's try it now. Go to trees; let's build a random forest model. Okay, so this is the performance on the training set.
Now let's say that we want to use this trained model to test on the testing set. You want to click on Supplied test set and then click on the Set button, Open file, and select the HCV classification testing file. The default class is class, that's correct. Now you want to click on Close and click on Start. All right, and now you see that there are 116 data samples in the testing set, and so this is the performance on the testing set. As for the cross-validation, it's going to be using the 80% subset, the training set here, for the 10-fold cross-validation. Okay, so this is the performance for the 10-fold cross-validation.
Make note here, as I mentioned before, that we have already set the seed to 42, which is the default. What that essentially will do is reshuffle the data set and set it to a particular shuffling, so that the next time around, when you're using the same seed number, it's going to be shuffling in the same manner. Therefore the data samples that are split into the 80% and 20% will be the same every time you run it using the same seed number. All right, and congratulations, you have now successfully split your data set into 80/20.
And so feel free to modify the split ratio to another ratio, like 70/30, but the one that I like to use is 80/20, and I think it's quite popular, so we're using it in this video. As always, the best way to learn data science is to do data science, and please enjoy the journey.