Python Machine Learning Tutorial: Centering and Scaling (Databytes)

**Creating a K-Nearest Neighbors Model with Scikit-Learn**

The first step in creating a k-nearest neighbors model is to create a k-nearest neighbors classifier object. This can be done with the following code in a Python script:

```python
from sklearn.neighbors import KNeighborsClassifier

# Create a k-nearest neighbors classifier object
knn = KNeighborsClassifier()
```

The `KNeighborsClassifier` class creates an instance of the k-nearest neighbors classifier. The default arguments are sufficient for this example, but the model can be customized by passing additional arguments.
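For example, the number of neighbors and the weighting scheme can be set explicitly. The values below are illustrative assumptions rather than the tutorial's choices:

```python
from sklearn.neighbors import KNeighborsClassifier

# Illustrative hyperparameters (not the tutorial's defaults): use 7 neighbors
# and weight them by inverse distance
knn_custom = KNeighborsClassifier(n_neighbors=7, weights="distance")
```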

**Fitting the Model**

Once the model has been created, the next step is to fit it to the training data. This can be done by calling the `fit` method on the model object and passing in the training features (`x_train`) and labels (`y_train`):

```python
from sklearn.model_selection import train_test_split

# x holds the feature columns and y the target labels prepared earlier
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit the model to the training data
knn.fit(x_train, y_train)
```

The `fit` method trains the classifier by storing the training observations and their labels; these stored examples are then used to find the nearest neighbors when making predictions.
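Once fitted, the classifier can also make predictions directly with the `predict` method. A minimal sketch, not shown in the original tutorial:

```python
# Predict the class label for each observation in the test set
predictions = knn.predict(x_test)
print(predictions[:5])  # inspect the first few predictions
```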

**Measuring Model Accuracy**

After fitting the model, it's possible to measure its accuracy on the testing data. This can be done by calling the `score` method on the model object and passing in the testing features (`x_test`) and labels (`y_test`):

```python
# Measure the accuracy of the model on the testing data
accuracy = knn.score(x_test, y_test)
```

The `score` method returns an accuracy score, which is a value between 0 and 1 that represents the proportion of correct predictions.
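The same number can be computed explicitly by generating predictions and comparing them to the true labels; a minimal sketch using `accuracy_score`, which is equivalent to `score` for a classifier:

```python
from sklearn.metrics import accuracy_score

# Equivalent to knn.score: compare predicted labels with the true labels
y_pred = knn.predict(x_test)
print(accuracy_score(y_test, y_pred))
```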

**Standardizing Data**

If the accuracy of the model is not satisfactory, it may be worth standardizing the data. This involves centering and scaling the features to have zero mean and unit variance. Scikit-learn provides a `StandardScaler` class that can be used to do this:

```python
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the training data and transform it
x_train_scaled = scaler.fit_transform(x_train)

# Transform the testing data using the training-set statistics
x_test_scaled = scaler.transform(x_test)
```

The `fit_transform` method fits the scaler to the training data (computing each feature's mean and standard deviation) and transforms it in one step; the testing data is then transformed with `transform` only, reusing the statistics learned from the training set. It is important to do this after the train/test split and to fit the scaler only on the training data, otherwise information from the test set leaks into training (a problem known as data leakage).
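Under the hood, the transformation is simply "subtract the training mean, divide by the training standard deviation". A rough sketch of the equivalent NumPy computation (StandardScaler uses the population standard deviation, i.e. `ddof=0`):

```python
import numpy as np

# Compute centering and scaling statistics from the training data only
train_mean = np.mean(np.asarray(x_train), axis=0)
train_std = np.std(np.asarray(x_train), axis=0)  # population std, matching StandardScaler

# Apply the same training-set statistics to both splits to avoid data leakage
x_train_manual = (np.asarray(x_train) - train_mean) / train_std
x_test_manual = (np.asarray(x_test) - train_mean) / train_std
```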

**Comparing Model Performance**

After standardizing the data, it's possible to re-fit the model and measure its accuracy on the testing data:

```python
# Fit the model to the scaled training data
knn.fit(x_train_scaled, y_train)

# Measure the accuracy of the model on the scaled testing data
accuracy = knn.score(x_test_scaled, y_test)
```

Because k-nearest neighbors relies on distances between observations, features with very different ranges would otherwise dominate the distance calculation. Standardizing the data therefore typically improves accuracy, and the improvement is reflected in the value returned by the `score` method.
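An equivalent and often tidier pattern is to chain the scaler and the classifier in a scikit-learn `Pipeline`, which applies `fit_transform` to the training data and `transform` to the test data automatically. A minimal sketch:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Chain scaling and classification so the scaler is fitted on training data only
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipeline.fit(x_train, y_train)
accuracy_scaled = pipeline.score(x_test, y_test)
```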

**Choosing the Right Scaler**

Scikit-learn provides several other scaler classes that can be used depending on the nature of the data. For example, `RobustScaler` subtracts the median and divides by the interquartile range, which makes it better suited to data with outliers or heavy skew, while `MaxAbsScaler` divides each feature by its maximum absolute value so that values fall between -1 and 1. `MinMaxScaler` rescales each feature to a fixed range (commonly 0 to 1), and `Normalizer` works row-wise, scaling each row so that the sum of its squared values equals one.
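All of these scalers share the same `fit`/`transform` interface, so swapping one for another is a one-line change. A minimal sketch, fitting on the training data only as before:

```python
from sklearn.preprocessing import RobustScaler, MaxAbsScaler, MinMaxScaler

# RobustScaler: subtracts the median and divides by the interquartile range
x_train_robust = RobustScaler().fit_transform(x_train)

# MaxAbsScaler: divides each feature by its maximum absolute value (range -1 to 1)
x_train_maxabs = MaxAbsScaler().fit_transform(x_train)

# MinMaxScaler: rescales each feature to a fixed range, 0 to 1 by default
x_train_minmax = MinMaxScaler().fit_transform(x_train)
```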

**Additional Resources**

For more information on preprocessing data for machine learning in Python, there are several resources available:

* **Preprocessing for Machine Learning in Python**: This course provides an overview of why and how to preprocess data for machine learning.

* **Feature Engineering for Machine Learning in Python**: This course covers the principles and practices of feature engineering, including data preprocessing techniques.

* **Scikit-learn's preprocessing tutorial**: The library's own free documentation on preprocessing data covers the available scalers and transformers.

**Interesting Question on Quora**

The author recently came across a question on Quora that explores the relationship between scaling and normalization in machine learning. The question highlights the importance of understanding when to use each technique and how they can impact model performance.

Overall, standardizing the data with `StandardScaler` was sufficient to improve the accuracy of the k-nearest neighbors model from 55% to over 70%. This highlights the importance of preprocessing data in machine learning and demonstrates how scaling and normalization techniques can be used to improve model performance.

"WEBVTTKind: captionsLanguage: enhi i'm richie in this python machine learning tutorial we're going to look at a data pre-processing technique called centering and scaling there comes a pair i'm using datacamp workspace for this but you can use any python environment and before we start writing any code we're going to take a look at a bit of theory so first of all what is data preprocessing essentially data pre-processing is just a type of data manipulation and the idea is that you want to get your data set ready for fitting a machine learning model now we're assuming that your data set is going to be in the form of a pandas data frame or something similar and for each of the columns that are going to be input into the model these called features and the idea of data pre-processing is you can do some transformations on these features so what are centering and scaling centering means that you calculate the mean of a feature and then you subtract that mean from each element so the resulting processed feature then has a mean of zero similarly scaling means dividing each element of the feature by the standard deviation and that means that the processed feature is going to have a standard deviation of 1. one thing to note is that some people tend to use the term standardizing instead of scaling it means the same thing you might wonder when you should use centering and scaling and it all depends on the type of model that you're fitting so some models are going to assume that each feature has a standard normal distribution so standard normal means it's got a mean of 0 and a standard deviation of 1. so if you are using a k-nearest neighbors model if you're using support vector machines with the radial basis function kernel trick or if you're doing regularized regression so that means lasso or ridge regression in this case centering and scaling the features is absolutely essential for some other model types centering and scaling aren't essential steps but they can help with convergence of the model so that means sort of deep down in the bowels of the algorithm is going to help out you'll know if you have if you'll know if you have a problem with convergence because you're going to start getting warnings or errors in your code saying there's a problem with the convergence so that includes linear regression logistic regression and neural networks so in those cases centering and scaling are helpful but not necessary and for some other model types that includes tree based models so decision trees random forest gradient boost anything else like that also naive bayes centering and scaling have absolutely no effect there is no point in bothering with it now in this tutorial we're going to use psychic learn however pretty much every machine learning framework is going to have tools for centering and scaling features so you can use pi carrot you can use keras you can use whatever you want also because data pre-processing is essentially just data manipulation you can use data manipulation tools like pandas and numpy let's try a case study so we're going to fit a k-nearest neighbors model and we're going to use the diamonds data set this is a pretty famous data set that's been around in the public domain for a while it was originally made popular in our's gg plot 2 package and it is available in python via the plot9 package so the first thing i'm going to do is import this data set from plot 9. 
and i'm going to print the data set here you can see we've got just over 50 000 rows each row represents a diamond and when we do the modeling i'm going to use cut as the response variable and i'm going to use all the numeric features for features so just for ease of coding i'm going to ignore color and clarity for now so we're going to use carrot and then depth through to the z dimension as features before we do any modeling i'm going to calculate some summary statistics using the describe method so if we look at the carrot column you can see the smallest value is 0.2 and the largest value is just over five so by contrast if we look at the price column i think this is in us dollars it goes from about 300 to about 19 000 and the sad truth here that you can't buy a one carat diamond for one us dollar but what this means from a modeling point of view is that uh these different features have very different ranges that is they are they have different scales and this is gonna cause problems with our k nearest neighbors model so we actually only need three functions here so we're going to use train test split and this is going to split the data set into the training testing sets we're going to use k neighbors classifier to fit the model um if you're from a commonwealth country note that we use u.s english here and we're going to use standard scalar to scale the features so i'm going to import these three functions now i'm going to do this three times so i'm just going to copy and paste because i'm a bad typist so let's bring in train test split so this is in the model uh selection sub module of psychic learn oops deleted the end we're gonna get uh k neighbors classifier and this is in the neighbors kind of spell it correctly nay bars sub module of psychic learn and we've got a standard scalar so this is in the pre processing sub module let me run this make sure i've titled those correctly no errors so that seems to have worked now the next thing we're going to get the uh the response variable and the features out of that data set so first i'm just going to call these x and y it's boring names but fairly standard so the response variable that's going to be diamonds and we're going to take the cut column and then for the features we want everything except uh following columns so we don't want cut because that is the response and i said we're going to ignore color and clarity because the categorical variables can require a bit more effort to deal with so i'm going to run that that argument should be columns not column and now we're going to do the train test split which is going to use the default arguments and the trick with calling this function is to remember which way around the four results come so we call these x train x test y train and y test i think that's right order let's run this so the first step in running uh k-nearest neighbors model is to create a k-neighbor's classifier object so i'm going to call this knn and i am going to copy and paste this to his name uh we're not going to bother with any arguments just using the defaults so let's run this and we have a k k neighbors classifier object so to fit the model we just call knn.fit i'm going to fit it to the training data so we're going to pass x train and y train and run that the output's not very interesting however now we can measure the accuracy of the predictions on the testing data set so this time i'm going to call uh knn.score so score gives you the accuracy score of the model and we're going to pass it x test and y test let's 
run that so we get 0.55 so it's 55 accurate um that doesn't sound great it's hard to tell whether that's good or bad though let's try running the model again with standard with standardizing um so we're gonna do um centering and we're gonna do uh scaling so uh first of all we're going to create a standard scale object and one thing that's really important is that you have to do this scaling process after you've done the train test split so if you do the standardization um before you've split up the data you're going to have information from the testing data set leaking into the training set so you have extra information that you shouldn't know so there's a problem called data leakage so it's just really important to do the train test split first and then you do the uh any sort of manipulating the features afterwards so we can do the same line of code essentially twice so i'm going to call uh the resulting variable x train scaled and we're going to call the fit transform method and so what fitting means essentially it's calculating the mean and standard deviation of the features and then the transform part is when you subtract the mean and divide by the standard deviation so we can do both these things in a single step with this method and we need to pass it the training features so i'm just going to copy and paste this again so we moved twice once on the training set once on the testing set and let's run this and we're going to fit the model again so we're going to call knn dot fit and this time we're going to pass it x train scaled and we're going to use the same response as before so oops not y test y train let's run that and we're going to calculate the accuracy score again on the testing set so we're going to call k n dot score and this time we're going to pass it x test scaled and y test should run this and you see now it's much higher so before it was 0.55 we had 55 accuracy now we have more than 70 accuracy so it's gone up 15 from just two lines of code and that's pretty much the best situation you can ask for when you're doing machine learning it's like watch out kaggle grandmasters we're getting some high predictions so that's pretty great now it's worth noting that standard scale isn't the only scaler available in scikit-learn there are a few others available so um if you have a lot of um outliers or really skewed data then robust scalar might make more sense so rather than subtracting the mean and divided by the standard deviation it subtracts the median divided by the interquartile range so it's kind of uh a good way of dealing with outliers you've got max ab scalar divides by the largest absolute value in each feature so that means that um the sort of maximum range of um each feature is gonna be from minus one to one uh that's occasionally useful and similarly with mid max scalar this will convert all values to a range so you quite often see data sets where it goes from like not to one or something like that normalize is slightly different because it um transforms rows rather than columns and normalizer will scale each row so the sum of the squares of each values in that row adds up to one if you are interested in learning more there are a couple of data camp courses on this area so we've got one called uh preprocessing for machine learning and python one called feature engineering from machine learning in python these both help you prepare data for running models scikit-learn has its own free pre-processing data tutorial and as i was researching this i came across a very 
interesting question on quora so this goes into a bit more depth in terms of which models need scaling and normalization and why all right happy data processing thanks youhi i'm richie in this python machine learning tutorial we're going to look at a data pre-processing technique called centering and scaling there comes a pair i'm using datacamp workspace for this but you can use any python environment and before we start writing any code we're going to take a look at a bit of theory so first of all what is data preprocessing essentially data pre-processing is just a type of data manipulation and the idea is that you want to get your data set ready for fitting a machine learning model now we're assuming that your data set is going to be in the form of a pandas data frame or something similar and for each of the columns that are going to be input into the model these called features and the idea of data pre-processing is you can do some transformations on these features so what are centering and scaling centering means that you calculate the mean of a feature and then you subtract that mean from each element so the resulting processed feature then has a mean of zero similarly scaling means dividing each element of the feature by the standard deviation and that means that the processed feature is going to have a standard deviation of 1. one thing to note is that some people tend to use the term standardizing instead of scaling it means the same thing you might wonder when you should use centering and scaling and it all depends on the type of model that you're fitting so some models are going to assume that each feature has a standard normal distribution so standard normal means it's got a mean of 0 and a standard deviation of 1. so if you are using a k-nearest neighbors model if you're using support vector machines with the radial basis function kernel trick or if you're doing regularized regression so that means lasso or ridge regression in this case centering and scaling the features is absolutely essential for some other model types centering and scaling aren't essential steps but they can help with convergence of the model so that means sort of deep down in the bowels of the algorithm is going to help out you'll know if you have if you'll know if you have a problem with convergence because you're going to start getting warnings or errors in your code saying there's a problem with the convergence so that includes linear regression logistic regression and neural networks so in those cases centering and scaling are helpful but not necessary and for some other model types that includes tree based models so decision trees random forest gradient boost anything else like that also naive bayes centering and scaling have absolutely no effect there is no point in bothering with it now in this tutorial we're going to use psychic learn however pretty much every machine learning framework is going to have tools for centering and scaling features so you can use pi carrot you can use keras you can use whatever you want also because data pre-processing is essentially just data manipulation you can use data manipulation tools like pandas and numpy let's try a case study so we're going to fit a k-nearest neighbors model and we're going to use the diamonds data set this is a pretty famous data set that's been around in the public domain for a while it was originally made popular in our's gg plot 2 package and it is available in python via the plot9 package so the first thing i'm going to do is import this 
data set from plot 9. and i'm going to print the data set here you can see we've got just over 50 000 rows each row represents a diamond and when we do the modeling i'm going to use cut as the response variable and i'm going to use all the numeric features for features so just for ease of coding i'm going to ignore color and clarity for now so we're going to use carrot and then depth through to the z dimension as features before we do any modeling i'm going to calculate some summary statistics using the describe method so if we look at the carrot column you can see the smallest value is 0.2 and the largest value is just over five so by contrast if we look at the price column i think this is in us dollars it goes from about 300 to about 19 000 and the sad truth here that you can't buy a one carat diamond for one us dollar but what this means from a modeling point of view is that uh these different features have very different ranges that is they are they have different scales and this is gonna cause problems with our k nearest neighbors model so we actually only need three functions here so we're going to use train test split and this is going to split the data set into the training testing sets we're going to use k neighbors classifier to fit the model um if you're from a commonwealth country note that we use u.s english here and we're going to use standard scalar to scale the features so i'm going to import these three functions now i'm going to do this three times so i'm just going to copy and paste because i'm a bad typist so let's bring in train test split so this is in the model uh selection sub module of psychic learn oops deleted the end we're gonna get uh k neighbors classifier and this is in the neighbors kind of spell it correctly nay bars sub module of psychic learn and we've got a standard scalar so this is in the pre processing sub module let me run this make sure i've titled those correctly no errors so that seems to have worked now the next thing we're going to get the uh the response variable and the features out of that data set so first i'm just going to call these x and y it's boring names but fairly standard so the response variable that's going to be diamonds and we're going to take the cut column and then for the features we want everything except uh following columns so we don't want cut because that is the response and i said we're going to ignore color and clarity because the categorical variables can require a bit more effort to deal with so i'm going to run that that argument should be columns not column and now we're going to do the train test split which is going to use the default arguments and the trick with calling this function is to remember which way around the four results come so we call these x train x test y train and y test i think that's right order let's run this so the first step in running uh k-nearest neighbors model is to create a k-neighbor's classifier object so i'm going to call this knn and i am going to copy and paste this to his name uh we're not going to bother with any arguments just using the defaults so let's run this and we have a k k neighbors classifier object so to fit the model we just call knn.fit i'm going to fit it to the training data so we're going to pass x train and y train and run that the output's not very interesting however now we can measure the accuracy of the predictions on the testing data set so this time i'm going to call uh knn.score so score gives you the accuracy score of the model and we're going to pass it x 
test and y test let's run that so we get 0.55 so it's 55 accurate um that doesn't sound great it's hard to tell whether that's good or bad though let's try running the model again with standard with standardizing um so we're gonna do um centering and we're gonna do uh scaling so uh first of all we're going to create a standard scale object and one thing that's really important is that you have to do this scaling process after you've done the train test split so if you do the standardization um before you've split up the data you're going to have information from the testing data set leaking into the training set so you have extra information that you shouldn't know so there's a problem called data leakage so it's just really important to do the train test split first and then you do the uh any sort of manipulating the features afterwards so we can do the same line of code essentially twice so i'm going to call uh the resulting variable x train scaled and we're going to call the fit transform method and so what fitting means essentially it's calculating the mean and standard deviation of the features and then the transform part is when you subtract the mean and divide by the standard deviation so we can do both these things in a single step with this method and we need to pass it the training features so i'm just going to copy and paste this again so we moved twice once on the training set once on the testing set and let's run this and we're going to fit the model again so we're going to call knn dot fit and this time we're going to pass it x train scaled and we're going to use the same response as before so oops not y test y train let's run that and we're going to calculate the accuracy score again on the testing set so we're going to call k n dot score and this time we're going to pass it x test scaled and y test should run this and you see now it's much higher so before it was 0.55 we had 55 accuracy now we have more than 70 accuracy so it's gone up 15 from just two lines of code and that's pretty much the best situation you can ask for when you're doing machine learning it's like watch out kaggle grandmasters we're getting some high predictions so that's pretty great now it's worth noting that standard scale isn't the only scaler available in scikit-learn there are a few others available so um if you have a lot of um outliers or really skewed data then robust scalar might make more sense so rather than subtracting the mean and divided by the standard deviation it subtracts the median divided by the interquartile range so it's kind of uh a good way of dealing with outliers you've got max ab scalar divides by the largest absolute value in each feature so that means that um the sort of maximum range of um each feature is gonna be from minus one to one uh that's occasionally useful and similarly with mid max scalar this will convert all values to a range so you quite often see data sets where it goes from like not to one or something like that normalize is slightly different because it um transforms rows rather than columns and normalizer will scale each row so the sum of the squares of each values in that row adds up to one if you are interested in learning more there are a couple of data camp courses on this area so we've got one called uh preprocessing for machine learning and python one called feature engineering from machine learning in python these both help you prepare data for running models scikit-learn has its own free pre-processing data tutorial and as i was researching this i 
came across a very interesting question on quora so this goes into a bit more depth in terms of which models need scaling and normalization and why all right happy data processing thanks you\n"