#26 Machine Learning Specialization [Course 1, Week 2, Lesson 2]

Feature Scaling Techniques for Gradient Descent

When it comes to training machine learning models using gradient descent, feature scaling is an essential technique that can significantly impact the performance and convergence of the algorithm. In this article, we will explore three common feature scaling techniques: mean normalization, scaling by the maximum, and z-score normalization (also known as standardization).

Mean Normalization

-----------------

One of the simplest feature scaling techniques is mean normalization. This method subtracts the mean value of each feature from its original values, centering the feature around zero, and then divides by the feature's range (the maximum value minus the minimum value), which brings the values into a common range of roughly -1 to 1.

For example, let's say we have a feature X1 that ranges from 300 to 2,000 square feet, with a mean of 600 square feet. We can calculate the mean-normalized X1 by subtracting the mean from each value and then dividing by the difference between the maximum and minimum values. In this case, the calculation would be:

(X1 - μ1) / (2000 - 300)

where μ1 is the mean value of feature X1.

This technique maps the values of X1 into the range -0.18 to 0.82, making it easier for gradient descent to converge. Similarly, we can apply this method to feature X2, which ranges from 0 to 5 with a mean of 2.3, resulting in a normalized range from -0.46 to 0.54.
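
To make this concrete, here is a minimal NumPy sketch of mean normalization. The feature values are made-up examples, and the comments relate the result back to the example statistics above (μ1 = 600, range 2,000 - 300); on a real training set you would compute the mean, minimum, and maximum from the data itself.

```python
import numpy as np

# Illustrative house sizes in square feet (not real data)
x1 = np.array([300.0, 500.0, 600.0, 1200.0, 2000.0])

# Mean normalization: (x - mean) / (max - min).
# With the example statistics from the text (mean 600, min 300, max 2000),
# the smallest value maps to (300 - 600) / 1700, about -0.18, and the
# largest to (2000 - 600) / 1700, about 0.82.
mu1 = 600.0                     # in practice: x1.mean() on the training set
x1_min, x1_max = 300.0, 2000.0  # in practice: x1.min(), x1.max()

x1_norm = (x1 - mu1) / (x1_max - x1_min)
print(x1_norm)  # roughly [-0.18, -0.06, 0.0, 0.35, 0.82]
```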

Scaling by the Maximum

----------------------

Another simple feature scaling technique is to divide each feature by the maximum value of its range, so that the scaled values fall between roughly 0 and 1.

For example, let's say feature X1 ranges from 300 to 2,000 square feet. We can compute a scaled version of X1 by dividing each original value by the maximum:

X1 / 2000

The scaled X1 then ranges from 0.15 up to 1. Similarly, if feature X2 ranges from 0 to 5, dividing each original X2 value by 5 (again, the maximum) yields a scaled X2 that ranges from 0 to 1.
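
Here is a minimal sketch of scaling by the maximum, reusing the illustrative house sizes from before (the values for X2 are also made up):

```python
import numpy as np

x1 = np.array([300.0, 500.0, 600.0, 1200.0, 2000.0])  # square feet
x2 = np.array([0.0, 1.0, 2.0, 3.0, 5.0])              # e.g., number of bedrooms

# Divide each feature by its maximum so the scaled values fall in (0, 1]
x1_scaled = x1 / x1.max()  # smallest value becomes 300 / 2000 = 0.15
x2_scaled = x2 / x2.max()  # ranges from 0 to 1
```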

Z-Score Normalization

-----------------------

A third feature scaling technique is z-score normalization, often called standardization: it subtracts the mean value of each feature and then divides by that feature's standard deviation (σ).

For example, let's say we have a feature X1 with a mean of 600 square feet and a standard deviation of 450. We can z-score normalize X1 by subtracting its mean from each value and then dividing by its standard deviation:

(X1 - μ1) / σ1

where μ1 is the mean value and σ1 is the standard deviation.

This technique maps the values of X1 into the range -0.67 to 3.1, which can lead to faster convergence for gradient descent. Similarly, applying this method to feature X2, with a mean of 2.3 and a standard deviation of 1.4, maps its values into the range -1.6 to 1.9.
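
Here is a minimal NumPy sketch of z-score normalization. The code computes μ and σ from the (made-up) data, as you would on a real training set; the comment relates the result back to the example statistics in the text (μ1 = 600, σ1 = 450).

```python
import numpy as np

x1 = np.array([300.0, 500.0, 600.0, 1200.0, 2000.0])  # illustrative square footage

# Z-score normalization: (x - mean) / standard deviation
mu1 = x1.mean()
sigma1 = x1.std()
x1_norm = (x1 - mu1) / sigma1

# The result has mean 0 and standard deviation 1. With the text's example
# statistics (mu = 600, sigma = 450), the minimum 300 would map to
# (300 - 600) / 450, about -0.67, and the maximum 2000 to
# (2000 - 600) / 450, about 3.1.
print(x1_norm)
```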

When to Rescale Features

------------------------

When deciding whether a feature needs rescaling, a useful rule of thumb is to aim for each feature to range from roughly -1 to +1. These bounds are loose: ranges like -3 to +3 or -0.3 to +0.3 are also perfectly fine. More concretely:

* A feature that already falls in a moderate range, such as 0 to 3 or -2 to +0.5, can usually be left alone, though rescaling it does no harm.

* A feature with a much larger range, such as -100 to +100, is best rescaled to something closer to -1 to +1.

* A feature with very small values, such as -0.001 to +0.001, should also be rescaled.

* A feature whose values are large but sit in a narrow band, such as a body temperature between 98.6 and 105 degrees Fahrenheit, will slow gradient descent down and benefits from rescaling.

There is almost never any harm in rescaling, so when in doubt, carry it out. Note also that feature scaling does not change the underlying relationships between features: each of these transformations is linear, so the scaled values capture the same patterns and trends as the original data. A small heuristic check is sketched below.
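
As a rough illustration of this rule of thumb, here is a small hypothetical helper (not from the course) that flags features whose values fall far outside the loose -1 to +1 target:

```python
import numpy as np

def needs_rescaling(x, too_small=0.3, too_large=3.0):
    """Heuristic check: flag a feature whose largest magnitude is far outside
    the loose -1 to +1 target range. The 0.3 and 3.0 cutoffs mirror the
    'around -0.3 to +0.3' and 'around -3 to +3' bounds discussed above."""
    span = np.max(np.abs(x))
    return span > too_large or span < too_small

print(needs_rescaling(np.array([-2.0, 0.5])))      # False: moderate range, fine as-is
print(needs_rescaling(np.array([-100.0, 100.0])))  # True: range far too large
print(needs_rescaling(np.array([-0.001, 0.001])))  # True: values far too small
print(needs_rescaling(np.array([98.6, 105.0])))    # True: large offset values
```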

Conclusion

----------

Feature scaling is an essential technique for gradient descent that can significantly impact its performance and convergence. By understanding and applying the right feature scaling technique, you can improve the accuracy and efficiency of your machine learning models. Whether you choose mean normalization, scaling by the maximum, or z-score normalization, remember to always consider the characteristics of your data and the specific requirements of your algorithm.

Gradient Descent: Convergence and the Learning Rate

--------------------------------------

When running gradient descent, it's essential to know whether the algorithm is converging and approaching the minimum of the cost function (or something close to it). In this article, we will explore how to recognize convergence in gradient descent and how to choose a suitable learning rate.

Recognizing Convergence in Gradient Descent

---------------------------------------------

Gradient descent is an optimization algorithm that iteratively updates the parameters of a model based on the gradient of the loss function. To determine whether gradient descent has converged, you need to examine the behavior of the algorithm as it iterates. Here are some common signs of convergence:

* The magnitude of the gradient decreases over time.

* The loss function converges to a minimum value.

* The parameters of the model stabilize.

However, these signs alone may not be sufficient to confirm convergence. To be more confident, you can perform some additional checks, such as:

* Plotting a learning curve: the value of the cost function J after each iteration. If gradient descent is working properly, J should decrease after every single iteration.

* Using an automatic convergence test: declare convergence once J decreases by less than some small threshold ε (for example, 0.001) in a single iteration.

* Watching for warning signs: if J ever increases, the learning rate is probably too large, or there is a bug in the code, as shown in the sketch below.
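
To make these checks concrete, here is a minimal sketch of convergence monitoring on a toy one-parameter quadratic cost. The cost function, learning rate, and ε threshold are all illustrative choices, not values from the course.

```python
def cost(w):
    return (w - 3.0) ** 2      # toy convex cost with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)     # derivative of the cost

w, alpha, epsilon = 0.0, 0.1, 1e-3
prev_j = cost(w)

for i in range(1000):
    w -= alpha * grad(w)       # gradient descent update
    j = cost(w)
    if j > prev_j:             # cost increased: alpha too large, or a bug
        print(f"Warning: cost increased at iteration {i}")
    if prev_j - j < epsilon:   # automatic convergence test
        print(f"Converged after {i + 1} iterations, w = {w:.4f}")
        break
    prev_j = j
```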

Choosing a Suitable Learning Rate

-------------------------------------

The learning rate is a hyperparameter that controls the step size of each iteration in gradient descent. Choosing the right learning rate can significantly impact the convergence and accuracy of your algorithm. Here are some guidelines to help you choose a suitable learning rate:

* Try a range of values spaced roughly 3x apart, such as 0.001, 0.003, 0.01, 0.03, and so on, and pick the largest rate that still decreases the cost on every iteration (a sketch of this search appears after this list).

* Use an adaptive learning rate, such as Adam or RMSProp, which adjusts the learning rate based on the magnitude of the gradient.

* Consider using a warm-up learning rate, which starts at a low value and increases over time.
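
As a hedged illustration of that search, the sketch below reuses the toy quadratic cost from the previous example; the candidate rates and the 100-iteration budget are arbitrary demonstration choices.

```python
def cost(w):
    return (w - 3.0) ** 2      # same toy convex cost as before

def grad(w):
    return 2.0 * (w - 3.0)

# Candidate learning rates spaced roughly 3x apart
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    w, prev_j = 0.0, cost(0.0)
    verdict = "still decreasing after 100 iterations (possibly too small)"
    for _ in range(100):
        w -= alpha * grad(w)
        j = cost(w)
        if j >= prev_j:        # cost failed to decrease: rate too large
            verdict = "cost failed to decrease (too large)"
            break
        if prev_j - j < 1e-9:  # negligible improvement: treat as converged
            verdict = "converged"
            break
        prev_j = j
    print(f"alpha={alpha}: {verdict}")
```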

In conclusion, monitoring the learning curve tells you whether gradient descent is converging, and a well-chosen learning rate determines how quickly it gets there. Combined with the feature scaling techniques covered earlier, these tools can significantly improve the efficiency and reliability of training your machine learning models.

"WEBVTTKind: captionsLanguage: enlet's look at how you can Implement feature scaling to take features that take on very different ranges of values and scale them to have comparable ranges of value to each other so how do you actually scale features well if X1 ranges from three to two thousand one way to get the scale version of X1 is to take each original X1 value and divide by 2000 the maximum of the range so the scale of X1 will range from 0.15 up to one similarly since X2 ranges from 0 to 5 you can calculate a scale version of X2 by taking each original X2 and dividing by 5 which is again the maximum so the scale X2 will now range from 0 to 1. so if you plot the scaled X1 and X2 on the graph it might look like this in addition to dividing by the maximum you can also do what's called mean normalization so what this looks like is you started the original features and then you rescale them so that both of them are centered around zero so whereas before they only had values greater than zero now they have both negative and positive values but maybe usually between negative one and plus one so to calculate the mean normalization of X1 first find the average also called the mean of X1 on your training set and let's call this mean new one with this being the Greek alphabet mu for example you may find that the average of feature one mu 1 is 600 square feet so let's take each X1 subtract the mean mu 1 and then let's divide by the difference 2000 minus 300 where 2000 is the maximum and 300 the minimum and if you do this you get the normalized X1 to range from negative 0.18 to 0.82 similarly to mean normalize X2 you can calculate the average of feature 2 and for instance mu 2 may be 2.3 then you can take each X2 subtract mu 2 and divide by 5 minus zero again the max 5 minus the Min which is 0 the mean normalize X2 now ranges from negative 0.46 to 0.54 so if you plot the training data using the mean normalized X1 and X2 it might look like this one last common rescaling method called z-score normalization to implement z-score normalization you need to calculate something called the standard deviation of each feature if you don't know what the standard deviation is don't worry about it you won't need to know it for this class or if you've heard of the normal distribution or the bell shaped curve sometimes also called the gaussian distribution this is what the standard deviation for the normal distribution looks like but if you haven't heard of this you don't need to worry about that either but if you do know what is the standard deviation then to implement a z-score normalization you first calculate the mean mu as well as the standard deviation which is often denoted by the lowercase Greek alphabet Sigma of each feature so for instance maybe feature one has a standard deviation of 450 and mean 600 then to Z score normalize X1 take each X1 subtract mu1 and then divide by the standard deviation which I'm going to denote as Sigma 1. 
and what you might find is that the z-score normalized X1 now ranges from negative 0.67 to 3.1 similarly if you calculate the second feature's standard deviation to be 1.4 and mean to be 2.3 then you can compute X2 minus mu 2 divided by Sigma 2 and in this case the z-score normalized by X2 might now range from negative 1.6 to 1.9 so if you plot the training data on the normalized X1 and X2 on a graph it might look like this as a rule of thumb when performing feature scaling you might want to aim for getting the features to range from maybe anywhere around negative one to somewhere around plus one for each feature X but these values negative one and plus one can be a little bit loose so if the features range from negative three to plus 3 or negative 0.3 to plus 0.3 all of these are completely okay so if you have a feature X1 that winds up being between 0 and 3 that's not a problem and you can rescale it if you want but if you don't rescale it it should work okay too or if you have a different feature X2 whose values are between negative 2 and plus 0.5 again that's okay know how I'm rescaling it but it might be okay if you leave it alone as well but if another feature like X3 here ranges from negative 100 to plus 100 then this takes on a very different range of values than something from around negative one to plus one so you're probably better off rescaling this feature X3 so that the ranges from something closer to negative one to plus one similarly if you have a feature X4 that takes on really small values say between negative 0.001 and plus 0.001 then these values are so small that means you may want to rescale it as well finally what if your feature X5 such as measurements of a Hospital patient's body temperature ranges from 98.6 to 105 degrees Fahrenheit in this case these values are around 100 which is actually pretty large compared to other scale features and this will actually cause gradient descents to run more slowly so in this case feature rescaling will likely help there's almost never any harm to carrying out feature rescaling so when in doubt I encourage you to just carry it out and that's it for feature scaling with this little technique you'll often be able to get gradient descent to run much faster so that's feature scaling and with or without feature scaling when you run gradient descent how can you know how do you check if gradient descents is really working if it is finding you the global minimum or something close to it in the next video let's take a look at how to recognize if gradient descent is converging and then in the video after that this will lead to discussion of how to choose a good learning rate for gradient descentlet's look at how you can Implement feature scaling to take features that take on very different ranges of values and scale them to have comparable ranges of value to each other so how do you actually scale features well if X1 ranges from three to two thousand one way to get the scale version of X1 is to take each original X1 value and divide by 2000 the maximum of the range so the scale of X1 will range from 0.15 up to one similarly since X2 ranges from 0 to 5 you can calculate a scale version of X2 by taking each original X2 and dividing by 5 which is again the maximum so the scale X2 will now range from 0 to 1. 
so if you plot the scaled X1 and X2 on the graph it might look like this in addition to dividing by the maximum you can also do what's called mean normalization so what this looks like is you started the original features and then you rescale them so that both of them are centered around zero so whereas before they only had values greater than zero now they have both negative and positive values but maybe usually between negative one and plus one so to calculate the mean normalization of X1 first find the average also called the mean of X1 on your training set and let's call this mean new one with this being the Greek alphabet mu for example you may find that the average of feature one mu 1 is 600 square feet so let's take each X1 subtract the mean mu 1 and then let's divide by the difference 2000 minus 300 where 2000 is the maximum and 300 the minimum and if you do this you get the normalized X1 to range from negative 0.18 to 0.82 similarly to mean normalize X2 you can calculate the average of feature 2 and for instance mu 2 may be 2.3 then you can take each X2 subtract mu 2 and divide by 5 minus zero again the max 5 minus the Min which is 0 the mean normalize X2 now ranges from negative 0.46 to 0.54 so if you plot the training data using the mean normalized X1 and X2 it might look like this one last common rescaling method called z-score normalization to implement z-score normalization you need to calculate something called the standard deviation of each feature if you don't know what the standard deviation is don't worry about it you won't need to know it for this class or if you've heard of the normal distribution or the bell shaped curve sometimes also called the gaussian distribution this is what the standard deviation for the normal distribution looks like but if you haven't heard of this you don't need to worry about that either but if you do know what is the standard deviation then to implement a z-score normalization you first calculate the mean mu as well as the standard deviation which is often denoted by the lowercase Greek alphabet Sigma of each feature so for instance maybe feature one has a standard deviation of 450 and mean 600 then to Z score normalize X1 take each X1 subtract mu1 and then divide by the standard deviation which I'm going to denote as Sigma 1. 
and what you might find is that the z-score normalized X1 now ranges from negative 0.67 to 3.1 similarly if you calculate the second feature's standard deviation to be 1.4 and mean to be 2.3 then you can compute X2 minus mu 2 divided by Sigma 2 and in this case the z-score normalized by X2 might now range from negative 1.6 to 1.9 so if you plot the training data on the normalized X1 and X2 on a graph it might look like this as a rule of thumb when performing feature scaling you might want to aim for getting the features to range from maybe anywhere around negative one to somewhere around plus one for each feature X but these values negative one and plus one can be a little bit loose so if the features range from negative three to plus 3 or negative 0.3 to plus 0.3 all of these are completely okay so if you have a feature X1 that winds up being between 0 and 3 that's not a problem and you can rescale it if you want but if you don't rescale it it should work okay too or if you have a different feature X2 whose values are between negative 2 and plus 0.5 again that's okay know how I'm rescaling it but it might be okay if you leave it alone as well but if another feature like X3 here ranges from negative 100 to plus 100 then this takes on a very different range of values than something from around negative one to plus one so you're probably better off rescaling this feature X3 so that the ranges from something closer to negative one to plus one similarly if you have a feature X4 that takes on really small values say between negative 0.001 and plus 0.001 then these values are so small that means you may want to rescale it as well finally what if your feature X5 such as measurements of a Hospital patient's body temperature ranges from 98.6 to 105 degrees Fahrenheit in this case these values are around 100 which is actually pretty large compared to other scale features and this will actually cause gradient descents to run more slowly so in this case feature rescaling will likely help there's almost never any harm to carrying out feature rescaling so when in doubt I encourage you to just carry it out and that's it for feature scaling with this little technique you'll often be able to get gradient descent to run much faster so that's feature scaling and with or without feature scaling when you run gradient descent how can you know how do you check if gradient descents is really working if it is finding you the global minimum or something close to it in the next video let's take a look at how to recognize if gradient descent is converging and then in the video after that this will lead to discussion of how to choose a good learning rate for gradient descent\n"