Normalizing Activations in a Network (C2W3L04)

Implementing Batch Normalization in Neural Networks
=====================================================

Batch normalization is a technique that normalizes the mean and variance of the values flowing through a neural network, both the input features and the hidden units' values. This helps improve the stability and speed of training, especially during the early stages of learning. In this article, we will explore how batch normalization works and how it can be implemented in a neural network.

Computing Mean and Variance
-----------------------------

Batch normalization starts by computing, for each feature, the mean of its values over the examples in the current mini-batch. The mean is calculated as follows:

`mean_x = (1 / m) * sum(x_i)`

where `m` is the number of examples in the mini-batch and `x_i` is the feature's value for the `i`-th example.

The variance is computed with the analogous formula:

`variance_x = (1 / m) * sum((x_i - mean_x)^2)`

Note that the small constant `epsilon` is not part of the variance itself; it is added to the variance inside the square root of the normalization step below, to prevent division by zero when the variance is close to zero.
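
As a concrete illustration, here is a minimal NumPy sketch of this step. The mini-batch `X` and its values are made up for the example; the computation is simply the per-feature mean and variance over the batch dimension.

```python
import numpy as np

# Hypothetical mini-batch: m = 4 examples, each with 3 features.
X = np.array([[1.0, 200.0, 0.01],
              [2.0, 180.0, 0.03],
              [3.0, 220.0, 0.02],
              [4.0, 210.0, 0.04]])

m = X.shape[0]

# Per-feature mean and variance, computed over the mini-batch (axis 0).
mean_x = X.sum(axis=0) / m
variance_x = ((X - mean_x) ** 2).sum(axis=0) / m  # same result as X.var(axis=0)
```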

Normalizing Feature Values
-----------------------------

With the mean and variance in hand, we normalize each feature value by subtracting the mean and dividing by the standard deviation. This ensures that, across the mini-batch, each feature has a mean of 0 and a variance of 1.

`x_i_normalized = (x_i - mean_x) / sqrt(variance_x + epsilon)`

This normalization step stabilizes and speeds up training by putting all features on a comparable scale: it turns an elongated optimization landscape into a rounder one that gradient descent can traverse more easily.
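
As a hedged sketch, the whole feature-normalization step can be packaged as a small function; the function name and the default value of `epsilon` are assumptions chosen for illustration.

```python
import numpy as np

def normalize_features(X, epsilon=1e-8):
    """Normalize each feature (column) of X to mean 0, variance 1 over the mini-batch."""
    mean_x = X.mean(axis=0)
    variance_x = X.var(axis=0)
    return (X - mean_x) / np.sqrt(variance_x + epsilon)

# After normalization, each column has mean ~0 and variance ~1.
X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(4, 3))
X_norm = normalize_features(X)
print(X_norm.mean(axis=0), X_norm.var(axis=0))
```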

Applying Batch Normalization to Hidden Units
---------------------------------------------

Batch normalization is most often applied to the hidden layers, and in practice it is the pre-activation values `Z_i` (the values before the activation function is applied) that are normalized. For each hidden unit, we compute the mean and variance of its pre-activation values over the `m` examples in the mini-batch:

`mean_Z = (1 / m) * sum(Z_i)`

`variance_Z = (1 / m) * sum((Z_i - mean_Z)^2)`

We then normalize each value by subtracting the mean and dividing by the standard deviation, again adding `epsilon` for numerical stability:

`Z_i_normalized = (Z_i - mean_Z) / sqrt(variance_Z + epsilon)`
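
To make the placement concrete, the sketch below normalizes a hidden layer's pre-activation values before a sigmoid activation is applied. The layer sizes and the names `sigmoid`, `A_prev`, `W`, and `b` are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

epsilon = 1e-8
rng = np.random.default_rng(0)

# Hypothetical hidden layer: 5 units, mini-batch of 4 examples, 3 inputs per example.
A_prev = rng.normal(size=(3, 4))   # activations from the previous layer
W = rng.normal(size=(5, 3)) * 0.1
b = np.zeros((5, 1))

Z = W @ A_prev + b                 # pre-activation values, shape (5, 4)

# Normalize each unit's pre-activations across the mini-batch (axis 1).
mean_Z = Z.mean(axis=1, keepdims=True)
variance_Z = Z.var(axis=1, keepdims=True)
Z_normalized = (Z - mean_Z) / np.sqrt(variance_Z + epsilon)

A = sigmoid(Z_normalized)          # activation applied to the normalized values
```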

Forcing every hidden unit to a mean of 0 and a variance of 1 is not always desirable, however. With a sigmoid activation, for example, values clustered around 0 stay in the nearly linear part of the curve, so a larger variance or a different mean may make better use of the non-linearity. This is why batch normalization introduces the learnable parameters described next.

Using Gamma and Beta Parameters
--------------------------------

Batch normalization introduces two learnable parameters per hidden unit, `gamma` and `beta`, which control the scale and the mean of the normalized values: `gamma` rescales `Z_i_normalized`, and `beta` shifts it.

`Z_i_tilde = gamma * Z_i_normalized + beta`

`gamma` and `beta` are learned during training with the same optimizer used for the weights, such as gradient descent, momentum, RMSprop, or Adam. In particular, setting `gamma = sqrt(variance_Z + epsilon)` and `beta = mean_Z` would exactly undo the normalization, so the network can recover the original distribution of `Z_i` if that is what works best. By adjusting these parameters, the network can give each hidden unit whatever mean and variance suit it, rather than being locked to a mean of 0 and a variance of 1.
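
Putting the pieces together, here is a minimal sketch of the batch-norm forward computation with learnable `gamma` and `beta`. The shapes and the initialization of `gamma` to ones and `beta` to zeros are common choices assumed here for illustration.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    """Batch-normalize pre-activations Z of shape (units, batch_size).

    gamma and beta have shape (units, 1) and are learned along with the
    network's weights, e.g. by gradient descent.
    """
    mean_Z = Z.mean(axis=1, keepdims=True)
    variance_Z = Z.var(axis=1, keepdims=True)
    Z_normalized = (Z - mean_Z) / np.sqrt(variance_Z + epsilon)
    return gamma * Z_normalized + beta

# Example: a hypothetical layer of 5 units and a mini-batch of 4 examples.
rng = np.random.default_rng(1)
Z = rng.normal(size=(5, 4))
gamma = np.ones((5, 1))   # scale, initialized to 1
beta = np.zeros((5, 1))   # shift, initialized to 0
Z_tilde = batch_norm_forward(Z, gamma, beta)
```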

Conclusion
----------

Batch normalization is a powerful technique for improving the stability and speed of training in neural networks. By normalizing the input features and the hidden units' pre-activation values, it puts values on a comparable scale and improves the convergence of the learning process. Understanding how batch normalization works and how to implement it is essential for building effective deep neural networks.

In the next article, we will explore how to fit batch normalization into a neural network, including how to handle multiple layers and activation functions.