Why Does Batch Norm Work (C2W3L06)

The Effect of Batch Normalization on Neural Networks

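Batch normalization is a deep learning technique that normalizes the values feeding each layer of a neural network (the hidden unit values, not just the input features), using the mean and variance of each mini-batch. Just as normalizing the input features X to mean 0 and variance 1 can speed up learning, doing the same for the hidden units helps training. It also has several further effects on how the network trains and on how it is used at test time.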

Batch Normalization Limits the Effect of Updates to Earlier Layers

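Consider the third hidden layer of a deep network. Its inputs are the values produced by the second layer, and these keep changing as the parameters of the earlier layers are updated, so the third layer suffers from covariate shift: the distribution of its inputs moves during training. Batch normalization reduces the amount by which updating the parameters in earlier layers can affect the distribution of values that the third layer sees. The exact values still change, but their mean and variance stay fixed at whatever the learned parameters beta and gamma dictate (for example, mean 0 and variance 1). The later layers therefore have firmer ground to stand on, because their inputs are more stable.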
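The stabilizing effect described above can be illustrated with a minimal sketch in plain Python. It assumes a single hidden unit and one mini-batch of 64 values; the function names, the Gaussian inputs, and the choice of gamma = 2 and beta = 1 are hypothetical and only serve the illustration.

```python
import math
import random


def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-8):
    """Normalize one hidden unit's values over a mini-batch, then scale and shift."""
    m = len(z)
    mu = sum(z) / m
    var = sum((v - mu) ** 2 for v in z) / m
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in z]


def mean_and_var(values):
    """Return the mean and (population) variance of a list of numbers."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var


rng = random.Random(0)
# Hypothetical values of one hidden unit before and after the earlier layers'
# parameters are updated: the raw distribution shifts noticeably.
z_before = [rng.gauss(0.5, 1.0) for _ in range(64)]
z_after = [rng.gauss(3.0, 4.0) for _ in range(64)]

# After batch norm, both mini-batches have mean beta and variance close to
# gamma squared, regardless of how the raw values shifted.
stats_before = mean_and_var(batch_norm(z_before, gamma=2.0, beta=1.0))
stats_after = mean_and_var(batch_norm(z_after, gamma=2.0, beta=1.0))
print(stats_before, stats_after)
```

Both printed pairs should be close to a mean of 1 and a variance of 4, which is the sense in which the next layer sees a more stable input distribution.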

Batch Normalization Reduces Coupling Between Layers

Batch normalization thereby weakens the coupling between what the earlier layers' parameters have to do and what the later layers' parameters have to do. Each layer can learn a little more independently of the others, which speeds up learning in the whole network.

Speeding Up Learning

Batch normalization speeds up learning by reducing how much changes in the earlier layers disturb the later layers. Because updates to the earlier layers shift the distribution of values seen by the third layer less, the later layers have less adapting to do as the earlier layers keep learning, which makes their job easier.

Batch Normalization as Regularization

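Batch normalization also has a slight regularization effect. Each mini-batch is normalized with a mean and variance computed on just that mini-batch (say 64, 128, or 256 examples) rather than on the entire dataset, so these estimates are slightly noisy. Like dropout, this adds some noise to each hidden layer's activations, which forces downstream hidden units not to rely too much on any one hidden unit. Because the noise is quite small, the regularization effect is much weaker than that of dropout.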

Dropout vs Batch Normalization

Dropout multiplies each hidden unit by 0 with some probability and by 1 otherwise, so its noise is multiplicative. Batch normalization has multiplicative noise too, because it scales by a noisy estimate of the standard deviation, and additive noise as well, because it subtracts a noisy estimate of the mean. The noise from batch normalization is small, so its regularization effect is slight. If you want the more powerful regularization of dropout, you can use the two together.
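The two kinds of noise can be written out side by side. The notation below is an assumption for illustration: the mini-batch is called B, its mean and variance are mu_B and sigma_B^2, the keep probability is p_keep, and epsilon is a small constant for numerical stability. The rescaling used by inverted dropout is left out for brevity.

```latex
% Dropout: each activation is multiplied by a random 0/1 mask (multiplicative noise).
\tilde{a}_i = d_i \, a_i, \qquad d_i \sim \mathrm{Bernoulli}(p_{\text{keep}})

% Batch norm: normalize with the statistics of the current mini-batch B,
% then scale and shift with the learned parameters gamma and beta.
z_{\text{norm}}^{(i)} = \frac{z^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad
\tilde{z}^{(i)} = \gamma \, z_{\text{norm}}^{(i)} + \beta
```

Because mu_B and sigma_B are estimated from one mini-batch, the subtraction of mu_B contributes additive noise and the division by the estimated standard deviation contributes multiplicative noise.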

Bigger Mini-Batch Sizes

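One slightly non-intuitive property of batch normalization is that a bigger mini-batch size reduces its regularization effect: with, say, 512 examples per mini-batch instead of 64, the mean and variance estimates are less noisy. For this reason, do not rely on batch normalization as a regularizer. Its purpose is to normalize hidden unit activations and so speed up learning; the regularization is a small, largely unintended side effect.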
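A small simulation makes the dependence on mini-batch size concrete. This is a sketch under stated assumptions: the hidden unit values are drawn from a standard normal distribution, and the function name and the number of simulated mini-batches are hypothetical.

```python
import random
import statistics


def spread_of_batch_means(batch_size, num_batches=200, seed=0):
    """Estimate how noisy the mini-batch mean is for a given batch size.

    Draws `num_batches` mini-batches from a standard normal distribution
    and returns the standard deviation of their means.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(num_batches):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(batch))
    return statistics.pstdev(means)


noise_64 = spread_of_batch_means(64)
noise_512 = spread_of_batch_means(512)
# The mean estimated from 512 examples fluctuates less than the mean
# estimated from 64 examples, so batch norm injects less noise.
print(noise_64, noise_512)
```

The spread of the mini-batch mean shrinks roughly with the square root of the batch size, which is why larger mini-batches weaken the regularization effect.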

Handling Data at Test Time

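Batch normalization processes data one mini-batch at a time, computing the mean and variance on each mini-batch. At test time you may not have a mini-batch at all and may be processing a single example at a time, so you need to do something slightly different to make sure your predictions make sense.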
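This section does not say what that adjustment is. One widely used approach, sketched here as an assumption rather than as the method of this lecture, is to keep running averages of the mini-batch mean and variance during training and to normalize single test examples with those averages. The class name, the momentum value, and the toy mini-batches are hypothetical.

```python
import math


class RunningBatchNormStats:
    """Tracks running estimates of the mean and variance of one hidden unit."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.mean = 0.0
        self.var = 1.0

    def update(self, batch):
        """Fold the statistics of one training mini-batch into the running averages."""
        mu = sum(batch) / len(batch)
        var = sum((v - mu) ** 2 for v in batch) / len(batch)
        self.mean = self.momentum * self.mean + (1 - self.momentum) * mu
        self.var = self.momentum * self.var + (1 - self.momentum) * var

    def normalize(self, z, gamma=1.0, beta=0.0, eps=1e-8):
        """Normalize a single test-time value using the running statistics."""
        return gamma * (z - self.mean) / math.sqrt(self.var + eps) + beta


stats = RunningBatchNormStats(momentum=0.9)
# During training: update the running statistics with each mini-batch.
for batch in ([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [0.0, 2.0, 4.0]):
    stats.update(batch)

# At test time: a single example is normalized without needing a mini-batch.
single_prediction_input = stats.normalize(2.5)
print(single_prediction_input)
```

The point of the design is that the test-time output for one example no longer depends on which other examples happen to be processed with it.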

Transcript

So why does batch norm work? Here's one reason. You've seen how normalizing the input features, the X's, to mean 0 and variance 1 can speed up learning. So rather than having some features that range from 0 to 1 and some from 1 to 1,000, by normalizing all the input features X to take on a similar range of values, that can speed up learning. So one intuition behind why batch norm works is that this is doing a similar thing, but for the values in your hidden units and not just for your input layer. Now, this is just a partial picture of what batch norm is doing. There are a couple of further intuitions that will help you gain a deeper understanding of what batch norm is doing. Let's take a look at those in this video.

A second reason why batch norm works is that it makes weights later or deeper in your network, say the weights in layer 10, more robust to changes to weights in earlier layers of the neural network, say in layer 1. To explain what I mean, let's look at this motivating example. Let's say you're training a network, maybe a shallow network like logistic regression or maybe a deeper network, on our famous cat detection task. But let's say that you've trained on a dataset of all images of black cats. If you now try to apply this network to data with colored cats, where the positive examples are not just black cats like on the left but colored cats like on the right, then your classifier might not do very well. So in pictures, if your training set looks like this, where you have positive examples here and negative examples here, but you were to try to generalize it to a dataset where the positive examples are here and the negative examples are here, then you might not expect a model trained on the data on the left to do very well on the data on the right. Even though there might be the same function that actually works well, you wouldn't expect your learning algorithm to discover that green decision boundary just by looking at the data on the left.

So this idea of your data distribution changing goes by the somewhat fancy name covariate shift. And the idea is that if you've learned some X to Y mapping, and the distribution of X changes, then you might need to retrain your learning algorithm. And this is true even if the ground truth function mapping from X to Y remains unchanged, which it does in this example, because the ground truth function is "is this picture a cat or not". And the need to retrain your function becomes even more acute, or becomes even worse, if the ground truth function shifts as well.

So how does this problem of covariate shift apply to a neural network? Consider a deep network like this, and let's look at the learning process from the perspective of this hidden layer, the third hidden layer. So this network has to learn the parameters W[3] and b[3]. And from the perspective of the third hidden layer, it gets some set of values from the earlier layers, and then it has to do some stuff to hopefully make the output Y hat close to the ground truth value Y. So let me cover up the nodes on the left for a second. From the perspective of this third hidden layer, it gets some values, let's call them a[2]_1, a[2]_2, a[2]_3, and a[2]_4. But these values might as well be features x1, x2, x3, x4, and the job of the third hidden layer is to take these values and find a way to map them to Y hat. So you can imagine doing gradient descent so that these parameters W[3], b[3], as well as maybe W[4], b[4], and even W[5], b[5], are trying to learn so that the network does a good job mapping from the values I drew in black on the left to the output values Y hat.

But now let's uncover the left of the network again. The network is also adapting parameters W[2], b[2] and W[1], b[1], and so as these parameters change, these values a[2] will also change. So from the perspective of the third hidden layer, these hidden unit values are changing all the time, and so it's suffering from the problem of covariate shift that we talked about on the previous slide.

So what batch norm does is it reduces the amount that the distribution of these hidden unit values shifts around. And if I were to plot the distribution of these hidden unit values, maybe this is technically the normalized z, so this is actually z[2]_1 and z[2]_2, and I'm going to plot two values instead of four values so we can visualize this in 2D. What batch norm is saying is that the values of z[2]_1 and z[2]_2 can change, and indeed they will change when the neural network updates the parameters in the earlier layers. But what batch norm ensures is that no matter how they change, the mean and variance of z[2]_1 and z[2]_2 will remain the same. So even if the exact values of z[2]_1 and z[2]_2 change, their mean and variance will at least stay the same, say mean 0 and variance 1. Or not necessarily mean 0 and variance 1, but whatever value is governed by beta[2] and gamma[2], which, if the neural network chooses, can force it to be mean 0 and variance 1, or really any other mean and variance.

But what this does is it limits the amount to which updating the parameters in the earlier layers can affect the distribution of values that the third layer now sees and therefore has to learn on. And so batch norm reduces the problem of the input values changing. It really causes these values to become more stable, so that the later layers of the neural network have more firm ground to stand on. And even though the input distribution changes a bit, it changes less, and what this does is, even as the earlier layers keep learning, the amount that this forces the later layers to adapt to the earlier layers' changes is reduced. Or, if you will, it weakens the coupling between what the earlier layers' parameters have to do and what the later layers' parameters have to do. And so it allows each layer of the network to learn by itself, a little bit more independently of other layers, and this has the effect of speeding up learning in the whole network.

So I hope this gives some better intuition, but the takeaway is that batch norm means that, especially from the perspective of one of the later layers of the neural network, the earlier layers don't get to shift around as much, because they're constrained to have the same mean and variance. And so this makes the job of learning in the later layers easier.

It turns out batch norm has a second effect: it has a slight regularization effect. So one non-intuitive thing about batch norm is that each mini-batch, say mini-batch X{t}, has the values z[l] scaled by the mean and variance computed on just that one mini-batch. Now, because the mean and variance are computed on just that mini-batch, as opposed to computed on the entire dataset, that mean and variance has a little bit of noise in it, because it's computed just on your mini-batch of, say, 64 or 128 or maybe 256 or larger training examples. So because the mean and variance are a little bit noisy, because they're estimated with just a relatively small sample of data, the scaling process going from z[l] to z tilde [l] is a little bit noisy as well, because it's computed using a slightly noisy mean and variance.

So, similar to dropout, it adds some noise to each hidden layer's activations. The way dropout adds noise is that it takes a hidden unit and multiplies it by zero with some probability and multiplies it by one with some probability. And so dropout has multiplicative noise, because it's multiplying by 0 or 1, whereas batch norm has multiplicative noise because of scaling by the standard deviation, as well as additive noise because it's subtracting the mean. Here the estimates of the mean and the standard deviation are noisy. And so, similar to dropout, batch norm therefore has a slight regularization effect, because by adding noise to the hidden units, it's forcing the downstream hidden units not to rely too much on any one hidden unit. And so, similar to dropout, this adds noise to the hidden layers and therefore has a very slight regularization effect. Because the noise added is quite small, this is not a huge regularization effect, and you might choose to use batch norm together with dropout if you want the more powerful regularization effect of dropout.

And maybe one other slightly non-intuitive effect is that if you use a bigger mini-batch size, so if you use a mini-batch size of, say, 512 instead of 64, by using a larger mini-batch size you're reducing this noise and therefore also reducing this regularization effect. So that's one strange property of batch norm, which is that by using a bigger mini-batch size, you reduce the regularization effect. Having said this, I wouldn't really use batch norm as a regularizer. That's really not the intent of batch norm, but sometimes it has this extra intended or unintended effect on your learning algorithm. But really, don't turn to batch norm as a regularizer. Use it as a way to normalize your hidden unit activations and therefore speed up learning, and I think the regularization is an almost unintended side effect.

So I hope that gives you better intuition about what batch norm is doing. Before we wrap up the discussion on batch norm, there's one more detail I want to make sure you know, which is that batch norm handles data one mini-batch at a time. It computes means and variances on mini-batches. So at test time, when you're trying to make predictions and evaluate the neural network, you might not have a mini-batch of examples. You might be processing one single example at a time. So at test time you need to do something slightly different to make sure your predictions make sense. In the next and final video on batch norm, let's talk over the details of what you need to do in order to take your neural network trained using batch norm and make predictions.