Weight Initialization in a Deep Network (C2W1L11)

The Problem of Vanishing and Exploding Gradients in Deep Neural Networks

To keep z from blowing up or shrinking toward zero, notice that the larger n is, the smaller you want each w_i to be. Since z is a sum of terms w_i * x_i, if you're adding up a lot of these terms, you want each one to be smaller. A reasonable thing to do is to set the variance of w_i equal to 1/n, where n is the number of input features going into the neuron. In practice, you can set the weight matrix W for a given layer l to `np.random.randn(shape)` scaled by the square root of 1 over n^[l-1], where n^[l-1] is the number of units feeding into each unit of layer l.
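As a minimal sketch of this (assuming NumPy; the function and variable names `initialize_layer`, `n_prev`, `n_curr` are my own, not from the lecture):

```python
import numpy as np

def initialize_layer(n_prev, n_curr, seed=0):
    """Initialize one layer's weights with variance 1/n_prev; biases at zero."""
    rng = np.random.default_rng(seed)
    # standard_normal draws Gaussians with variance 1; scaling by
    # sqrt(1/n_prev) gives each weight variance 1/n_prev.
    W = rng.standard_normal((n_curr, n_prev)) * np.sqrt(1.0 / n_prev)
    b = np.zeros((n_curr, 1))
    return W, b

W, b = initialize_layer(n_prev=4, n_curr=3)
print(W.shape, b.shape)  # (3, 4) (3, 1)
```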

It turns out that if you are using a ReLU activation function, then rather than 1/n, setting the variance of w_i to 2/n works a little better. If you're familiar with random variables: take a standard Gaussian random variable and multiply it by the square root of 2/n, and its variance becomes 2/n. The reason I went from n to the superscript n^[l-1] is that the single-neuron example, like logistic regression, had n input features, but in the more general case each unit of layer l has n^[l-1] inputs. So if the input features or activations are roughly mean 0 and variance 1, this scaling causes z to take on a similar scale.
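A quick empirical check of this claim (a sketch; the sizes `n` and `m` are arbitrary choices of mine): drawing standard Gaussian weights, scaling by sqrt(2/n), and summing against inputs of mean 0 and variance 1 gives z a variance close to 2.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000   # fan-in: number of inputs to the neuron
m = 20000  # number of random trials

# Inputs with mean 0, variance 1; weights scaled to variance 2/n.
x = rng.standard_normal((m, n))
w = rng.standard_normal((m, n)) * np.sqrt(2.0 / n)

z = np.sum(w * x, axis=1)  # z = sum_i w_i * x_i, one value per trial
print(z.var())             # close to 2.0
```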

This doesn't solve the problem completely, but it definitely helps reduce vanishing and exploding gradients, because it tries to set each weight matrix W so that it's not too much bigger than 1 and not too much less than 1, so activations don't explode or vanish too quickly. Let me mention some other variants of this method. The version we just described, with variance 2/n, assumes a ReLU activation function and comes from a paper by He et al. If you are using a tanh activation function, there's a paper showing that instead of the constant 2 it's better to use the constant 1, i.e., variance 1/n^[l-1], so you multiply by the square root of that instead.

That square-root term, sqrt(1/n^[l-1]), would replace sqrt(2/n^[l-1]) in the formula; you use it with a tanh activation function, and it's called Xavier initialization. Another version, due to Yoshua Bengio and his colleagues, which you might see in some papers, uses sqrt(2/(n^[l-1] + n^[l])); it has some theoretical justification as well. I would say: if you're using a ReLU activation function, which is really the most common activation function, use the 2/n formula; if you're using tanh, you could try the Xavier version instead, and some authors also use the Bengio variant. In practice, all of these formulas just give you a starting point, a default value for the variance of the initialization of your weight matrices.
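The three variance choices above can be summarized in one place (a sketch; the function name `init_weights` and the scheme labels are my own shorthand, not official API names):

```python
import numpy as np

def init_weights(n_in, n_out, scheme="he", rng=None):
    """Draw an (n_out, n_in) weight matrix under one of the three schemes."""
    rng = rng or np.random.default_rng()
    if scheme == "he":        # ReLU:  Var(w) = 2 / n_in          (He et al.)
        var = 2.0 / n_in
    elif scheme == "xavier":  # tanh:  Var(w) = 1 / n_in          (Xavier)
        var = 1.0 / n_in
    elif scheme == "bengio":  # tanh variant: Var(w) = 2 / (n_in + n_out)
        var = 2.0 / (n_in + n_out)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return rng.standard_normal((n_out, n_in)) * np.sqrt(var)

W = init_weights(256, 128, scheme="he")
print(W.shape)  # (128, 256)
```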

If you wish, this variance parameter could be another one of your hyperparameters: you can have a multiplier that goes into this formula and tune that multiplier as part of your hyperparameter search. Tuning it sometimes has a modest effect. It's not one of the first hyperparameters I would try to tune, but I've seen some problems where tuning it helps a reasonable amount; still, for me it's usually lower down in importance relative to the other hyperparameters you could tune. So I hope that gives you some intuition about the problem of vanishing and exploding gradients, and about how choosing a reasonable scale for your weight initialization keeps the weights from exploding too quickly or decaying to zero too quickly, so you can train a reasonably deep network without the weights or gradients exploding or vanishing too much.
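One way to treat that scale as a tunable hyperparameter (a sketch; the `gain` name is my own) is to fold a multiplier into the variance formula and sweep it during search:

```python
import numpy as np

def init_with_gain(n_in, n_out, gain=1.0, rng=None):
    """ReLU-style init with a tunable multiplier on the variance."""
    rng = rng or np.random.default_rng()
    # gain=1.0 recovers Var(w) = 2/n_in; the search sweeps other values.
    return rng.standard_normal((n_out, n_in)) * np.sqrt(gain * 2.0 / n_in)

for gain in (0.5, 1.0, 2.0):  # candidate values in a hyperparameter search
    W = init_with_gain(512, 256, gain=gain)
    print(gain, W.std())
```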

"WEBVTTKind: captionsLanguage: enin the last video you saw how very deep neural networks can have the problems of banishing and exploding gradients it turns out that a partial solution to this doesn't solve an entirely but host a lot is better or more careful choice of the random initialization for your neural network to understand this let's start with the example of initializing the weights for a single neuron and they will go on to generalize this to a deep network let's go through this with an example with just a single neuron and we'll talk about a deep manzanita so the single neuron you might input for features x1 3 X 4 and then you have some a equals G of Z and then it outputs um Y and later on for a definite you know these inputs will be very some layer AOL but for now let's just call this thanks for now so Z is going to be equal to W 1 X 1 plus W 2 X 2 plus dot dot plus I guess W n X n and and let's set B equals 0 so you know let's just ignore be for now so in order to make V not blow up and not become too small you notice that the larger n is the smaller you want WI to be right because E is a sum of WI X I and so if you're adding up a lot of these terms you want each of these terms to be smaller one reasonable thing to do would be to set the variance of WI to be equal to 1 over N where n is the number of input features as going into a neuron so in practice what you can do is set the weight matrix W for a certain layer to the NP thoughts random thought Rand n you know and then whatever the shape of the matrix is you fold this out here um and then times square root of 1 over number of features set into each neuron in their health is going to n of L minus 1 because that's the number of units that are feeding in to each of the units in layer now it turns out that if you are using a rare Lu activation function then rather than one over n it turns out that said military institutions were a little bit better so you often see that in initialization especially 
using a value activation function so if GL of the years value of V o and tavella how familiar you are with random variables it turns out that something is Gaussian random variable and then multiplying it by square root of this that says the variance to to be equal to this thing to be 2 over n okay and the reason I went from n to this n superscript L minus 1 was in this example with logistic regression which is that any input features but in more general case layer L would have an L minus 1 inputs each of the units and that layer so if the input features of activations are roughly mean 0 and standard variants and the earns 1 then this will cause thee to also be take on a similar scale and this doesn't solve but it definitely helps reduce the vanishing exploding gradients problem because those trying to set each of the weight matrix W you know so that is not too much bigger than 1 and not too much less than 1 so it doesn't explode or vanish too quickly I'll just mention some other variants the variation we just described is assuming a rare Lu activation function and it's by a paper by her at all a few other variants if you are using a ten-inch activation function then there's a paper that shows that instead of using the constant to steadily use the constant 1 and so 1 over this instead of 2 so you multiply by the square root of this so this square root term would replace this term and you use this if you're using a damaged activation function this is called savior initialization and another version work cellobiose evangelist colleagues we might see in some papers but is to use this formula which has some other server code justification but I would say if you're using a regular activation function which is really the most common activation function I would use this formula if you're using tannish you could try this version instead and some motors will also use this but in practice I think all of these formulas just give you a starting point which is your default value 
to use for the variant of the initialization of your weight matrices if you wish the variants here this variance parameter could be another thing that you could - or your hyper parameters so you can have another parameter that multiplies into this formula and tune that multiplier as tall your hyper parameter search sometimes the tuning the hyper parameter has a modern sized effect it's not one of the first type of parameters I would usually try to tune but I've also seen some problems where tuning this you know helps have some usable amount but this is usually lower down for me in terms of how important it is relative to the other high preferences you can tune so I hope that gives you some intuition about the problem of vanishing or exploding gradients as well as how choosing a reasonable scaling for how you initialize the weight hopefully that makes your weight your not explode too quickly and not be k20 too quickly so you can train a reasonably deep network without the weights or the gradients exploding or vanishing too much when you train deep networks this is another trick that will help you make your neural networks train much more quicklyin the last video you saw how very deep neural networks can have the problems of banishing and exploding gradients it turns out that a partial solution to this doesn't solve an entirely but host a lot is better or more careful choice of the random initialization for your neural network to understand this let's start with the example of initializing the weights for a single neuron and they will go on to generalize this to a deep network let's go through this with an example with just a single neuron and we'll talk about a deep manzanita so the single neuron you might input for features x1 3 X 4 and then you have some a equals G of Z and then it outputs um Y and later on for a definite you know these inputs will be very some layer AOL but for now let's just call this thanks for now so Z is going to be equal to W 1 X 1 plus W 2 
X 2 plus dot dot plus I guess W n X n and and let's set B equals 0 so you know let's just ignore be for now so in order to make V not blow up and not become too small you notice that the larger n is the smaller you want WI to be right because E is a sum of WI X I and so if you're adding up a lot of these terms you want each of these terms to be smaller one reasonable thing to do would be to set the variance of WI to be equal to 1 over N where n is the number of input features as going into a neuron so in practice what you can do is set the weight matrix W for a certain layer to the NP thoughts random thought Rand n you know and then whatever the shape of the matrix is you fold this out here um and then times square root of 1 over number of features set into each neuron in their health is going to n of L minus 1 because that's the number of units that are feeding in to each of the units in layer now it turns out that if you are using a rare Lu activation function then rather than one over n it turns out that said military institutions were a little bit better so you often see that in initialization especially using a value activation function so if GL of the years value of V o and tavella how familiar you are with random variables it turns out that something is Gaussian random variable and then multiplying it by square root of this that says the variance to to be equal to this thing to be 2 over n okay and the reason I went from n to this n superscript L minus 1 was in this example with logistic regression which is that any input features but in more general case layer L would have an L minus 1 inputs each of the units and that layer so if the input features of activations are roughly mean 0 and standard variants and the earns 1 then this will cause thee to also be take on a similar scale and this doesn't solve but it definitely helps reduce the vanishing exploding gradients problem because those trying to set each of the weight matrix W you know so that is not too 
much bigger than 1 and not too much less than 1 so it doesn't explode or vanish too quickly I'll just mention some other variants the variation we just described is assuming a rare Lu activation function and it's by a paper by her at all a few other variants if you are using a ten-inch activation function then there's a paper that shows that instead of using the constant to steadily use the constant 1 and so 1 over this instead of 2 so you multiply by the square root of this so this square root term would replace this term and you use this if you're using a damaged activation function this is called savior initialization and another version work cellobiose evangelist colleagues we might see in some papers but is to use this formula which has some other server code justification but I would say if you're using a regular activation function which is really the most common activation function I would use this formula if you're using tannish you could try this version instead and some motors will also use this but in practice I think all of these formulas just give you a starting point which is your default value to use for the variant of the initialization of your weight matrices if you wish the variants here this variance parameter could be another thing that you could - or your hyper parameters so you can have another parameter that multiplies into this formula and tune that multiplier as tall your hyper parameter search sometimes the tuning the hyper parameter has a modern sized effect it's not one of the first type of parameters I would usually try to tune but I've also seen some problems where tuning this you know helps have some usable amount but this is usually lower down for me in terms of how important it is relative to the other high preferences you can tune so I hope that gives you some intuition about the problem of vanishing or exploding gradients as well as how choosing a reasonable scaling for how you initialize the weight hopefully that makes your 
weight your not explode too quickly and not be k20 too quickly so you can train a reasonably deep network without the weights or the gradients exploding or vanishing too much when you train deep networks this is another trick that will help you make your neural networks train much more quickly\n"