Why Regularization Reduces Overfitting (C2W1L05)

The Effects of Hidden Units in Neural Networks

A key component of a neural network is its hidden layers, which consist of hidden units. These units largely determine the network's capacity, so their effect on performance can be significant: with many hidden units, the network can overfit, becoming too specialized to the training data and failing to generalize to new, unseen data.

To build intuition, consider a simplified example: a network with two input features, two hidden units, and one output unit, using sigmoid activations. Without the hidden layer, the network reduces to a logistic regression model, which can only learn a linear decision boundary between the inputs and the output.
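As a rough sketch of this setup (the layer sizes, random weights, and function names below are purely illustrative, not code from the course), the forward pass of such a 2-2-1 network might look like this; dropping the hidden layer leaves a single sigmoid unit, i.e. logistic regression.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative 2-2-1 network: 2 inputs -> 2 hidden units -> 1 output.
    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((2, 2)), np.zeros((2, 1))   # hidden layer
    W2, b2 = rng.standard_normal((1, 2)), np.zeros((1, 1))   # output layer

    def forward(x):
        a1 = sigmoid(W1 @ x + b1)        # hidden activations
        return sigmoid(W2 @ a1 + b2)     # network output

    def logistic_regression(x, w, b):
        # With no hidden layer, the model collapses to a single sigmoid unit.
        return sigmoid(w @ x + b)

    x = np.array([[0.5], [-1.2]])        # one example with two features
    print(forward(x))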

Adding hidden units makes the network more expressive and powerful, but if there are too many of them, or their weights grow too large, the extra capacity is spent memorizing the training data rather than learning patterns that generalize.

To mitigate overfitting, regularization techniques are used. One common technique is L2 regularization, also known as weight decay. The idea is to add a penalty term to the cost function that discourages large weights; the penalty is proportional to the sum of the squared weights.

Mathematically, the cost function with L2 regularization can be written as:

J_regularized(W, b) = J(W, b) + (λ / 2m) * Σ_l ||W^[l]||_F^2

where J(W, b) is the original cost function without regularization, W^[l] is the weight matrix of layer l, ||W^[l]||_F^2 is its squared Frobenius norm (the sum of its squared entries), m is the number of training examples, and λ is the regularization parameter. The bias terms are usually not regularized, since they account for relatively few parameters.
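A minimal sketch of how this regularized cost could be computed, assuming you already have the unregularized cost value and a list of per-layer weight matrices (the function and argument names here are hypothetical):

    import numpy as np

    def l2_regularized_cost(unregularized_cost, weights, lambd, m):
        """Add the L2 (Frobenius-norm) penalty to an existing cost value.

        unregularized_cost: J(W, b) computed from the losses alone
        weights: list of per-layer weight matrices W[l]
        lambd:   regularization parameter lambda
        m:       number of training examples
        """
        penalty = sum(np.sum(np.square(W)) for W in weights)
        return unregularized_cost + (lambd / (2 * m)) * penalty

    # Example with two small weight matrices and a made-up base cost.
    weights = [np.ones((3, 2)), np.ones((1, 3))]
    print(l2_regularized_cost(0.7, weights, lambd=0.1, m=100))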

The strength of this penalty is controlled by the regularization parameter lambda (λ). When λ is very small, the penalty has little effect on the optimization; when λ is very large, the penalty dominates and the weights are driven close to zero.

The intuition behind L2 regularization is that it reduces the impact of noise in the data by penalizing large weights. This is because large weights can amplify the effects of random fluctuations in the data, leading to overfitting. By reducing the magnitude of these weights, L2 regularization helps to regularize the network and prevent overfitting.

Another way to think about L2 regularization is in terms of network capacity. If λ is cranked up to be very large, the weights are pushed close to zero, which is roughly like zeroing out many hidden units; the network then behaves much more like a small, nearly logistic-regression-sized model. In practice the hidden units are not actually removed: they are all still used, but each has a much smaller effect, so the network acts as if it were simpler and is therefore less prone to overfitting. A related intuition applies with a tanh activation function: when the weights W are small, each unit's pre-activation z = Wa + b stays in a small range where tanh is roughly linear, so every layer computes something close to a linear function. The whole network is then close to one big linear function, which cannot fit very complicated, highly nonlinear decision boundaries, and so it is much less able to overfit.
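To see the tanh point numerically, the tiny check below (purely illustrative) compares tanh(z) with z itself: for small z the two nearly coincide (the linear regime), while for larger z tanh saturates.

    import numpy as np

    # For small z, tanh(z) is close to z (the roughly linear regime);
    # for larger z the activation saturates and becomes clearly nonlinear.
    for z in [0.05, 0.5, 3.0]:
        print(f"z={z:4.2f}  tanh(z)={np.tanh(z):.4f}  |tanh(z) - z|={abs(np.tanh(z) - z):.4f}")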

In practice, L2 regularization can have a significant impact on the performance of neural networks. When implemented correctly, it can help to prevent overfitting and improve the generalization ability of the network. However, if not used carefully, L2 regularization can also lead to underfitting, where the network fails to capture important patterns in the data.

Regularization Techniques

There are several regularization techniques that can be used to prevent overfitting in neural networks. One common technique is dropout regularization, which involves randomly setting a fraction of the neurons in the network to zero during training. This helps to reduce overfitting by preventing the network from relying too heavily on any single neuron or group of neurons.
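One common way to implement this is "inverted dropout", sketched below for a single layer's activations (keep_prob and the other names are illustrative, not a specific library's API):

    import numpy as np

    def inverted_dropout(a, keep_prob, rng):
        """Randomly zero a fraction (1 - keep_prob) of activations during training.

        Dividing by keep_prob keeps the expected value of the activations
        unchanged, so no extra scaling is needed at test time.
        """
        mask = rng.random(a.shape) < keep_prob   # True with probability keep_prob
        return (a * mask) / keep_prob

    rng = np.random.default_rng(1)
    a = np.ones((4, 5))                          # one layer's activations, 5 examples
    print(inverted_dropout(a, keep_prob=0.8, rng=rng))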

The other technique described above, L2 regularization, is often called weight decay because the penalty term adds (λ/m)·W to each layer's gradient, so every gradient descent step multiplies the weights by a factor slightly less than one, shrinking them toward zero.
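A hedged sketch of that update rule for a single weight matrix (the names are hypothetical, and the backprop gradient is faked as zero just to make the decay visible):

    import numpy as np

    def update_with_weight_decay(W, dW_from_backprop, lambd, m, learning_rate):
        """One gradient-descent step with the L2 penalty ("weight decay")."""
        dW = dW_from_backprop + (lambd / m) * W   # regularization term added to the gradient
        return W - learning_rate * dW             # shrinks W by the factor (1 - learning_rate * lambd / m)

    W = np.full((2, 2), 1.0)
    dW = np.zeros((2, 2))                         # pretend the data gradient is zero
    print(update_with_weight_decay(W, dW, lambd=0.7, m=10, learning_rate=0.1))   # each weight decays to 0.993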

Regularization techniques can be used in combination with other techniques, such as batch normalization and data augmentation, to further improve the performance of neural networks.

Implementing Regularization

When implementing regularization in a neural network, it's essential to understand how each technique works and to tune its hyperparameters carefully, because regularization directly trades variance against bias and can therefore have a significant impact on the network's performance.

In the case of L2 regularization, for example, the regularization parameter lambda (λ) needs to be tuned carefully. If λ is too small, the penalty term may not have enough effect on the optimization process, and the network may still overfit. On the other hand, if λ is too large, the weights may be reduced too much, leading to underfitting.
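In practice λ is usually chosen by training with a few candidate values and comparing error on a held-out dev set. The sketch below is schematic: train_model and dev_error are dummy stand-ins for whatever training and evaluation code you already have.

    # Schematic lambda sweep. train_model and dev_error are placeholders, not real APIs.
    def train_model(lambd):
        return {"lambda": lambd}                  # stand-in for a trained model

    def dev_error(model):
        return abs(model["lambda"] - 0.1)         # stand-in error curve with a minimum at 0.1

    candidate_lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
    errors = {lambd: dev_error(train_model(lambd)) for lambd in candidate_lambdas}
    best_lambda = min(errors, key=errors.get)     # pick the lambda with the lowest dev error
    print(best_lambda)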

One practical debugging tip: when you plot the cost function J against the number of gradient descent iterations to check that it decreases monotonically, remember that with regularization J has a new definition that includes the penalty term. If you plot only the original, unregularized term, the curve may not decrease on every iteration, even though optimization is working correctly.
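A small sketch of that check, assuming you have recorded the regularized cost at each iteration (the cost values below are placeholders):

    import matplotlib.pyplot as plt

    # Assume costs[i] is the *regularized* cost J (losses + L2 penalty) at iteration i.
    costs = [1.0 / (1 + 0.05 * i) for i in range(200)]   # placeholder values for illustration

    # The regularized J should decrease monotonically if gradient descent is working.
    assert all(b <= a for a, b in zip(costs, costs[1:])), "J increased between iterations"

    plt.plot(costs)
    plt.xlabel("gradient descent iteration")
    plt.ylabel("regularized cost J")
    plt.show()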

Conclusion

Regularization techniques are essential tools in deep learning that can help prevent overfitting and improve the generalization ability of neural networks. L2 regularization is one common technique that involves adding a penalty term to the cost function that discourages large weights. By understanding how L2 regularization works and how to use it effectively, you can implement this technique in your own neural network projects.

Dropout regularization, which randomly zeroes a fraction of the hidden units during training, can be used alongside L2 regularization; it further discourages the network from relying too heavily on any single unit and is covered in the next video.

By understanding how regularization techniques work and how to use them effectively, you can build more robust and generalizable neural networks that perform well on a variety of tasks.

"WEBVTTKind: captionsLanguage: enwhy does regularization help with overfitting why does it help with reducing variance problems let's go through a couple examples to gain some intuition about how it works so recall that our high bias high variance and right just right pictures from earlier video I look something like this now let's be a fitting a large and deep neural network I know I haven't drawn this one too large or too deep but let's do things on your network and is currently overfitting so you have some cost function right J of W B equals some of the losses like so all right and so what we do for a regularization was add this extra term that penalizes the weight matrices from being to launch we said that was a for being a small so why is it that shrinking the l2 norm or the Frobenius norm of the parameters might cause less overfitting one piece of intuition is that if you you know crank the regularization lambda to be really really base they'll be really incentivized to set the weight matrices W to be reasonably close to zero so one piece of inversion is maybe set the ways to be so close to zero for a lot of hidden units there's basically fevering out along the impact of these hidden units an adapter case then you know this much simplified new network becomes a much smaller neural network in fact it's almost like the logistic regression Union you know bin stack multiple layers B and so that will take you from this overfitting case much closer to the left towards a high bias case but hopefully there'll be an intermediate value of lambda the results in the result closer to this just right case in the middle but the intuition is that by cranking up lambda to be really big it will set W close to zero which in practice this isn't actually what happens the can think of it as zeroing out or at least reducing the impacted law the hidden units so you end up with what might feel like a simpler network like this closer and closer to as if you were just using logistic progression the intuition of completely zeroing out a bunch of hidden units isn't quite right it turns out that what actually happens and it will still use all the hidden units but each of them will just have a much smaller effect but you do end up with a simple network and as if you have a smaller network that is therefore less prone to overfitting so I'm not sure this intuition helps but when you implement regularization in the primary exercise you actually see some of these variance reduction results yourself here's another attempt at additional intuition for why regularization helps prevent overfitting and for this I'm going to assume that we're using the 10h activation function which looks like this right so there's a G of Z equals 10 H of Z so if that's the case notice that so long as Z is quite small so the Z takes on only a smallish range of parameters maybe around here then you're just using the linear regime of the Technische function there's only a Z is allowed to wonder up you know to larger values or smaller values like so that the activation function starts to become less linear so the intuition you might take away from this is that it launder the regularization parameter is launched then you have that your parameters will be relatively small because they are penalized to be large in the cost function and so the weights W are small then because Z is equal to w right and then technically plus B or but if W tells you very small then Z will also be low to be small and in particular is Z ends up taking relatively small 
values just invicible range then G of Z will be roughly linear so it's as if every layer will be roughly linear as it is just linear regression and we saw on course one that if every layer is linear then your whole network is just a linear network and so even a very deep network but a deep network where the linear activation function is at the end they only able to compute the linear function so it's not able to you know fit those very very complicated decisions very nonlinear decision boundaries that allow it to you know really a over fit right the datasets like we saw on the overfitting high variance case on the previous slide so just to summarize um if the regularization term is very large the parameters W very small so Z will be relatively small kind of ignoring the effect would be for now but so Z is relatively so Z be relatively small or really should say it takes on a small range of values and so the activation function this chain HCA will be relatively linear and so your whole neural network will be computing something not too far from a big linear function which is therefore a pretty simple function about in a very complex highly nonlinear function and so it's also much less able to open it and again when you implement regularization for yourself in the current exercise you'll be able to see some of these effects yourself before wrapping up our discussion on regularization I just want to give you one implementational tip which is that when influencing regularization we took our definition of the cost function J and we actually modified it by adding this extra term that penalizes the waste being too large and so if you implement gradient descent one of the steps to debug gradient descent is to plot the cost function J as a function of the number of iterations of gradient descent and you want to see that the cost function J your decreases monotonically after every iteration of gradient and if you're implementing regularization then please remember that J now has this new definition if you plot the old definition of J just this first term then you might not see a decrease monotonically so to divide gradients and make sure you're plotting you know this new definition of J that includes this second term as well otherwise you might not see J decrease monotonically on every single iteration so that's it for l2 regularization which is actually a regularization technique that I use the most in training people learning models in deep learning does another sometimes use regularization techniques called drop out regularization let's take a look at that in the next videowhy does regularization help with overfitting why does it help with reducing variance problems let's go through a couple examples to gain some intuition about how it works so recall that our high bias high variance and right just right pictures from earlier video I look something like this now let's be a fitting a large and deep neural network I know I haven't drawn this one too large or too deep but let's do things on your network and is currently overfitting so you have some cost function right J of W B equals some of the losses like so all right and so what we do for a regularization was add this extra term that penalizes the weight matrices from being to launch we said that was a for being a small so why is it that shrinking the l2 norm or the Frobenius norm of the parameters might cause less overfitting one piece of intuition is that if you you know crank the regularization lambda to be really really base they'll be really 
incentivized to set the weight matrices W to be reasonably close to zero so one piece of inversion is maybe set the ways to be so close to zero for a lot of hidden units there's basically fevering out along the impact of these hidden units an adapter case then you know this much simplified new network becomes a much smaller neural network in fact it's almost like the logistic regression Union you know bin stack multiple layers B and so that will take you from this overfitting case much closer to the left towards a high bias case but hopefully there'll be an intermediate value of lambda the results in the result closer to this just right case in the middle but the intuition is that by cranking up lambda to be really big it will set W close to zero which in practice this isn't actually what happens the can think of it as zeroing out or at least reducing the impacted law the hidden units so you end up with what might feel like a simpler network like this closer and closer to as if you were just using logistic progression the intuition of completely zeroing out a bunch of hidden units isn't quite right it turns out that what actually happens and it will still use all the hidden units but each of them will just have a much smaller effect but you do end up with a simple network and as if you have a smaller network that is therefore less prone to overfitting so I'm not sure this intuition helps but when you implement regularization in the primary exercise you actually see some of these variance reduction results yourself here's another attempt at additional intuition for why regularization helps prevent overfitting and for this I'm going to assume that we're using the 10h activation function which looks like this right so there's a G of Z equals 10 H of Z so if that's the case notice that so long as Z is quite small so the Z takes on only a smallish range of parameters maybe around here then you're just using the linear regime of the Technische function there's only a Z is allowed to wonder up you know to larger values or smaller values like so that the activation function starts to become less linear so the intuition you might take away from this is that it launder the regularization parameter is launched then you have that your parameters will be relatively small because they are penalized to be large in the cost function and so the weights W are small then because Z is equal to w right and then technically plus B or but if W tells you very small then Z will also be low to be small and in particular is Z ends up taking relatively small values just invicible range then G of Z will be roughly linear so it's as if every layer will be roughly linear as it is just linear regression and we saw on course one that if every layer is linear then your whole network is just a linear network and so even a very deep network but a deep network where the linear activation function is at the end they only able to compute the linear function so it's not able to you know fit those very very complicated decisions very nonlinear decision boundaries that allow it to you know really a over fit right the datasets like we saw on the overfitting high variance case on the previous slide so just to summarize um if the regularization term is very large the parameters W very small so Z will be relatively small kind of ignoring the effect would be for now but so Z is relatively so Z be relatively small or really should say it takes on a small range of values and so the activation function this chain HCA will be relatively 
linear and so your whole neural network will be computing something not too far from a big linear function which is therefore a pretty simple function about in a very complex highly nonlinear function and so it's also much less able to open it and again when you implement regularization for yourself in the current exercise you'll be able to see some of these effects yourself before wrapping up our discussion on regularization I just want to give you one implementational tip which is that when influencing regularization we took our definition of the cost function J and we actually modified it by adding this extra term that penalizes the waste being too large and so if you implement gradient descent one of the steps to debug gradient descent is to plot the cost function J as a function of the number of iterations of gradient descent and you want to see that the cost function J your decreases monotonically after every iteration of gradient and if you're implementing regularization then please remember that J now has this new definition if you plot the old definition of J just this first term then you might not see a decrease monotonically so to divide gradients and make sure you're plotting you know this new definition of J that includes this second term as well otherwise you might not see J decrease monotonically on every single iteration so that's it for l2 regularization which is actually a regularization technique that I use the most in training people learning models in deep learning does another sometimes use regularization techniques called drop out regularization let's take a look at that in the next video\n"