Activation Functions (C1W3L06)

**Choosing Activation Functions for Neural Networks**

When building a neural network, one of the most important decisions you'll make is which activation function to use in the hidden layers and at the output unit. This choice can have a significant impact on how quickly and how well your network learns, and there are several options to consider.

**The Sigmoid Activation Function**

One of the most well-known activation functions is the sigmoid function, which squashes its input to a value between 0 and 1 via 1 / (1 + e^(-z)). In practice, though, it's rarely used except on the output layer of binary classification problems, where you want the prediction to be a number between 0 and 1. The trouble with the sigmoid in hidden layers is that for inputs that are very large or very small it saturates: its slope becomes close to zero, which slows down gradient descent. For hidden layers, the ReLU (Rectified Linear Unit) has instead become the default choice; when you're not sure what to use, ReLU is a safe bet.
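To make the saturation point concrete, here is a minimal NumPy sketch of the sigmoid and its slope (the function names are illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Slope of the sigmoid; note how it shrinks toward 0 for large |z|."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_derivative(np.array([-10.0, 0.0, 10.0])))
# ~[0.000045, 0.25, 0.000045] -- nearly flat at both extremes
```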

**The Default Choice: ReLU**

So why is ReLU the go-to choice for hidden layers? Its formula is simply a = max(0, z): when the input z is positive the output is z itself, and when z is negative the output is zero. The kink at zero is what makes the function non-linear, and the slope is 1 everywhere that z is positive, rather than flattening out the way the sigmoid does. Because the gradient stays well away from zero over a large part of the input range, gradient descent usually makes much faster progress than with saturating activations. Strictly speaking the derivative is undefined at exactly z = 0, but in practice you can treat it as either 0 or 1 and everything works fine. And although the slope is zero for negative inputs, enough hidden units receive positive inputs on most training examples that learning still proceeds quickly.
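Here is a minimal NumPy sketch of ReLU and the derivative convention just described (again, illustrative names rather than a library API):

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: a = max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    """Slope of ReLU: 1 for positive z, 0 for negative z.
    At exactly z == 0 the derivative is undefined; here we follow the
    common convention of treating it as 0."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))             # [0.   0.   0.   0.5  2. ]
print(relu_derivative(z))  # [0.   0.   0.   1.   1. ]
```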

**The Leaky ReLU Activation Function**

Some people use a variation of ReLU called the Leaky ReLU. Instead of setting the output to zero when the input is negative, it outputs a small multiple of the input, usually 0.01 * z, so the formula is a = max(0.01 * z, z). That slight negative slope means the gradient is never exactly zero, so units whose inputs are negative still receive a small learning signal. The constant 0.01 can even be treated as another learnable parameter, although in practice hardly anyone does that. Leaky ReLU often works a bit better than plain ReLU, but it is used less often; either one is usually fine.
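A corresponding sketch for Leaky ReLU, with the usual 0.01 slope exposed as a parameter (the name `alpha` is illustrative):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: a = max(alpha * z, z).
    For positive z this is just z; for negative z it is a small
    negative slope (alpha * z) instead of a hard zero."""
    return np.maximum(alpha * z, z)

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(z))  # [-0.03 -0.01  0.    1.    3.  ]
```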

**Other Activation Functions**

Another option for hidden layers is the hyperbolic tangent, tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)), which is essentially a shifted and rescaled sigmoid that outputs values between -1 and 1. Because its outputs are roughly zero-centered, it tends to make learning in the next layer a little easier, and for hidden units it almost always works better than the sigmoid. Like the sigmoid, though, it saturates for very large or very small inputs, which is why ReLU and Leaky ReLU remain the more common choice for hidden layers.
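For reference, a small sketch of tanh written out from its definition (in real code you would normally just call `np.tanh`, which is numerically more stable for large inputs):

```python
import numpy as np

def tanh(z):
    """Hyperbolic tangent: (e^z - e^-z) / (e^z + e^-z); outputs lie in (-1, 1)
    and are roughly zero-centered, unlike the sigmoid's (0, 1) range."""
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
print(tanh(z))  # ~[-0.9999  0.      0.9999] -- already saturated at |z| = 5
```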

**Choosing an Activation Function**

So how do you choose an activation function for your neural network? There's no one-size-fits-all answer, but here are some general guidelines:

* If you're doing binary classification, it's usually best to use the sigmoid function on the output layer.

* For hidden layers, ReLU or Leaky ReLU are good choices. Either usually works well; if you have to pick one, plain ReLU is the more common default (see the sketch after this list for how the per-layer choices fit together).
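To make the per-layer choice concrete, here is a minimal sketch of the forward pass of a one-hidden-layer binary classifier, using ReLU in the hidden layer and sigmoid at the output. The layer sizes and parameter names are illustrative, not taken from the course notebooks:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 3 input features, 4 hidden units, 1 output unit.
n_x, n_h, n_y = 3, 4, 1
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_y, n_h)) * 0.01
b2 = np.zeros((n_y, 1))

def forward(X):
    """Forward pass: ReLU for the hidden layer, sigmoid for the output layer."""
    Z1 = W1 @ X + b1
    A1 = relu(Z1)          # hidden-layer activation
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)       # output between 0 and 1 for binary classification
    return A2

X = rng.standard_normal((n_x, 5))  # 5 example inputs, one per column
print(forward(X).shape)            # (1, 5)
```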

**Testing Different Activation Functions**

One of the most effective ways to choose an activation function is to try the different options on your own problem. Train the network with each candidate activation function and compare the results on a held-out validation (development) set, rather than the test set, and go with whichever works best for your application.
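As a hedged sketch of this "try them all" approach, the snippet below uses scikit-learn's MLPClassifier as a convenient stand-in model (it supports 'logistic', 'tanh', and 'relu' hidden-layer activations, though not Leaky ReLU) and compares them on a held-out development set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data, just for illustration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hold out a development (validation) set -- the comparison happens here,
# not on the final test set.
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.25, random_state=0)

for activation in ["logistic", "tanh", "relu"]:
    clf = MLPClassifier(hidden_layer_sizes=(16,), activation=activation,
                        max_iter=500, random_state=0)
    clf.fit(X_train, y_train)
    print(activation, "dev accuracy:", clf.score(X_dev, y_dev))
```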

**The Importance of Activation Functions**

Finally, it's worth noting that activation functions are not just a nicety - they're essential. Without a non-linear activation function, each layer would compute a purely linear (affine) function of its inputs, and a stack of linear layers collapses into a single linear transformation. No matter how many layers you added, the network could never learn anything more complex than a linear model.
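A quick numerical sanity check of that claim: with no activation function, two stacked linear layers are exactly equivalent to one. This is a self-contained illustration, not course code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two "layers" with no activation: z2 = W2 @ (W1 @ x + b1) + b2.
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))

# The composition is itself a single affine map with these parameters:
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2

x = rng.standard_normal((3, 1))
two_layer = W2 @ (W1 @ x + b1) + b2
one_layer = W_eq @ x + b_eq
print(np.allclose(two_layer, one_layer))  # True: two linear layers == one
```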

**The Future of Activation Functions**

As deep learning continues to evolve, we can expect to see new and more advanced activation functions being developed. However, for now, the ReLU and Leaky ReLU activation functions remain two of the most popular and effective choices for hidden layers in neural networks.

Overall, choosing an activation function is just one part of building a successful neural network. By understanding the different options available to you and experimenting with different approaches, you can build more effective models that meet your specific needs.

"WEBVTTKind: captionsLanguage: enwhen you breach a neural network one of the choices you get to make is what activation functions use independent layers as well as at the output unit of your neural network so far we've just been using the sigmoid activation function but sometimes other choices can work much better let's take a look at some of the options in the forward propagation steps for the neural network we have these two steps where we use the sigmoid function here so that sigmoid is called an activation function and G is the familiar sigmoid function N equals 1 over 1 plus e to the negative Z so in the more general case we can have a different function G of Z visually right here where G could be a nonlinear function that may not be the sigmoid function so for example the sigmoid function goes between 0 & 1 an activation function that almost always works better than the sigmoid function is the 10h function or the hyperbolic tangent function so this is Z this is a this is a equals 10 H of Z and this goes between plus 1 and minus 1 the formula for the 10h function is e to the Z minus e to the negative Z over there some and it's actually mathematically a shifted version of the sigmoid function so as a you know sigmoid function just like that but shift it so that it now crosses a zero zero point and rescale so it goes to G minus one and plus one and it turns out that for hidden units if you let the function G of Z be equal to 10 into Z almost always works better than the sigmoid function because with values between plus 1 and minus 1 the mean of the activations that come out of your head and they're closer to having a zero mean and so just as sometimes when you train a learning algorithm you might Center the data and have your data has zero mean using a technique instead of a sigmoid function kind of has the effect of centering your data so that the mean of the data is close to the zero rather than maybe a 0.5 and this actually makes learning for the next layer a little bit easier we'll say more about this in the second course when we talk about optimization algorithms as well but one takeaway is that I pretty much never use the sigmoid activation function anymore the 10h function is almost always strictly superior the one exception is for the output layer because if Y is either 0 or 1 then it makes sense for Y hat to be a number that you want to output that's between 0 and 1 rather than between minus 1 and 1 so the one exception where I would use the sigmoid activation function is when you are using binary classification in which case you values the sigmoid activation function for the output layer so G of Z 2 here is equal to Sigma of Z 2 and so what you see in this example is where you might have a 10h activation function for the hidden layer and sigmoid for the output layer so the activation functions can be different for different layers and sometimes to denote that the activation functions are different for different layers we might use these square brackets superscripts as law to indicate that G of square bracket 1 may be different than G Oh square bracket to Grandma gain square bracket 1 superscript refers to this layer and superscript square bracket 2 refers to the Alpha layer now one of the downsides of both the sigmoid function and the 10-ish function is that it Z is either very large or very small then the gradient or the derivative or the slope of this function becomes very small so Z is very large or Z is very small the slope of the function you know ends up being close to 
zero and so this can slow down gradient descent so one other choice that is very popular in machine learning is what's called the rectified linear unit so the rayleigh function looks like this and a formula is a equals max of 0 comma Z so the derivative is 1 so long as Z is positive and derivative or the slope is 0 when Z is negative if you're implementing this technically the derivative when Z is exactly 0 is not well-defined but when you implement is in the computer the often you get exactly is equal to 0 0 0 0 0 0 0 0 0 0 0 it is very small so you don't need to worry about it in practice you could pretend a derivative when Z is equal to 0 you can pretend is either 1 or 0 MN and you can work just fine so the fact that is not differentiable the fact that so here are some rules of thumb for choosing activation functions if your output is 0 one value if your I'm using binary classification then the sigmoid activation function is a very natural for is for the upper layer and then for all other units on value or the rectified linear unit is increasingly the default choice of activation function so if you're not sure what to use um for your head in there I would just use the relu activation function that's what you see most people using these days although sometimes people also use the tannish activation function once this advantage of the value is that the derivative is equal to 0 when we negative in practice this works just fine but there is another version of the value called the VG value will give you the formula on the next slide but instead of it being zero when Z is negative it just takes a slight slope like so so this is called the Whiskey value this usually works better than the value activation function although it's just not used as much in practice either one should be fine although it you had to pick one I usually just used in random and the advantage of both the value and the least value is that for a lot of the space of Z the derivative of the activation function the slope of the activation function is very different from zero and so in practice using the regular activation function your new network will often learn much faster than you're using the 10 age or the sigmoid activation function and the main reason is that on this less of this effect of the slope of the function going to zero which slows down learning and I know that for half of the range of Z the slope of value is zero but in practice enough of your hidden units will have Z greater than zero so learning can still be quite fast for most training examples so let's just quickly recap there are pros and cons of different activation functions here's the sigmoid activation function I would say never use this except for the output layer if you are doing binary classification when we almost never use this and the reason I almost never use this is because the 10 H is pretty much strictly superior so the 10 inch activation function is this and then the default the most commonly used activation function is the Randy which is this so you're not sure what else you use use this one and maybe you know feel free also to try to leak your value where um might be 0.01 G comma Z right so a is the max of 0.01 times Z and Z so that gives you this some dent in the function you might say you know why is that constant 0.01 well you can also make that another parameter of the learning algorithm and some people say that works even better but I hardly see people do that so but if you feel like trying it in your application you know please feel 
free to do so and and you can just see how it works and how long works and stick with it if it gives you a good result so I hope that gives you a sense of some of the choices of activation functions you can use in your neural network one of the themes we'll see in deep learning is that you often have a lot of different choices in how you build your neural network ranging from number of hidden units to the chosen activation function to how you neutralize the waves which we'll see later a lot of choices like that and it turns out that is sometimes difficult to get good guidelines for exactly what will work best for your problem so throw these three courses out keep on giving you a sense of what I see in the industry in terms of what's more or less popular but for your application with your applications idiosyncrasies it's actually very difficult to know in advance exactly what will work best so a common piece of variance would be if you're not sure which one of these activation functions work press you know try them all and then evaluate on like a hotel validation set or like a development set which we'll talk about later and see which one works better and then go of that and I think that by testing these different choices for your application you'd be better at future proofing your neural network architecture on against the same procedure problem as well evolutions of the algorithms rather than you know if I were to tell you always use a random activation and don't use anything else that that just may or may not apply for whatever problem you end up working on you know either either in the near future on the distant future all right so that was a choice of activation functions you see the most popular activation functions there's one other question that sometimes is asked which is why do you even need to use an activation function at all why not just do away with that so let's talk about that in the neck video and what you see why new network do need some sort of nonlinear activation functionwhen you breach a neural network one of the choices you get to make is what activation functions use independent layers as well as at the output unit of your neural network so far we've just been using the sigmoid activation function but sometimes other choices can work much better let's take a look at some of the options in the forward propagation steps for the neural network we have these two steps where we use the sigmoid function here so that sigmoid is called an activation function and G is the familiar sigmoid function N equals 1 over 1 plus e to the negative Z so in the more general case we can have a different function G of Z visually right here where G could be a nonlinear function that may not be the sigmoid function so for example the sigmoid function goes between 0 & 1 an activation function that almost always works better than the sigmoid function is the 10h function or the hyperbolic tangent function so this is Z this is a this is a equals 10 H of Z and this goes between plus 1 and minus 1 the formula for the 10h function is e to the Z minus e to the negative Z over there some and it's actually mathematically a shifted version of the sigmoid function so as a you know sigmoid function just like that but shift it so that it now crosses a zero zero point and rescale so it goes to G minus one and plus one and it turns out that for hidden units if you let the function G of Z be equal to 10 into Z almost always works better than the sigmoid function because with values between plus 1 and minus 1 
the mean of the activations that come out of your head and they're closer to having a zero mean and so just as sometimes when you train a learning algorithm you might Center the data and have your data has zero mean using a technique instead of a sigmoid function kind of has the effect of centering your data so that the mean of the data is close to the zero rather than maybe a 0.5 and this actually makes learning for the next layer a little bit easier we'll say more about this in the second course when we talk about optimization algorithms as well but one takeaway is that I pretty much never use the sigmoid activation function anymore the 10h function is almost always strictly superior the one exception is for the output layer because if Y is either 0 or 1 then it makes sense for Y hat to be a number that you want to output that's between 0 and 1 rather than between minus 1 and 1 so the one exception where I would use the sigmoid activation function is when you are using binary classification in which case you values the sigmoid activation function for the output layer so G of Z 2 here is equal to Sigma of Z 2 and so what you see in this example is where you might have a 10h activation function for the hidden layer and sigmoid for the output layer so the activation functions can be different for different layers and sometimes to denote that the activation functions are different for different layers we might use these square brackets superscripts as law to indicate that G of square bracket 1 may be different than G Oh square bracket to Grandma gain square bracket 1 superscript refers to this layer and superscript square bracket 2 refers to the Alpha layer now one of the downsides of both the sigmoid function and the 10-ish function is that it Z is either very large or very small then the gradient or the derivative or the slope of this function becomes very small so Z is very large or Z is very small the slope of the function you know ends up being close to zero and so this can slow down gradient descent so one other choice that is very popular in machine learning is what's called the rectified linear unit so the rayleigh function looks like this and a formula is a equals max of 0 comma Z so the derivative is 1 so long as Z is positive and derivative or the slope is 0 when Z is negative if you're implementing this technically the derivative when Z is exactly 0 is not well-defined but when you implement is in the computer the often you get exactly is equal to 0 0 0 0 0 0 0 0 0 0 0 it is very small so you don't need to worry about it in practice you could pretend a derivative when Z is equal to 0 you can pretend is either 1 or 0 MN and you can work just fine so the fact that is not differentiable the fact that so here are some rules of thumb for choosing activation functions if your output is 0 one value if your I'm using binary classification then the sigmoid activation function is a very natural for is for the upper layer and then for all other units on value or the rectified linear unit is increasingly the default choice of activation function so if you're not sure what to use um for your head in there I would just use the relu activation function that's what you see most people using these days although sometimes people also use the tannish activation function once this advantage of the value is that the derivative is equal to 0 when we negative in practice this works just fine but there is another version of the value called the VG value will give you the formula on the next slide but 
instead of it being zero when Z is negative it just takes a slight slope like so so this is called the Whiskey value this usually works better than the value activation function although it's just not used as much in practice either one should be fine although it you had to pick one I usually just used in random and the advantage of both the value and the least value is that for a lot of the space of Z the derivative of the activation function the slope of the activation function is very different from zero and so in practice using the regular activation function your new network will often learn much faster than you're using the 10 age or the sigmoid activation function and the main reason is that on this less of this effect of the slope of the function going to zero which slows down learning and I know that for half of the range of Z the slope of value is zero but in practice enough of your hidden units will have Z greater than zero so learning can still be quite fast for most training examples so let's just quickly recap there are pros and cons of different activation functions here's the sigmoid activation function I would say never use this except for the output layer if you are doing binary classification when we almost never use this and the reason I almost never use this is because the 10 H is pretty much strictly superior so the 10 inch activation function is this and then the default the most commonly used activation function is the Randy which is this so you're not sure what else you use use this one and maybe you know feel free also to try to leak your value where um might be 0.01 G comma Z right so a is the max of 0.01 times Z and Z so that gives you this some dent in the function you might say you know why is that constant 0.01 well you can also make that another parameter of the learning algorithm and some people say that works even better but I hardly see people do that so but if you feel like trying it in your application you know please feel free to do so and and you can just see how it works and how long works and stick with it if it gives you a good result so I hope that gives you a sense of some of the choices of activation functions you can use in your neural network one of the themes we'll see in deep learning is that you often have a lot of different choices in how you build your neural network ranging from number of hidden units to the chosen activation function to how you neutralize the waves which we'll see later a lot of choices like that and it turns out that is sometimes difficult to get good guidelines for exactly what will work best for your problem so throw these three courses out keep on giving you a sense of what I see in the industry in terms of what's more or less popular but for your application with your applications idiosyncrasies it's actually very difficult to know in advance exactly what will work best so a common piece of variance would be if you're not sure which one of these activation functions work press you know try them all and then evaluate on like a hotel validation set or like a development set which we'll talk about later and see which one works better and then go of that and I think that by testing these different choices for your application you'd be better at future proofing your neural network architecture on against the same procedure problem as well evolutions of the algorithms rather than you know if I were to tell you always use a random activation and don't use anything else that that just may or may not apply for whatever 
problem you end up working on you know either either in the near future on the distant future all right so that was a choice of activation functions you see the most popular activation functions there's one other question that sometimes is asked which is why do you even need to use an activation function at all why not just do away with that so let's talk about that in the neck video and what you see why new network do need some sort of nonlinear activation function\n"