Derivatives Of Activation Functions (C1W3L08)

**Introduction to Derivatives and Activation Functions in Neural Networks**

In the context of neural networks, derivatives play a crucial role in optimizing the performance of the network. A derivative measures the rate of change of a function with respect to one of its inputs. In this article, we will delve into the world of derivatives and activation functions, which are essential building blocks of neural networks.

**Sigmoid Activation Function**

The sigmoid activation function is one of the most commonly used activation functions in neural networks. It is defined as G(Z) = 1 / (1 + e^(-Z)), where Z is the input to the function. The derivative of this function, denoted dG/dZ, can be computed using calculus: dG/dZ = G(Z) * (1 - G(Z)). Writing a = G(Z) for the activation value, this simplifies to a * (1 - a), which makes the derivative very cheap to compute once a is known. As a sanity check, when Z is very large a is close to 1 and the slope is close to 0; when Z is very negative a is close to 0 and the slope is again close to 0; and when Z = 0, a = 1/2 and the slope is 1/4.
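As a minimal sketch (the function names `sigmoid` and `sigmoid_prime` are our own, not from any particular library), the derivative can be computed by reusing the activation value:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: G(Z) = 1 / (1 + e^(-Z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: G(Z) * (1 - G(Z)) = a * (1 - a)."""
    a = sigmoid(z)
    return a * (1.0 - a)

# Sanity checks: slope is 0.25 at Z = 0 and nearly 0 for large |Z|.
print(sigmoid_prime(0.0))   # 0.25
print(sigmoid_prime(10.0))  # ~4.5e-05
```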

**Hyperbolic Tangent Activation Function**

The hyperbolic tangent activation function is another widely used activation function in neural networks. It is defined as G(Z) = (e^Z - e^(-Z)) / (e^Z + e^(-Z)). The derivative of this function, denoted dG/dZ, can be computed using calculus: dG/dZ = 1 - tanh(Z)^2, which, writing a = G(Z), simplifies to 1 - a^2. This makes the derivative easy to compute once the activation a is known: the slope approaches 0 for very large or very negative Z, and equals 1 at Z = 0.
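A corresponding sketch, using NumPy's built-in `np.tanh` and a hypothetical `tanh_prime` helper:

```python
import numpy as np

def tanh_prime(z):
    """Derivative of tanh: 1 - tanh(Z)^2 = 1 - a^2."""
    a = np.tanh(z)
    return 1.0 - a ** 2

print(tanh_prime(0.0))   # 1.0 -- maximum slope at Z = 0
print(tanh_prime(10.0))  # ~8.2e-09 -- nearly flat for large |Z|
```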

**ReLU Activation Function**

The ReLU activation function is a widely used activation function in neural networks. It is defined as G(Z) = max(0, Z). Its derivative is 0 if Z is less than zero and 1 if Z is greater than zero; it is technically undefined when Z is exactly equal to zero. In practice, it is common to set the derivative at Z = 0 to either 1 or 0 when computing gradients; either choice is a valid subgradient, and since the chance of Z being exactly zero is vanishingly small, gradient descent works just fine.
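For example, a minimal sketch that adopts the convention of returning 1 at Z = 0 (the helper name `relu_prime` is ours):

```python
import numpy as np

def relu_prime(z):
    """Derivative of ReLU: 0 for Z < 0, 1 for Z > 0.
    At Z == 0 we assign 1 by convention; 0 would work equally well."""
    return np.where(z >= 0, 1.0, 0.0)

print(relu_prime(np.array([-2.0, 0.0, 3.0])))  # [0. 1. 1.]
```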

**Leaky ReLU Activation Function**

The Leaky ReLU activation function is another widely used activation function in neural networks. It is defined as G(Z) = max(0.01 * Z, Z). Its derivative is dG/dZ = 0.01 if Z is less than zero and 1 if Z is greater than zero. Once again, the derivative is technically undefined when Z is exactly equal to zero, and in practice either value can be used at that point.
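The same idea extends directly to Leaky ReLU; here the negative-side slope of 0.01 is exposed as a parameter (this helper is illustrative, not from a specific library):

```python
import numpy as np

def leaky_relu_prime(z, slope=0.01):
    """Derivative of Leaky ReLU: `slope` for Z < 0, 1 for Z > 0.
    At Z == 0 we assign 1; either value works in practice."""
    return np.where(z >= 0, 1.0, slope)

print(leaky_relu_prime(np.array([-2.0, 0.0, 3.0])))  # [0.01 1.   1.  ]
```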

**Conclusion**

In conclusion, derivatives play a crucial role in optimizing the performance of neural networks. The sigmoid, hyperbolic tangent, ReLU, and Leaky ReLU activation functions are four commonly used activation functions in neural networks. Each of these activation functions has its own set of properties, including its derivative, which is needed to optimize the network's parameters. By understanding these derivatives, developers can implement efficient algorithms for training neural networks.

**Implementation of Derivatives**

Now that we have covered the building blocks of neural networks, including derivatives and activation functions, it is time to talk about implementing these concepts in software. The implementation of the derivative depends on the specific activation function being used. For the sigmoid activation function, the derivative can be computed as dG/dZ = a * (1 - a), where a = G(Z); for the hyperbolic tangent, it is dG/dZ = 1 - a^2. For ReLU and Leaky ReLU, the derivative is piecewise constant (0 or 1, and 0.01 or 1, respectively), and in practice developers simply pick one of those values at the single point Z = 0 where the derivative is undefined.
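Putting the four cases together, a minimal sketch of a backward-pass helper might look like the following; the function name `activation_prime` and its string-based interface are our own choices, not part of any particular framework:

```python
import numpy as np

def activation_prime(z, kind):
    """Return g'(z) elementwise for the named activation function."""
    if kind == "sigmoid":
        a = 1.0 / (1.0 + np.exp(-z))
        return a * (1.0 - a)               # a * (1 - a)
    if kind == "tanh":
        a = np.tanh(z)
        return 1.0 - a ** 2                # 1 - a^2
    if kind == "relu":
        return np.where(z >= 0, 1.0, 0.0)  # convention: 1 at Z == 0
    if kind == "leaky_relu":
        return np.where(z >= 0, 1.0, 0.01)
    raise ValueError(f"unknown activation: {kind}")

print(activation_prime(np.array([-1.0, 0.0, 1.0]), "tanh"))
```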

**Gradient Descent**

With the building blocks and implementation details covered, it's time to talk about gradient descent, an essential algorithm for training neural networks. Gradient descent is a widely used optimization algorithm that iteratively adjusts the model parameters in the direction opposite the gradient of the loss function in order to minimize it. The activation-function derivatives above are exactly what back-propagation uses to compute those gradients efficiently.
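As a purely illustrative sketch of the update rule (not a full training loop, and with placeholder gradients standing in for values that back-propagation would normally produce):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))    # weight matrix for one layer
b = np.zeros((4, 1))               # bias vector
dW = rng.standard_normal((4, 3))   # placeholder gradient of the loss w.r.t. W
db = rng.standard_normal((4, 1))   # placeholder gradient of the loss w.r.t. b

learning_rate = 0.01
W = W - learning_rate * dW         # step each parameter against its gradient
b = b - learning_rate * db
```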

In summary, this article has provided an in-depth look at derivatives and activation functions in neural networks. We have covered four commonly used activation functions: sigmoid, hyperbolic tangent, ReLU, and Leaky ReLU. Each has a simple derivative that can be computed directly from the activation value, and these derivatives are what back-propagation and gradient descent rely on to train the network. By understanding these concepts, developers can implement efficient training algorithms and build models that solve complex problems.

"WEBVTTKind: captionsLanguage: enwhen you implement back-propagation for your neural network you need to really compute the slope or the derivative of the activation functions so let's take a look at our choices of activation functions and how you can compute the slope of these functions can see familiar sigmoid activation function and so for any given value of Z maybe this value of z this function will have some slope or some derivative corresponding to if you draw a rule line there you know the height over width of this little triangle here so if G of Z is the sigmoid function then the slope of the function is d DZ G of Z and so we know from calculus that this is the slope of G of X and Z and if you are familiar with calculus and know how to take derivatives if you take the derivative of the sigmoid function it is possible to show that it is equal to this formula and again I'm not going to do the calculus steps but if you're familiar with calculus feel free to pause the video and try to prove this yourself and so this is equal to just G of Z times 1 minus G of Z so let's just sanity check that this expression makes sense first if Z is very large so say Z is equal to 10 then G of Z will be close to 1 and so the form that we have on the Left tells us that D DZ G of Z does be close to G of Z which is equal to 1 times 1 minus 1 which is therefore very close to 0 and this isn't D correct because when Z is very launched the slope is close to 0 conversely of Z is equal to minus 10 so there's no way out there then G of Z is close to 0 so the following on the left tells us d DZ G of Z will be close to G of Z which is 0 times 1 line is 0 and so it is also very close to 0 or Sakura finally a Z is equal to zero then G of Z is equal to one-half as a sigmoid function right here and so the derivative is on equal to 1/2 times 1 minus 1/2 which is equal to 1/4 and that actually is turns out to be the correct value of the derivative or the slope of this function when Z is equal to 0 finally just to introduce one more piece of notation sometimes instead of writing this thing the shorthand for the derivative is G prime of Z so G prime of Z in calculus the the little dash on top is called time because of G prime of Z is a shorthand for the in calculus for the derivative of the function of G with respect to the input variable Z um and then in a neural network we have a equals G of Z right equals this then this formula also simplifies to a times 1 minus a so sometimes the implementation you might see something like G prime of Z equals a times 1 minus a and that just refers to you know the observation that G prime which is means derivative is equal to this over here and the advantage of this formula is that if you've already computed the value for a then by using this expression you can very quickly compute the value for the slope for G prime s all right so that was the sigmoid activation function let's now look at the Technic activation function similar to what we had previously the definition of d DZ G of Z is the slope of G of Z at a particular point of Z and if you look at the formula for the hyperbolic tangent function on any of you know calculus you can take derivatives and show that this simplifies to this formula and using the own shorthand we had previously when we call this G prime of Z you gain so if you want you can sanity check that this formula make sense so for example if Z is equal to 10 10 H of Z will be very close to 1 this goes from plus 1 to minus 1 and then G prime of Z according to this 
formula will be about 1 minus 1 squared so terms are equal to 0 so that was a Z is very large the slope is close to zero conversely a Z is very small say Z is equal to minus 10 then 10 H of Z will be close to minus 1 and so G prime of Z will be close to 1 minus negative 1 squared so it's close to 1 minus 1 which is also close to 0 and finally is equal to 0 then 10 H of Z is equal to 0 and then the slope is actually equal to 1 which is we selected a slope point um z is equal to 0 so just to summarize if a is equal to G of Z so if a is equal to this channel Z then the derivative G prime of Z is equal to 1 minus a squared so once again if you've already computed the value of a you can use this formula to very quickly compute the derivative as well finally here's how you compute the derivatives for the value and leakey relu activation functions for the value g of z is equal to max of 0 comma Z so the derivative is equal to you turns out to be 0 if Z is less than 0 and 1 if Z is greater than 0 and is actually our undefined technically undefined as V is equal to exactly 0 but um if you're implementing this in software it might not be a hundred percent mathematic correct but I work just fine if you it's V is exactly really zero if you set the derivative equal to 1 or decide to be zero it kind of doesn't matter if you're a Nixon of Malaysian technically G prime then becomes what's called a sub gradient of the activation function G of Z which is why gradient descent still works but you can think of it as that the chance of Z being you know zero point exactly zero zero zero is so small that it almost doesn't matter what you set the derivative to be equal to when Z is equal to zero so in practice this is what people implement for the derivative of Z and finally if you are trading on your own network with the we here a Luo activation function then G of Z is going to be max of say 0.01 Z comma Z and so G prime of Z is equal to 0.01 if Z is less than zero and 1 if Z is greater than zero and once again the gradient is technically not defined when Z is exactly equal to zero but if you implement a piece of code that sets the derivative or the essentially Prime's either a zero point zero one or two one either way it doesn't really matter when Z is exactly zero your co-workers so arms of these formulas you should either compute the slopes or the derivatives of your activation assumptions now we have this building blocks you're ready to see how to implement gradient descent for your neural network let's go into the next videos you see thatwhen you implement back-propagation for your neural network you need to really compute the slope or the derivative of the activation functions so let's take a look at our choices of activation functions and how you can compute the slope of these functions can see familiar sigmoid activation function and so for any given value of Z maybe this value of z this function will have some slope or some derivative corresponding to if you draw a rule line there you know the height over width of this little triangle here so if G of Z is the sigmoid function then the slope of the function is d DZ G of Z and so we know from calculus that this is the slope of G of X and Z and if you are familiar with calculus and know how to take derivatives if you take the derivative of the sigmoid function it is possible to show that it is equal to this formula and again I'm not going to do the calculus steps but if you're familiar with calculus feel free to pause the video and try to prove this 
yourself and so this is equal to just G of Z times 1 minus G of Z so let's just sanity check that this expression makes sense first if Z is very large so say Z is equal to 10 then G of Z will be close to 1 and so the form that we have on the Left tells us that D DZ G of Z does be close to G of Z which is equal to 1 times 1 minus 1 which is therefore very close to 0 and this isn't D correct because when Z is very launched the slope is close to 0 conversely of Z is equal to minus 10 so there's no way out there then G of Z is close to 0 so the following on the left tells us d DZ G of Z will be close to G of Z which is 0 times 1 line is 0 and so it is also very close to 0 or Sakura finally a Z is equal to zero then G of Z is equal to one-half as a sigmoid function right here and so the derivative is on equal to 1/2 times 1 minus 1/2 which is equal to 1/4 and that actually is turns out to be the correct value of the derivative or the slope of this function when Z is equal to 0 finally just to introduce one more piece of notation sometimes instead of writing this thing the shorthand for the derivative is G prime of Z so G prime of Z in calculus the the little dash on top is called time because of G prime of Z is a shorthand for the in calculus for the derivative of the function of G with respect to the input variable Z um and then in a neural network we have a equals G of Z right equals this then this formula also simplifies to a times 1 minus a so sometimes the implementation you might see something like G prime of Z equals a times 1 minus a and that just refers to you know the observation that G prime which is means derivative is equal to this over here and the advantage of this formula is that if you've already computed the value for a then by using this expression you can very quickly compute the value for the slope for G prime s all right so that was the sigmoid activation function let's now look at the Technic activation function similar to what we had previously the definition of d DZ G of Z is the slope of G of Z at a particular point of Z and if you look at the formula for the hyperbolic tangent function on any of you know calculus you can take derivatives and show that this simplifies to this formula and using the own shorthand we had previously when we call this G prime of Z you gain so if you want you can sanity check that this formula make sense so for example if Z is equal to 10 10 H of Z will be very close to 1 this goes from plus 1 to minus 1 and then G prime of Z according to this formula will be about 1 minus 1 squared so terms are equal to 0 so that was a Z is very large the slope is close to zero conversely a Z is very small say Z is equal to minus 10 then 10 H of Z will be close to minus 1 and so G prime of Z will be close to 1 minus negative 1 squared so it's close to 1 minus 1 which is also close to 0 and finally is equal to 0 then 10 H of Z is equal to 0 and then the slope is actually equal to 1 which is we selected a slope point um z is equal to 0 so just to summarize if a is equal to G of Z so if a is equal to this channel Z then the derivative G prime of Z is equal to 1 minus a squared so once again if you've already computed the value of a you can use this formula to very quickly compute the derivative as well finally here's how you compute the derivatives for the value and leakey relu activation functions for the value g of z is equal to max of 0 comma Z so the derivative is equal to you turns out to be 0 if Z is less than 0 and 1 if Z is greater than 0 and is 
actually our undefined technically undefined as V is equal to exactly 0 but um if you're implementing this in software it might not be a hundred percent mathematic correct but I work just fine if you it's V is exactly really zero if you set the derivative equal to 1 or decide to be zero it kind of doesn't matter if you're a Nixon of Malaysian technically G prime then becomes what's called a sub gradient of the activation function G of Z which is why gradient descent still works but you can think of it as that the chance of Z being you know zero point exactly zero zero zero is so small that it almost doesn't matter what you set the derivative to be equal to when Z is equal to zero so in practice this is what people implement for the derivative of Z and finally if you are trading on your own network with the we here a Luo activation function then G of Z is going to be max of say 0.01 Z comma Z and so G prime of Z is equal to 0.01 if Z is less than zero and 1 if Z is greater than zero and once again the gradient is technically not defined when Z is exactly equal to zero but if you implement a piece of code that sets the derivative or the essentially Prime's either a zero point zero one or two one either way it doesn't really matter when Z is exactly zero your co-workers so arms of these formulas you should either compute the slopes or the derivatives of your activation assumptions now we have this building blocks you're ready to see how to implement gradient descent for your neural network let's go into the next videos you see that\n"