#17 Machine Learning Specialization [Course 1, Week 1, Lesson 4]

# Understanding Gradient Descent: An In-Depth Look

Gradient descent is a fundamental algorithm in machine learning used to minimize cost functions and optimize model parameters. To gain a better intuition about how it works, let’s dive deeper into the concepts behind this algorithm.

## The Gradient Descent Algorithm Revisited

In its basic form, the gradient descent algorithm updates the model parameters \( W \) and \( B \) iteratively to minimize a cost function \( J(W, B) \). The update rule for \( W \) is:

\[
W := W - \alpha \frac{d}{dW}J(W, B)
\]

and \( B \) is updated in the same way, using the derivative of \( J \) with respect to \( B \).

Here, \( \alpha \) (the learning rate) controls the size of the step taken at each iteration: a larger \( \alpha \) results in bigger steps, while a smaller \( \alpha \) leads to more incremental updates. The term \( \frac{d}{dW}J(W, B) \) is the derivative of the cost function with respect to the parameter \( W \). It is the slope of the cost function in the \( W \) direction at the current point: its sign tells you which way is uphill, and its magnitude tells you how steep the function is there.

It’s worth noting that in multivariate calculus, this derivative is technically a partial derivative since we’re dealing with multiple parameters. However, for the purposes of implementation, we’ll refer to it simply as the "derivative."
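
To make the update rule concrete, here is a minimal sketch in Python. This is not the course's code: the function name and the way the derivatives are passed in are illustrative assumptions, and the derivatives are treated as numbers already evaluated at the current \( W \) and \( B \).

```python
def gradient_descent_step(W, B, dJ_dW, dJ_dB, alpha):
    """One gradient descent update for two parameters.

    dJ_dW and dJ_dB are the derivatives of the cost J with respect to
    W and B, evaluated at the current values of W and B.
    """
    W_new = W - alpha * dJ_dW  # step opposite to the slope in the W direction
    B_new = B - alpha * dJ_dB  # step opposite to the slope in the B direction
    return W_new, B_new        # both parameters are updated together
```

Both new values are computed from the current \( W \) and \( B \) before either is overwritten, which is the standard simultaneous-update convention for gradient descent.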

## Building Intuition with a Single Parameter

To build intuition, let’s consider a simpler scenario where we minimize a cost function \( J \) with only one parameter \( W \). In this case, the gradient descent algorithm simplifies to:

\[
W := W - \alpha \frac{d}{dW}J(W)
\]

Imagine starting at an initial value of \( W \) on the graph of the cost function \( J(W) \), where the horizontal axis represents \( W \) and the vertical axis represents the cost \( J(W) \). The goal is to find the minimum of this function.
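
Before walking through the two examples, here is a small runnable sketch of this one-parameter setting. The toy cost \( J(W) = (W - 3)^2 \), the starting value, and the learning rate are made-up choices for illustration only; the point is simply to watch the update rule pull \( W \) toward the minimum at \( W = 3 \).

```python
# Toy one-parameter cost for illustration: J(W) = (W - 3)^2,
# which has its minimum at W = 3 and derivative dJ/dW = 2 * (W - 3).
def J(W):
    return (W - 3) ** 2

def dJ_dW(W):
    return 2 * (W - 3)

W = 0.0       # assumed starting value
alpha = 0.1   # assumed learning rate
for _ in range(25):
    W = W - alpha * dJ_dW(W)   # the one-parameter update rule
print(W, J(W))                 # W is now close to 3 and the cost is close to 0
```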

### Example 1: Starting with a Positive Slope

Let’s say you initialize gradient descent at a point on the function where the tangent line has a positive slope, say \( \frac{2}{1} = 2 \) (a rise of 2 for a run of 1 along the tangent line). Locally, this means that for every unit increase in \( W \), the cost increases by roughly 2 units.

Since the derivative \( \frac{d}{dW}J(W) \) is positive, the update rule becomes:

\[
W := W - \alpha \times (\text{positive number})
\]

This results in a decrease in \( W \): on the graph, you move to the left. Because the cost is rising to the right at this point, moving left reduces \( J(W) \) and brings you closer to the minimum of the function.
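
As a quick numeric check (the starting value, slope, and learning rate below are arbitrary illustrative numbers), subtracting \( \alpha \) times a positive derivative does move \( W \) to the left:

```python
W, alpha = 5.0, 0.1
slope = 2.0                  # tangent line slopes up to the right
W_new = W - alpha * slope
print(W_new)                 # 4.8: W decreased, i.e. we moved left on the graph
```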

### Example 2: Starting with a Negative Slope

Now, let’s consider initializing gradient descent at a different point on the function, where the tangent line has a negative slope, say \( \frac{-2}{1} = -2 \) (a rise of \( -2 \) for a run of 1 along the tangent line). Locally, this means that for every unit increase in \( W \), the cost decreases by roughly 2 units.

In this case, the derivative \( \frac{d}{dW}J(W) \) is negative. The update rule becomes:

\[
W := W - \alpha \times (\text{negative number}) = W + \alpha \times |\text{negative number}|
\]

This results in an increase in \( W \): on the graph, you move to the right. Because the cost is falling to the right at this point, moving right reduces \( J(W) \) and again brings you closer to the minimum of the function.
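
The same kind of numeric check (again with arbitrary illustrative numbers) shows the opposite behavior when the derivative is negative:

```python
W, alpha = -5.0, 0.1
slope = -2.0                 # tangent line slopes down to the right
W_new = W - alpha * slope
print(W_new)                 # -4.8: W increased, i.e. we moved right on the graph
```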

These examples demonstrate how the derivative term steers gradient descent in the correct direction: it decreases \( W \) when the slope is positive and increases \( W \) when the slope is negative, so each step moves downhill toward the minimum of the cost function.

## The Role of the Learning Rate (\( \alpha \))

The learning rate \( \alpha \) determines the size of the step taken at each iteration. If \( \alpha \) is too small, gradient descent still works but may take a very long time to converge to the minimum. If \( \alpha \) is too large, the algorithm can overshoot the minimum, causing the cost to oscillate or even diverge.
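
A tiny experiment makes both failure modes visible. The cost \( J(W) = W^2 \) (whose derivative is \( 2W \)) and the specific learning rates below are assumptions chosen purely for illustration:

```python
def run(alpha, steps=10, W=1.0):
    """Run gradient descent on J(W) = W**2, whose derivative is 2 * W."""
    for _ in range(steps):
        W = W - alpha * 2 * W
    return W

print(run(alpha=0.01))  # ~0.82: steps too small, still far from the minimum at 0
print(run(alpha=0.1))   # ~0.11: reasonable steps, close to the minimum
print(run(alpha=1.1))   # ~6.19: steps too large, |W| grows and the updates diverge
```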

In the next video, we’ll explore how to choose an appropriate learning rate and discuss strategies for optimizing its value in practice.

## Conclusion

By breaking down the components of the gradient descent algorithm and examining their roles, we can see why it’s such a powerful tool for optimization. The derivative term provides the direction for movement along the cost function landscape, while the learning rate controls the step size. Together, they work to iteratively adjust the parameters \( W \) and \( B \), bringing us closer to the minimum of the cost function.

Understanding these concepts not only builds intuition but also empowers you to make informed decisions when implementing gradient descent in your own machine learning projects. Stay tuned for the next video, where we’ll delve deeper into the learning rate \( \alpha \) and explore strategies for selecting an optimal value.

"WEBVTTKind: captionsLanguage: ennow let's dive more deeply into gradient descent to gain better intuition about what it's doing and why it might make sense here's the gradient descent algorithm that you saw in the previous video and as a reminder this variable this Greek symbol Alpha is the learning rate and the learning rate controls how big of a step you take when updating the model's parameters W and B so this term here this D over d w this is the derivative term and by convention in math this D is written with this funny font here and in case anyone watching this has a PHD in math or as an expert in multivariate calculus they may be wondering that's not the derivative that's the partial derivative and yes they'd be right but for the purposes of implementing a machine learning algorithm I'm just going to call it derivative and don't worry about these little distinctions so what we're going to focus on now is get more intuition about what this learning rate and what this derivative are doing and why when multiplied together like this it results in updates to parameters W and B that makes sense in order to do this let's use a slightly simpler example where we work on minimizing just one parameter so let's say that you have a cost function J of just one parameter W where W is a number this means that gradient descents now looks like this W is updated to W minus the learning rate Alpha times D over d w of J of w and you're trying to minimize the cost by adjusting the parameter w so this is like our previous example where we had temporarily set b equal to zero with one parameter W instead of two you can look at two dimensional graphs of the cost function j instead of three dimensional graphs let's look at what gradient descents does on this function J of w here are the horizontal axis is parameter W and on the vertical axis is the cost J of w now let's initialize gradient descent with some starting value for w let's initialize it at this location so imagine that you start off at this point right here on the function J what gradient descents would do is it will update W to be W minus learning rate Alpha times D over d w of J of w let's look at what this derivative term here means a way to think about the derivative at this point on the line is to draw a tangent line which is a straight line that touches this curve at that point enough the slope of this line is the derivative of the function J at this point and to get the slope you can draw another triangle like this and if you compute the height divided by the width of this triangle that is the slope so for example the slope might be you know two over one for instance and when the tangent line is pointing up into the right the slope is positive which means that this derivative is a positive number so is greater than zero and so the updated W is going to be W minus the learning rate times some positive number the learning rate is always a positive number so if you take W minus a positive number you end up with a new value for w that is smaller so on the graph we are moving to the left we are decreasing the value of w and you may notice that this is the right thing to do if you go is to decrease the cost J because when we move toward the left on this curve the COS J decreases and you're getting closer to the minimum for J which is over here so so far gradient descent seems to be doing the right thing now let's look at another example let's take the same function G of WS above and now let's say that you initialize gradient descent at a 
different location Say by choosing a starting value for w that's over here on the left so that's this point of the function J now the derivative term remember is D over d w of J of w and when we look at the tangent line at this point over here the slope of this line is the derivative of J at this point but this tangent line is sloping down into the right and so this line sloping down into the right has a negative slope in other words the derivative k at this point is a negative number for instance if you draw a triangle then the height like this is negative 2 and the width is one so the slope is negative two divided by 1 which is negative 2 which is a negative number so when you update W you get W minus the learning rate times a negative number and so this means you subtract from w a negative number but subtracting a negative number means adding a positive number and so you end up increasing w because subtracting the negative number is the same as adding a positive number to w so this step of gradient descent causes W to increase which means you're moving to the right of the graph and your cos J has decreased down to here and again it looks like gradient descent is doing something reasonable it's getting you closer to the minimum so hopefully these last two examples show some of the intuition behind what the derivative term is doing and why this helps gradient descent change W to get you closer to the minimum I hope this video gave you some sense for why the derivative term in gradient descent makes sense one other key quantity integrating the centavio is the learning rate Alpha how do you choose Alpha what happens if it's too small or what happens if it's too big in the next video Let's Take a deeper look at the parameter Alpha to help build intuitions about what it does as well as how to make a good choice for a good value of alpha for your implementation of creating descentnow let's dive more deeply into gradient descent to gain better intuition about what it's doing and why it might make sense here's the gradient descent algorithm that you saw in the previous video and as a reminder this variable this Greek symbol Alpha is the learning rate and the learning rate controls how big of a step you take when updating the model's parameters W and B so this term here this D over d w this is the derivative term and by convention in math this D is written with this funny font here and in case anyone watching this has a PHD in math or as an expert in multivariate calculus they may be wondering that's not the derivative that's the partial derivative and yes they'd be right but for the purposes of implementing a machine learning algorithm I'm just going to call it derivative and don't worry about these little distinctions so what we're going to focus on now is get more intuition about what this learning rate and what this derivative are doing and why when multiplied together like this it results in updates to parameters W and B that makes sense in order to do this let's use a slightly simpler example where we work on minimizing just one parameter so let's say that you have a cost function J of just one parameter W where W is a number this means that gradient descents now looks like this W is updated to W minus the learning rate Alpha times D over d w of J of w and you're trying to minimize the cost by adjusting the parameter w so this is like our previous example where we had temporarily set b equal to zero with one parameter W instead of two you can look at two dimensional graphs of the cost 
function j instead of three dimensional graphs let's look at what gradient descents does on this function J of w here are the horizontal axis is parameter W and on the vertical axis is the cost J of w now let's initialize gradient descent with some starting value for w let's initialize it at this location so imagine that you start off at this point right here on the function J what gradient descents would do is it will update W to be W minus learning rate Alpha times D over d w of J of w let's look at what this derivative term here means a way to think about the derivative at this point on the line is to draw a tangent line which is a straight line that touches this curve at that point enough the slope of this line is the derivative of the function J at this point and to get the slope you can draw another triangle like this and if you compute the height divided by the width of this triangle that is the slope so for example the slope might be you know two over one for instance and when the tangent line is pointing up into the right the slope is positive which means that this derivative is a positive number so is greater than zero and so the updated W is going to be W minus the learning rate times some positive number the learning rate is always a positive number so if you take W minus a positive number you end up with a new value for w that is smaller so on the graph we are moving to the left we are decreasing the value of w and you may notice that this is the right thing to do if you go is to decrease the cost J because when we move toward the left on this curve the COS J decreases and you're getting closer to the minimum for J which is over here so so far gradient descent seems to be doing the right thing now let's look at another example let's take the same function G of WS above and now let's say that you initialize gradient descent at a different location Say by choosing a starting value for w that's over here on the left so that's this point of the function J now the derivative term remember is D over d w of J of w and when we look at the tangent line at this point over here the slope of this line is the derivative of J at this point but this tangent line is sloping down into the right and so this line sloping down into the right has a negative slope in other words the derivative k at this point is a negative number for instance if you draw a triangle then the height like this is negative 2 and the width is one so the slope is negative two divided by 1 which is negative 2 which is a negative number so when you update W you get W minus the learning rate times a negative number and so this means you subtract from w a negative number but subtracting a negative number means adding a positive number and so you end up increasing w because subtracting the negative number is the same as adding a positive number to w so this step of gradient descent causes W to increase which means you're moving to the right of the graph and your cos J has decreased down to here and again it looks like gradient descent is doing something reasonable it's getting you closer to the minimum so hopefully these last two examples show some of the intuition behind what the derivative term is doing and why this helps gradient descent change W to get you closer to the minimum I hope this video gave you some sense for why the derivative term in gradient descent makes sense one other key quantity integrating the centavio is the learning rate Alpha how do you choose Alpha what happens if it's too small or what happens if 
it's too big in the next video Let's Take a deeper look at the parameter Alpha to help build intuitions about what it does as well as how to make a good choice for a good value of alpha for your implementation of creating descent\n"