#34 Machine Learning Specialization [Course 1, Week 3, Lesson 2]

The output f is always between 0 and 1 because the output of logistic regression is always between 0 and 1. The only part of the function that's relevant is therefore the part corresponding to f between 0 and 1. So let's zoom in and take a closer look at this part of the graph. If the algorithm predicts a probability close to 1 and the true label is 1, then the loss is very small. It's pretty much zero, because you're very close to the right answer.

Now, continuing with the example of the true label y being 1, say it really is a malignant tumor. If the algorithm predicts 0.5, then the loss is at this point here, which is a bit higher but not that high. In contrast, if the algorithm were to output 0.1, if it thinks that there's only a 10% chance of the tumor being malignant but y really is 1, it really is malignant, then the loss is this much higher value over here. So when y is equal to 1, the loss function incentivizes, or nudges, the algorithm to make more accurate predictions, because the loss is lowest when it predicts values close to 1.
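To put numbers on the three points read off the plot, here is a minimal sketch of the y equals 1 branch of the loss. The function name `loss_when_y_is_1` is made up for illustration, and the natural logarithm is an assumption, since the lecture only says "log".

```python
import math

def loss_when_y_is_1(f):
    """Loss for a positive example (y = 1): -log(f), for 0 < f <= 1."""
    return -math.log(f)

# The three predictions discussed above: near 1, 0.5, and 0.1.
for f in (0.99, 0.5, 0.1):
    print(f"f = {f:<4}  loss = {loss_when_y_is_1(f):.3f}")
```

With the natural log, the loss comes out at roughly 0.01, 0.69 and 2.30, so it grows as the prediction moves away from the true label 1.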

On this slide, we've been looking at what the loss is when y is equal to 1. Now let's look at the second part of the loss function, corresponding to when y is equal to 0. In this case, the loss is negative log of 1 minus f of x. When this function is plotted, it actually looks like this. The range of f is limited to 0 to 1 because logistic regression only outputs values between 0 and 1. If we zoom in, this is what it looks like. In this plot, corresponding to y equals 0, the vertical axis shows the value of the loss for different values of f of x. When f is 0 or very close to 0, the loss is also going to be very small, which means that if the true label is 0 and the model's prediction is very close to 0, well, you nearly got it right, so the loss is appropriately very close to 0.

The larger the value of f of x gets, the bigger the loss, because the prediction is further from the true label 0. In fact, as that prediction approaches 1, the loss actually approaches infinity. Going back to the tumor prediction example, this says that if a model predicts that the patient's tumor is almost certain to be malignant, say a 99.9% chance of malignancy, but it turns out to actually not be malignant, so y equals 0, then we penalize the model with a very high loss. So in this case of y equals 0, similar to the case of y equals 1 on the previous slide, the further the prediction f of x is from the true value of y, the higher the loss. In fact, on the previous slide, with y equal to 1, if f of x approaches 0, the loss grows really large and approaches infinity.
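The way the y equals 0 branch blows up as the prediction approaches 1 can be checked numerically. This is only a sketch: the function name `loss_when_y_is_0` is hypothetical, and the natural logarithm is assumed.

```python
import math

def loss_when_y_is_0(f):
    """Loss for a negative example (y = 0): -log(1 - f), for 0 <= f < 1."""
    return -math.log(1.0 - f)

# Predictions that move closer and closer to 1 while the true label is 0.
for f in (0.001, 0.5, 0.9, 0.999, 0.999999):
    print(f"f = {f:<8}  loss = {loss_when_y_is_0(f):.3f}")
```

The loss is close to zero for a prediction of 0.001 and keeps growing without bound as f gets closer to 1, which matches the 99.9% example above.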

So when the true label is 1, the algorithm is strongly incentivized not to predict something too close to 0. So we've seen a lot in this video. In the next video, let's go back and take the loss function for a single training example and use that to define the overall cost function for the entire training set. We'll also figure out a simpler way to write out the cost function, which will then later allow us to run gradient descent to find good parameters for logistic regression.

Remember that the cost function gives you a way to measure how well a specific set of parameters fits the training data, and it thereby gives you a way to try to choose better parameters. In this video, we'll look at how the squared error cost function is not an ideal cost function for logistic regression, and we'll take a look at a different cost function that can help us choose better parameters for logistic regression.

Here's what the training set for a logistic regression model might look like, where each row might correspond to a patient who was paying a visit to the doctor and wound up with some sort of diagnosis. As before, we'll use m to denote the number of training examples. Each training example has one or more features, such as the tumor size, the patient's age, and so on, for a total of n features, so let's call the features x1 through xn. Since this is a binary classification task, the target label y takes on only two values, either 0 or 1. And finally, the logistic regression model is defined by this equation. Okay, so the question you want to answer is: given this training set, how can you choose parameters w and b?

Recall that for linear regression, this is the squared error cost function. The only thing I've changed is that I put the one half inside the summation instead of outside the summation. You might remember that in the case of linear regression, where f of x is the linear function w dot x plus b, the cost function looks like this: it is a convex function, a bowl shape or a hammock shape. So gradient descent will look like this, where you take one step, one step, one step, and so on, to converge at the global minimum. Now, you could try to use the same cost function for logistic regression. But it turns out that if I were to write f of x equals 1 over 1 plus e to the negative of (w dot x plus b), and plot the cost function using this value of f of x, then the cost will look like this. This becomes what's called a non-convex cost function.
It's not convex. What this means is that if you were to try to use gradient descent, there are lots of local minima that you can get stuck in. So it turns out that for logistic regression, this squared error cost function is not a good choice. Instead, there will be a different cost function that can make the cost function convex again, so that gradient descent can be guaranteed to converge to the global minimum. Putting the one half inside the summation will make the math you see later on this slide a little bit simpler.

In order to build a new cost function, one that we'll use for logistic regression, I'm going to change a little bit the definition of the cost function J of w and b. In particular, if you look inside this summation, let's call this term inside the loss on a single training example. I'm going to denote the loss via this capital L, and it is a function of the prediction of the learning algorithm, f of x, as well as of the true label y. So the loss, given the prediction f of x and the true label y, is equal in this case to one half of the squared difference. We'll see shortly that by choosing a different form for this loss function, we'll be able to keep the overall cost function, which is 1 over m times the sum of these loss functions, a convex function.

Now, the loss function takes as input f of x and the true label y and tells us how well we're doing on that example. I'm going to just write down here the definition of the loss function we'll use for logistic regression. If the label y is equal to 1, then the loss is negative log of f of x, and if the label y is equal to 0, then the loss is negative log of 1 minus f of x.

Let's take a look at why this loss function hopefully makes sense. Let's first consider the case of y equals 1 and plot what this function looks like, to gain some intuition about what this loss function is doing. Remember, the loss function measures how well you're doing on one training example, and it is by summing up the losses on all of the training examples that you then get the cost function, which measures how well you're doing on the entire training set. So if you plot log of f, it looks like this curve here, where f is on the horizontal axis. A plot of negative log of f looks like this, where we just flip the curve along the horizontal axis. Notice that it intersects the horizontal axis at f equals 1 and continues downward from there.

Now, f is the output of logistic regression. Thus, f is always between 0 and 1, because the output of logistic regression is always between 0 and 1. The only part of the function that's relevant is therefore the part corresponding to f between 0 and 1. So let's zoom in and take a closer look at this part of the graph. If the algorithm predicts a probability close to 1 and the true label is 1, then the loss is very small. It's pretty much zero, because you're very close to the right answer.

Now, continuing with the example of the true label y being 1, say it really is a malignant tumor. If the algorithm predicts 0.5, then the loss is at this point here, which is a bit higher but not that high. In contrast, if the algorithm were to output 0.1, if it thinks that there's only a 10% chance of the tumor being malignant but y really is 1, it really is malignant, then the loss is this much higher value over here. So when y is equal to 1, the loss function incentivizes, or nudges, the algorithm to make more accurate predictions, because the loss is lowest when it predicts values close to 1. Now, on this slide, we've been looking at what the loss is when y is equal to 1.
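For reference, the two branches of the loss described in this video can be written as a single piecewise definition. The notation below is a reconstruction from the spoken description rather than a copy of the slide, and the base of the logarithm is taken to be e, which is an assumption.

```latex
L\bigl(f(x), y\bigr) =
\begin{cases}
  -\log\bigl(f(x)\bigr)     & \text{if } y = 1 \\
  -\log\bigl(1 - f(x)\bigr) & \text{if } y = 0
\end{cases}
```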
On this slide, let's look at the second part of the loss function, corresponding to when y is equal to 0. In this case, the loss is negative log of 1 minus f of x. When this function is plotted, it actually looks like this. The range of f is limited to 0 to 1 because logistic regression only outputs values between 0 and 1. If we zoom in, this is what it looks like. In this plot, corresponding to y equals 0, the vertical axis shows the value of the loss for different values of f of x. When f is 0 or very close to 0, the loss is also going to be very small, which means that if the true label is 0 and the model's prediction is very close to 0, well, you nearly got it right, so the loss is appropriately very close to 0.

The larger the value of f of x gets, the bigger the loss, because the prediction is further from the true label 0. In fact, as that prediction approaches 1, the loss actually approaches infinity. Going back to the tumor prediction example, this says that if a model predicts that the patient's tumor is almost certain to be malignant, say a 99.9% chance of malignancy, but it turns out to actually not be malignant, so y equals 0, then we penalize the model with a very high loss. So in this case of y equals 0, similar to the case of y equals 1 on the previous slide, the further the prediction f of x is from the true value of y, the higher the loss. In fact, on the previous slide, with y equal to 1, if f of x approaches 0, the loss grows really large and approaches infinity. So when the true label is 1, the algorithm is strongly incentivized not to predict something too close to 0.

So in this video, you saw why the squared error cost function doesn't work well for logistic regression. We also defined the loss for a single training example and came up with a new definition for the loss function for logistic regression. It turns out that with this choice of loss function, the overall cost function will be convex, and thus you can reliably use gradient descent to take you to the global minimum. Proving that this function is convex is beyond the scope of this course.

You may remember that the cost function is a function of the entire training set and is therefore the average, or 1 over m times the sum, of the loss function on the individual training examples. So the cost on a certain set of parameters w and b is equal to 1 over m times the sum over all the training examples of the loss on the training examples. If you can find the values of the parameters w and b that minimize this, then you'd have a pretty good set of values for the parameters w and b for logistic regression.

In the upcoming optional lab, you get to take a look at how the squared error cost function doesn't work very well for classification, because you'll see that the surface plot results in a very wiggly cost surface with many local minima. Then you'll take a look at the new logistic loss function, and as you can see here, this produces a nice and smooth convex surface plot that does not have all those local minima. So please take a look at the code and the plots after this video.

All right, so we've seen a lot in this video. In the next video, let's go back and take the loss function for a single training example and use that to define the overall cost function for the entire training set. We'll also figure out a simpler way to write out the cost function, which will then later allow us to run gradient descent to find good parameters for logistic regression. Let's go on to the next video.
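As a rough companion to the lab mentioned above, the sketch below builds the cost as the average of per-example losses and runs a simple midpoint test for convexity. A convex function never lies above the straight line joining two of its points, so a single violation is enough to show non-convexity. This is not the lab's code: the two-example training set, the function names, the choice to fix b at 0, and the use of the natural log are all assumptions made for illustration. The test shows a convexity violation only; it does not locate local minima.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error_loss(f, y):
    """One half of the squared difference between prediction and label."""
    return 0.5 * (f - y) ** 2

def logistic_loss(f, y):
    """-log(f) when y is 1, -log(1 - f) when y is 0."""
    return -math.log(f) if y == 1 else -math.log(1.0 - f)

# Hypothetical two-example training set with one feature each: (x, y).
TRAINING_SET = [(1.0, 1), (-1.0, 0)]

def cost(w, loss, b=0.0):
    """Average per-example loss: (1/m) * sum of loss(f(x), y)."""
    m = len(TRAINING_SET)
    return sum(loss(sigmoid(w * x + b), y) for x, y in TRAINING_SET) / m

def midpoint_above_chord(loss, w_left, w_right):
    """True if the cost at the midpoint exceeds the average of the
    endpoint costs, which cannot happen for a convex function."""
    w_mid = 0.5 * (w_left + w_right)
    return cost(w_mid, loss) > 0.5 * (cost(w_left, loss) + cost(w_right, loss))

print(midpoint_above_chord(squared_error_loss, -10.0, -2.0))  # True: not convex
print(midpoint_above_chord(logistic_loss, -10.0, -2.0))       # False: no violation here
```

A False result for the logistic loss on one pair of points is consistent with convexity but does not prove it. The proof, as noted above, is beyond the scope of the course.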