Softmax Classification with Gradient Descent
The softmax function is a key component of multiclass classification algorithms, including softmax regression and neural networks with a softmax output layer. It produces a probability for each class, rather than just a binary 0/1 output. In this context, we want to minimize the loss between our model's predicted probabilities and the true labels.
The loss on a single training example is the cross-entropy, defined as L(yhat, y) = -sum_j y_j log(yhat_j), where y is the one-hot true label vector and yhat is the vector of predicted class probabilities. Because y has a 1 only at the true class, the sum collapses to -log(yhat_c), the negative log of the probability assigned to the correct class c. If your learning algorithm is trying to make this loss small, because you use gradient descent to reduce the loss on your training set, then the only way to do so is to make yhat_c as big as possible.
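To make that reduction concrete, here is a minimal sketch in plain NumPy; the label and probability values are illustrative, not fixed by the text above:

```python
import numpy as np

# One-hot ground-truth label: this example belongs to class 1.
y = np.array([0.0, 1.0, 0.0, 0.0])
# Predicted probabilities from a softmax output layer (illustrative values).
yhat = np.array([0.3, 0.2, 0.1, 0.4])

# Full cross-entropy sum: every term with y_j = 0 vanishes...
loss = -np.sum(y * np.log(yhat))
# ...so the loss equals -log of the probability assigned to the true class.
assert np.isclose(loss, -np.log(yhat[1]))
print(loss)  # ~1.61, a large loss: only 20% probability was given to the true class
```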
Since yhat_c is a probability it can never be bigger than 1, and it is the softmax function applied to the output layer that guarantees we get valid probabilities in the first place: instead of a single raw score per class, we get a vector of probabilities (four of them in our four-class example), one per class, that sum to 1. The softmax function is defined as yhat_j = e^(z_j) / sum_i e^(z_i), where z_j is the jth element of the output layer's vector z and the sum in the denominator runs over all C classes.
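As a concrete sketch, here is a minimal NumPy implementation of the softmax function; the subtraction of the maximum is an added numerical-stability detail, not part of the formula itself, and the score vector is illustrative:

```python
import numpy as np

def softmax(z):
    """Map a vector of raw scores z to probabilities that sum to 1."""
    # Subtracting max(z) avoids overflow in exp(); it does not change the result,
    # because softmax is invariant to adding a constant to every element of z.
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

z = np.array([5.0, 2.0, -1.0, 3.0])  # illustrative 4-by-1 vector of output scores
yhat = softmax(z)
print(yhat)        # the largest score (5) receives the largest probability
print(yhat.sum())  # 1.0
```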
The cost J on the entire training set is obtained by summing (in practice, averaging) the per-example losses over all m training examples, and we minimize this cost using gradient descent. The key step of gradient descent is computing derivatives, and for a softmax output layer with the cross-entropy loss the derivative with respect to the output layer's pre-activation turns out to be remarkably simple: dZ = yhat - y.
Notice that yhat, y, and dZ are all 4-by-1 vectors when you have four classes, and C-by-1 in the general case. By our usual definition of dZ, this quantity is the partial derivative of the loss with respect to z^[L], the pre-activation of the output layer.
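For readers who want the missing step, here is a short derivation (a sketch, using only the definitions above) of why the derivative takes this simple form:

```latex
% Why dZ = yhat - y for a softmax output with cross-entropy loss.
\[
\mathcal{L}(\hat{y}, y) = -\sum_{k=1}^{C} y_k \log \hat{y}_k,
\qquad
\hat{y}_k = \frac{e^{z_k}}{\sum_{i=1}^{C} e^{z_i}},
\qquad
\frac{\partial \hat{y}_k}{\partial z_j} = \hat{y}_k\,(\delta_{kj} - \hat{y}_j).
\]
\[
\frac{\partial \mathcal{L}}{\partial z_j}
 = -\sum_{k} \frac{y_k}{\hat{y}_k}\,\hat{y}_k\,(\delta_{kj} - \hat{y}_j)
 = \hat{y}_j \sum_{k} y_k \;-\; y_j
 = \hat{y}_j - y_j,
\]
% using the fact that the one-hot label satisfies \sum_k y_k = 1.
```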
The gradient descent algorithm works as follows: we initialize all weights randomly, compute our predictions in a forward pass, calculate the cross-entropy loss on the training examples, and compute the derivatives needed for backpropagation. We then update each weight by subtracting the learning rate times its derivative. This process is repeated for many iterations until convergence.
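Here is a minimal sketch of those steps as a batch gradient descent loop for softmax regression (a single softmax layer) in NumPy; the names X, Y, alpha, and num_iterations are illustrative choices, not anything fixed by the text above:

```python
import numpy as np

def train_softmax_regression(X, Y, alpha=0.1, num_iterations=1000):
    """Batch gradient descent for a single softmax layer.

    X: (n_features, m) inputs, Y: (C, m) one-hot labels, alpha: learning rate.
    """
    n, m = X.shape
    C = Y.shape[0]
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(C, n))   # random initialization
    b = np.zeros((C, 1))

    for _ in range(num_iterations):
        # Forward pass: linear scores, then column-wise softmax.
        Z = W @ X + b                                    # (C, m)
        T = np.exp(Z - Z.max(axis=0, keepdims=True))
        Yhat = T / T.sum(axis=0, keepdims=True)          # (C, m)

        cost = -np.sum(Y * np.log(Yhat)) / m             # average cross-entropy

        # Backward pass: dZ = Yhat - Y, then weight and bias gradients.
        dZ = Yhat - Y
        dW = dZ @ X.T / m
        db = dZ.sum(axis=1, keepdims=True) / m

        # Update: subtract the learning rate times each derivative.
        W -= alpha * dW
        b -= alpha * db
    return W, b
```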
One additional implementational detail worth noting: in a vectorized implementation, the matrix capital Y is y^(1), y^(2), ..., y^(m) stacked horizontally as columns, where m is the number of training examples, so Y is a C-by-m matrix (4-by-m with four classes). Similarly, Yhat stacks yhat^(1) through yhat^(m) horizontally and is also C-by-m. Each column then corresponds to one training example, and each row to one class.
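As a small illustration of these shapes, here is a sketch that builds the C-by-m matrix Y from integer class labels; the label values themselves are hypothetical:

```python
import numpy as np

labels = np.array([1, 2, 3])   # hypothetical class indices for m = 3 examples
C, m = 4, labels.shape[0]

# Stack the one-hot columns y(1), ..., y(m) horizontally into a C-by-m matrix.
Y = np.zeros((C, m))
Y[labels, np.arange(m)] = 1.0

print(Y.shape)   # (4, 3): one row per class, one column per training example
print(Y[:, 0])   # [0. 1. 0. 0.]: the first example's one-hot label
```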
Now let's take a look at how you implement gradient descent when you have a softmax output layer in a neural network. The output layer computes Z^[L], which is C-by-1, where C is the number of classes. We apply the softmax activation function to get a^[L], also written yhat, and that in turn allows us to compute the loss.
The key step of the backpropagation algorithm is initializing the backward pass with the derivative at the output layer, which turns out to be the same expression as before: dZ^[L] = yhat - y. Again, these are 4-by-1 vectors when you have four classes, and C-by-1 in the general case.
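Here is a minimal vectorized sketch of that initialization for a network's output layer; the argument names (Yhat, Y, A_prev, W) and shapes are assumptions consistent with the conventions above, not code from the exercise:

```python
import numpy as np

def output_layer_backprop(Yhat, Y, A_prev, W):
    """Start backprop at a softmax output layer.

    Yhat, Y: (C, m) predictions and one-hot labels; A_prev: (n_prev, m)
    activations feeding the output layer; W: (C, n_prev) output-layer weights.
    """
    m = Y.shape[1]
    dZ = Yhat - Y                            # the key expression, shape (C, m)
    dW = dZ @ A_prev.T / m                   # gradient w.r.t. the output weights
    db = dZ.sum(axis=1, keepdims=True) / m   # gradient w.r.t. the output biases
    dA_prev = W.T @ dZ                       # passed backward to earlier layers
    return dW, db, dA_prev
```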
This formula will work fine if you ever need to implement softmax regression or softmax classification from scratch. For the purposes of this week's programming exercise, however, we'll start using one of the deep learning programming frameworks: as long as you specify the forward pass correctly, the framework will take care of computing the derivatives needed for backpropagation.
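As an example of what that looks like in practice, here is a minimal sketch using PyTorch as one such framework (the course exercise may use a different one); nn.CrossEntropyLoss applies the softmax and cross-entropy together, and loss.backward() computes all derivatives automatically, so you only write the forward pass:

```python
import torch
import torch.nn as nn

# Hypothetical tiny setup: 10 input features, 4 classes, 8 examples per batch.
model = nn.Linear(10, 4)                       # produces the raw class scores (logits)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                # softmax + cross-entropy in one op

x = torch.randn(8, 10)                         # illustrative inputs
y = torch.randint(0, 4, (8,))                  # illustrative integer class labels

for _ in range(100):
    logits = model(x)                          # forward pass is all we specify
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()                            # the framework handles backprop
    optimizer.step()                           # gradient descent update
```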
The final part of implementing gradient descent is the parameter update itself: each weight (and bias) is updated by subtracting the learning rate times its computed derivative, for example W := W - alpha * dW.
This process is repeated many times until convergence. Now that you've learned how to implement gradient descent with softmax classification, let's move on to the deep learning programming frameworks, which can make implementing deep learning algorithms much more efficient.
"WEBVTTKind: captionsLanguage: enin the last video you learn about the softmax there in the softmax activation function in this video you deepen your understanding of softmax classification and also learn how to train a model that uses a soft mask layer recall our earlier example where the open layer computes 0 as follows so there are four classes sequels for then zeros can be 4 by 1 dimensional vector and we said we compute T which is this temporary variable that performs element wise exponentiation and then finally if the activation function for your output layer G of L is the softmax activation function then the output will be this basically taking the temporary variable T and normalizing it to sum to 1 so this then becomes a of L so you notice that in the Z vector the biggest element was 5 and the biggest probability ends up being dispersed probability the name soft mass comes from contrasting it to what's called a hard max which would have taken the vector Z and mapped it to this vector so hard max function will look at the elements of Z and just put a 1 in the position of the biggest elements of Z and then zeros everywhere else and so there is a very hard max where the bigger element gets a output of 1 and everything else gives an output of 0 whereas in contrast the soft max is a more gentle mapping from Z to these probabilities so I'm not sure if this is a great name but that Lisa does the intuition behind why we call it a soft max always in contrast to the hard max and one thing I did really show but as alluded to is that softmax regression or the softmax activation function generalizes the logistics activation function to see constants rather than just two courses and it turns out that if C is equal to 2 then soft max with C equals to 2 essentially reduces to logistic regression and I'm not prove this in this video but the rough outline for the prove is that is C equals to two and if you apply softmax then the output layer al well I'll put two numbers the C equals two so maybe it outputs 0.842 and 0.158 right these two numbers always happens on to one and because these two numbers always at the center one they're actually redundant and maybe don't need to bother to compute two of them maybe we just need to compute one of them and it turns out that the way you end up computing that number reduces to the way that logistic regression is computing this single whole point so that wasn't much of a proof but the takeaway from this is that top X regression is a generalization of logistic regression to more than two classes now let's look at how you would actually train a neural network with a soft max output layer so in particular let's define the loss function you use to Train in your network let's take an example let's say you have an example in your training sets where the output is where the target output the grouchy label is zero one zero zero so the example from the previous video this means that this is an image of a cat because it falls into plus one and now let's say that your new network is currently outputting y hat equals so Y hat will be a vector of probabilities goes sum to one or point one zero point four so you can check that sum to 1 and this is going to be K L so the new net was not doing very well in this example because there's actually cat to miss and only a 20% chance that this is a cat so didn't do very well in this example so what's the last function you want to use to train this new network in softmax qualification the last we typically use is an extra sum of J 
equals 1 to 4 and it's really sum from 1 for C in the general case and I just use for here of Y log or YJ log Y hat ok so let's look at our single example above to better understand what happens notice that in this example y 1 equals y 3 equals y 4 equals 0 because those are zeros and only y 2 is equal to 1 so if you look at this summation all the terms with 0 values of YJ u equal to 0 and the only term you're left with is negative Y 2 log Y hat 2 because when you sum over the indices of J all the terms will end up 0 except when J is equal to 2 and because y 2 is equal to 1 this is just negative log Y hat - so what this means is that if your learning algorithm is trying to make this small because use gradient descent to you know try to reduce the loss on your training set then the only way to make this small is to make this small and the only way to do that is to make Y hat to as big as possible and these are probabilities so it can never be bigger than 1 but this kind of makes sense because if X for this example is a picture of a cat then you want to that output probability to be as big as possible so more generally what does loss function does is it looks at whatever is the ground truth class in your training set and it tries to meet the corresponding probability of that Clause as high as possible if you're familiar with maximum likelihood estimation statistics this turns out to be a form with maximum likely estimation but if you don't know what that means don't worry about it the intuition we just talked about will suffice now this is the loss on a single training example how about the cost J on the entire training set so the cost of the setting the parameters you know and so on of all the ways and biases you define that as pretty much what you guess some will be on Thai training set of the loss your learning algorithms predictions summed over your training examples and so what you do is use gradient descent in order to try to minimize this cost finally one more implementational detail notice is that would because sequel is equal to 4 y is a 4 by 1 vector and Y hat is also a 4 by 1 so if you are using a vectorized implementation the matrix capital y is going to be y 1 y 2 to y em stacked horizontally and so for example if this example up here is your first training example then the first column of this matrix Y will be 0 1 0 0 and then your second example in the second example is a dog who the third example is a none of the above and so on and then this matrix capital y will end up being a 4 by M dimensional matrix and similarly Y hat will be y hat one stack up horizontally going through Y hat M so this is actually why I had one or the output on the first training example then so I had to read this open 3 0.2 0.1 0.4 and so on and why hide yourself will also be for my M dimensional matrix finally let's take a look at how you implement gradient descent when you have a soft max output layer so this output layer will compute Z L which is C by 1 run our example 4 by 1 and then you apply the softmax activation function to get a L or Y hat and then that in turn allows you to compute the loss so we've talked about how to implement the forward propagation step of the neural network to get these outputs and to compute that loss how about the back propagation step or gradient descent turns out that the key step of the queue equation you need to initialize back prop is this expression that the derivative with respect to Z at the last layer this turns out you can compute this Y hat the 4 by 1 
vector minus y the 4.1 Vestas you notice that all of these are going to be four by one vectors when you have four classes and si by one in a more general case and so this screen by our usual definition of what is DZ this is the partial derivative of the cost function with respect to ZL if you're an expert in calculus you can derive this yourself or if you explain calculus you can try to divide this yourself that using this formula will also just work fine if you ever need to enter in this from scratch both this you can then compute D ZL and then sort of start off the background process to compute all the derivatives you need throughout your neural network but it turns out that in this week's programming exercise we'll start to use one of the deep learning programming frameworks and for those foreign frameworks usually it turns out you just need to focus on getting the for profit right and so long as you specify the program where the four top parts the pruning framework will figure out how to do back prop or how to do the backward pass for you so this expression is worth keep in mind for if you ever need to implement softmax regression or soft by classification from scratch although you won't actually need this industries from exercise because the programming framework you use will take care of this derivative computation for you so that's it for soft max classification with it you can now implement learning algorithms to catalyze the inference into not just one of two classes but one of the different classes next I want to show you some of the deep learning programming frameworks which can meet you much more efficient in terms of implementing deep learning algorithms let's go onto the next video to discuss thatin the last video you learn about the softmax there in the softmax activation function in this video you deepen your understanding of softmax classification and also learn how to train a model that uses a soft mask layer recall our earlier example where the open layer computes 0 as follows so there are four classes sequels for then zeros can be 4 by 1 dimensional vector and we said we compute T which is this temporary variable that performs element wise exponentiation and then finally if the activation function for your output layer G of L is the softmax activation function then the output will be this basically taking the temporary variable T and normalizing it to sum to 1 so this then becomes a of L so you notice that in the Z vector the biggest element was 5 and the biggest probability ends up being dispersed probability the name soft mass comes from contrasting it to what's called a hard max which would have taken the vector Z and mapped it to this vector so hard max function will look at the elements of Z and just put a 1 in the position of the biggest elements of Z and then zeros everywhere else and so there is a very hard max where the bigger element gets a output of 1 and everything else gives an output of 0 whereas in contrast the soft max is a more gentle mapping from Z to these probabilities so I'm not sure if this is a great name but that Lisa does the intuition behind why we call it a soft max always in contrast to the hard max and one thing I did really show but as alluded to is that softmax regression or the softmax activation function generalizes the logistics activation function to see constants rather than just two courses and it turns out that if C is equal to 2 then soft max with C equals to 2 essentially reduces to logistic regression and I'm not prove this in 
this video but the rough outline for the prove is that is C equals to two and if you apply softmax then the output layer al well I'll put two numbers the C equals two so maybe it outputs 0.842 and 0.158 right these two numbers always happens on to one and because these two numbers always at the center one they're actually redundant and maybe don't need to bother to compute two of them maybe we just need to compute one of them and it turns out that the way you end up computing that number reduces to the way that logistic regression is computing this single whole point so that wasn't much of a proof but the takeaway from this is that top X regression is a generalization of logistic regression to more than two classes now let's look at how you would actually train a neural network with a soft max output layer so in particular let's define the loss function you use to Train in your network let's take an example let's say you have an example in your training sets where the output is where the target output the grouchy label is zero one zero zero so the example from the previous video this means that this is an image of a cat because it falls into plus one and now let's say that your new network is currently outputting y hat equals so Y hat will be a vector of probabilities goes sum to one or point one zero point four so you can check that sum to 1 and this is going to be K L so the new net was not doing very well in this example because there's actually cat to miss and only a 20% chance that this is a cat so didn't do very well in this example so what's the last function you want to use to train this new network in softmax qualification the last we typically use is an extra sum of J equals 1 to 4 and it's really sum from 1 for C in the general case and I just use for here of Y log or YJ log Y hat ok so let's look at our single example above to better understand what happens notice that in this example y 1 equals y 3 equals y 4 equals 0 because those are zeros and only y 2 is equal to 1 so if you look at this summation all the terms with 0 values of YJ u equal to 0 and the only term you're left with is negative Y 2 log Y hat 2 because when you sum over the indices of J all the terms will end up 0 except when J is equal to 2 and because y 2 is equal to 1 this is just negative log Y hat - so what this means is that if your learning algorithm is trying to make this small because use gradient descent to you know try to reduce the loss on your training set then the only way to make this small is to make this small and the only way to do that is to make Y hat to as big as possible and these are probabilities so it can never be bigger than 1 but this kind of makes sense because if X for this example is a picture of a cat then you want to that output probability to be as big as possible so more generally what does loss function does is it looks at whatever is the ground truth class in your training set and it tries to meet the corresponding probability of that Clause as high as possible if you're familiar with maximum likelihood estimation statistics this turns out to be a form with maximum likely estimation but if you don't know what that means don't worry about it the intuition we just talked about will suffice now this is the loss on a single training example how about the cost J on the entire training set so the cost of the setting the parameters you know and so on of all the ways and biases you define that as pretty much what you guess some will be on Thai training set of the loss your learning 
algorithms predictions summed over your training examples and so what you do is use gradient descent in order to try to minimize this cost finally one more implementational detail notice is that would because sequel is equal to 4 y is a 4 by 1 vector and Y hat is also a 4 by 1 so if you are using a vectorized implementation the matrix capital y is going to be y 1 y 2 to y em stacked horizontally and so for example if this example up here is your first training example then the first column of this matrix Y will be 0 1 0 0 and then your second example in the second example is a dog who the third example is a none of the above and so on and then this matrix capital y will end up being a 4 by M dimensional matrix and similarly Y hat will be y hat one stack up horizontally going through Y hat M so this is actually why I had one or the output on the first training example then so I had to read this open 3 0.2 0.1 0.4 and so on and why hide yourself will also be for my M dimensional matrix finally let's take a look at how you implement gradient descent when you have a soft max output layer so this output layer will compute Z L which is C by 1 run our example 4 by 1 and then you apply the softmax activation function to get a L or Y hat and then that in turn allows you to compute the loss so we've talked about how to implement the forward propagation step of the neural network to get these outputs and to compute that loss how about the back propagation step or gradient descent turns out that the key step of the queue equation you need to initialize back prop is this expression that the derivative with respect to Z at the last layer this turns out you can compute this Y hat the 4 by 1 vector minus y the 4.1 Vestas you notice that all of these are going to be four by one vectors when you have four classes and si by one in a more general case and so this screen by our usual definition of what is DZ this is the partial derivative of the cost function with respect to ZL if you're an expert in calculus you can derive this yourself or if you explain calculus you can try to divide this yourself that using this formula will also just work fine if you ever need to enter in this from scratch both this you can then compute D ZL and then sort of start off the background process to compute all the derivatives you need throughout your neural network but it turns out that in this week's programming exercise we'll start to use one of the deep learning programming frameworks and for those foreign frameworks usually it turns out you just need to focus on getting the for profit right and so long as you specify the program where the four top parts the pruning framework will figure out how to do back prop or how to do the backward pass for you so this expression is worth keep in mind for if you ever need to implement softmax regression or soft by classification from scratch although you won't actually need this industries from exercise because the programming framework you use will take care of this derivative computation for you so that's it for soft max classification with it you can now implement learning algorithms to catalyze the inference into not just one of two classes but one of the different classes next I want to show you some of the deep learning programming frameworks which can meet you much more efficient in terms of implementing deep learning algorithms let's go onto the next video to discuss that\n"