Forward and Backward Propagation (C1W4L06)

Implementing Forward and Backward Propagation for Deep Neural Networks

To implement forward propagation for a deep neural network, you start with the input data X (which you can think of as A0). Each layer l first computes its linear output Z[l] = W[l] \* A[l-1] + b[l], and then applies its activation function g[l], such as ReLU or sigmoid, to produce A[l] = g[l](Z[l]). This continues layer by layer through the network, caching each Z[l] (and, for convenience, W[l] and b[l]) along the way. The activation of the final layer, A[L], is the network's output, Yhat.
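
As a minimal sketch of this loop in numpy (assuming ReLU hidden layers, a sigmoid output layer, and a `parameters` dictionary holding W1, b1, ..., WL, bL; the function and variable names here are illustrative, not the lecture's own code):

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_propagation(X, parameters):
    """Forward pass with ReLU hidden layers and a sigmoid output layer.

    X has shape (n_x, m); parameters holds W1, b1, ..., WL, bL.
    Returns the output A_L (Yhat) and the caches needed for backprop."""
    L = len(parameters) // 2              # number of layers
    A = X                                 # A0 = X
    caches = []
    for l in range(1, L + 1):
        A_prev = A
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        Z = W @ A_prev + b                # linear step; b is broadcast across the m columns
        A = sigmoid(Z) if l == L else relu(Z)   # activation step
        caches.append((A_prev, W, b, Z))  # cached values reused during backprop
    return A, caches
```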

To implement backward propagation, you start from the loss function L, which compares the final-layer output Yhat with the target Y. For binary classification with a sigmoid output unit (as in logistic regression), the per-example cross-entropy loss can be written as:

L = -(Y \* log(Yhat) + (1-Y) \* log(1-Yhat))
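
Averaged over the M training examples, this gives the cost used for training. A short numpy sketch (assuming Yhat and Y are row vectors of shape (1, m)):

```python
import numpy as np

def compute_cost(Yhat, Y):
    """Average binary cross-entropy over the m examples; Yhat and Y have shape (1, m)."""
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(Yhat) + (1 - Y) * np.log(1 - Yhat)) / m
    return float(cost)
```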

To compute the derivatives of the loss with respect to each layer's parameters, you use backpropagation. The backward recursion starts at the final layer and works its way backwards through the network, computing at each layer the derivatives with respect to that layer's linear output (dZ[l]), its weights and biases (dW[l], db[l]), and the previous layer's activations (dA[l-1]).

At each layer l, the backward function takes dA[l] as input and outputs dA[l-1], dW[l], and db[l], using the values cached during the forward pass:

dZ[l] = dA[l] \* g[l]'(Z[l])   (element-wise product)

dW[l] = dZ[l] \* A[l-1]^T / M

db[l] = ∑(dZ[l]) / M   (summed across the M training examples)

dA[l-1] = W[l]^T \* dZ[l]

where A[l] is the activation (output) of layer l, Z[l] is its cached linear output, W[l] and b[l] are its weights and biases, and g[l]' is the derivative of its activation function.

For a three-layer network with a sigmoid output, unrolling this recursion gives:

dZ3 = A3 - Y   (the sigmoid derivative combined with dA3 simplifies to this)

dW3 = dZ3 \* A2^T / M

db3 = ∑(dZ3) / M

dA2 = W3^T \* dZ3

dZ2 = dA2 \* g2'(Z2)

dW2 = dZ2 \* A1^T / M

db2 = ∑(dZ2) / M

dA1 = W2^T \* dZ2

dZ1 = dA1 \* g1'(Z1)

dW1 = dZ1 \* X^T / M

db1 = ∑(dZ1) / M

where A1, A2, and A3 = Yhat are the activations of each layer, Z1, Z2, and Z3 are the cached linear outputs, and W1, W2, W3 and b1, b2, b3 are the weights and biases of each layer.
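
A compact numpy sketch of this backward pass, reusing the caches produced by the forward_propagation sketch above (again, the helper names are illustrative assumptions, not the lecture's own code):

```python
def relu_backward(dA, Z):
    return dA * (Z > 0)                     # dZ = dA * g'(Z) for ReLU

def backward_propagation(Yhat, Y, caches):
    """Backward pass matching forward_propagation above; returns the gradients dW[l], db[l]."""
    grads = {}
    m = Y.shape[1]
    L = len(caches)
    dZ = Yhat - Y                           # output layer: sigmoid + cross-entropy gives dZ[L] = A[L] - Y
    dA = None
    for l in reversed(range(1, L + 1)):
        A_prev, W, b, Z = caches[l - 1]
        if l < L:
            dZ = relu_backward(dA, Z)       # hidden layers: dZ[l] = dA[l] * g'(Z[l])
        grads["dW" + str(l)] = dZ @ A_prev.T / m
        grads["db" + str(l)] = np.sum(dZ, axis=1, keepdims=True) / m
        dA = W.T @ dZ                       # this is dA[l-1], used on the next iteration
    return grads
```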

For binary classification with a sigmoid output (as in logistic regression), the derivative of the loss with respect to the final activation is:

dA[L] = -Y/A[L] + (1-Y)/(1-A[L])

Combining this with the sigmoid derivative simplifies the first backward step to dZ[L] = A[L] - Y, i.e. Yhat - Y.

To implement backward propagation for a vectorized version of the network, you initialize the backward recursion with the matrix dA[L], obtained by stacking this expression across all M training examples as columns: dA[L] = -(Y / A[L]) + (1 - Y) / (1 - A[L]), where the division is element-wise. From there, the same per-layer backward equations carry the recursion all the way down to the first layer.
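
As a short numpy sketch of this vectorized initialization (assuming AL and Y are arrays of shape (1, m); the function name is illustrative):

```python
def init_backward(AL, Y):
    """dA[L] for binary cross-entropy with a sigmoid output; AL and Y have shape (1, m)."""
    return -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
```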

The key insight here is that when you are doing binary classification with a sigmoid output, the cross-entropy loss and the sigmoid activation combine so that the first backward step reduces to:

dZ[L] = Yhat - Y

which means you can start the backward recursion directly from the prediction error, rather than computing dA[L] explicitly and then applying the sigmoid derivative.

Initialized this way, a single backward pass through the network yields the derivatives dW[l] and db[l] for every layer. Because each step reuses the values cached during the forward pass, nothing has to be recomputed, which is what makes backpropagation efficient.

Overall, implementing forward and backward propagation for deep neural networks requires a good understanding of the underlying mathematics. Breaking the computation into a forward function and a backward function per layer, and working through the equations one layer at a time, makes this complex topic much more approachable if you are new to the subject.

Hyperparameters and Parameters in Deep Learning

One of the biggest challenges facing deep learning practitioners is managing hyperparameters and parameters. Hyperparameters are tunable settings you choose before training that control how the model learns, such as the learning rate, the number of iterations, the number of hidden layers and hidden units, or the batch size. Parameters, on the other hand, are the values the model learns during training: its weights and biases.

To organize hyperparameters and parameters effectively, you can use a variety of techniques. One approach is to define each hyperparameter and parameter with a clear name and description, making it easy to understand what each variable represents.

Another approach is to group related hyperparameters and parameters together into separate sections or modules. For example, you might have a section for the model's weights and biases, another for the learning rate schedule, and a third for the regularization strength.
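
As an illustrative sketch of this kind of grouping (the names and values here are hypothetical, not from the lecture), you might keep the hyperparameters in a small configuration structure that is separate from the learned parameters:

```python
# Hypothetical configuration: hyperparameters grouped by what they control.
config = {
    "architecture": {"layer_dims": [12288, 20, 7, 1], "hidden_activation": "relu"},
    "optimization": {"learning_rate": 0.0075, "num_iterations": 3000, "batch_size": 64},
    "regularization": {"l2_lambda": 0.1},
}

# The learned parameters (weights W1..WL and biases b1..bL) live in their own
# dictionary, initialized before training and updated by gradient descent.
```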

By organizing your hyperparameters and parameters in this way, you can make it easier to manage and tune different aspects of your model. This is especially important when working with large models or complex architectures, where managing multiple variables can be overwhelming.

Ultimately, the key to effective management of hyperparameters and parameters is to be intentional about how you structure your code and configuration files. By taking a thoughtful and systematic approach, you can ensure that your model is optimized for performance and accuracy.

"WEBVTTKind: captionsLanguage: enin the previous video you saw the basic blocks of implementing a deep neural network a for propagation step for each layer and a corresponding backward propagation step let's see how you can actually implement these steps will start to for propagation recall that what this will do is input a L minus 1 and output a L and the cash ZL and we just said that from implementational point of view maybe we'll cache WL + BL as well just to make the functions call a bit easier in the programming exercise and so the equations for this should already look familiar the way to implement a fourth function is just this equals WL x a l minus 1 plus B L and then a l equals the activation function applied to Z and if you want a vector rise implementation then it's just that times a L minus 1 plus B with the be adding beeping uh python for costing and a l equals he applied element wise to z and remember on the diagram for the forth step right we had this chain of bosses going forward so you initialize that with feeding and a 0 which is equal to X so you know you initialize this really what is the input to the first one right it's really um a zero which is the input features to either for one training example if you're doing one example at a time or a capital zero the entire training set if you are processing the entire training set so that's the input to the first fort function in the chain and then just repeating this allows you to compute forward propagation from left to right next let's talk about the backward propagation step here your goes to input D al and output D al minus 1 and D wo and DB let me just write out the steps you need to compute these things DZ l is equal to da l alamin Weis product with G of L prime Z of L and then computed derivatives DW l equals d ZL times AF l minus 1 i didn't explicitly put that in the cache but it turns out you need this as well and then DB l is equal to DZ l and finally da of L minus 1 there's equal to WL transpose times DZ l ok and I don't want to go through the detailed derivation for this but it turns out that if you take this definition the da and plug it in here then you get the same formula as we had in there previously for how you compute DZ l as a function of the previous DZ ow in fact well if I just plug that in here you end up that DZ L is equal to WL plus 1 transpose DZ l plus 1 times G L prime z FL I know this is a looks like a lot of algebra but actually double check for yourself that this is the equation we had written down for back propagation last week when we were doing in your network with just a single hidden layer and as reminder this times this element-wise product but so all you need is those four equations to implement your backward function and then finally I'll just write out the vectorized version so the first line becomes DZ l equals d a o element-wise product with GL prime of z oh maybe no surprise there DW l becomes 1 over m DZ l times a o minus 1 transpose and then DB l becomes 1 over m and Peter Som DZ L then access equals 1 keep dims equals true we talked about the use of an Peter Som in the previous week to compute DB and the finally da L minus 1 is WL transpose times D Z of L so this allows you to input this quantity da over here and output DW l DP l the derivatives you need as well as da L minus 1 right as follows so that's how you implement the backward function so just to summarize um take the input X you might have the first layer maybe has a rather activation function then go to the second layer 
maybe uses another value activation function goes to the third layer maybe has a sigmoid activation function if you're doing binary classification and this outputs y hat and then using Y hat you can compute the loss and this allows you to start your backward iteration I draw the arrows for us I guess I don't have to change pens too much where you were then have back prop compute the derivatives compute you know DW 3 DB 3 DW 2 DP 2 DW 1 DB 1 and along the way you would be computing I guess the cash would transfer Z 1 Z 2 Z 3 and here you pass backward da - and da one this could compute da zero but we won't use that so you can just discard that right and so this is how you implement for a prop and back prop for a three-layer your network now there's just one last detail that I didn't talk about which is for the forward recursion we would initialize it with the input data X how about the backward recursion well it turns out that D a of L when you're using logistic regression when you're doing binary classification is equal to Y over a plus 1 minus y over 1 minus a so turns out that the derivative of the loss function with respect to the output we're expected Y hat can be shown to be equal to dis if you're familiar with calculus if you look up the loss function L and take derivatives respect to Y hat and respect to a you can show that you get that formula so this is the formula that you should use for da for the final layer capital L and of course if you were to have a vectorized implementation then you initialize the backward recursion not with this there will be a capital A for the layer L which is going to be you know the same thing for the different examples right over a for the first training example plus 1 minus y for the first training example over 1 minus a for the first training example both on top down to the M training example so 1 minus a of them so that's how you to implement the vectorized version that's how you initialize the vectorized version of backpropagation so you've now seen the basic building blocks of both for propagation as well as back propagation now if you implement these equations you will get a correct implementation for prop and back prop to get you the derivatives you need you might be thinking wow those are all equations I'm slightly confused I'm not quite sure I see how this works and if you're feeling that way my advice is when you get to this week's programming you will be able to implement these for yourself and there'll be much more concrete and I know there was a lot of equations and maybe some of equations didn't make complete sense but if you work through the calculus and the linear algebra which is not easy so you know feel free to try but that's actually a bundle more difficult derivations in machine learning it turns out the equations roll down at just the calculus equations for computing the derivatives especially in background but once again if this feels a little bit abstract a little bit mysterious to you my advice is when you've done their prime exercise it will feel a bit more concrete to you although I have to say you know even today when I implement a learning algorithm sometimes even I'm surprised when my learning algorithm implementation works and it's because lot of the complexity of machine learning comes from the j-turn rather than from the lines of code so sometimes you feel like you implement a few lines of code not question what it did but there's almost magically works and because of all the magic is actually not in the piece of code 
you write which is often you know not too long it's not it's not exactly simple but there's not you know ten thousand a hundred thousand lines of code but you feed it so much data that sometimes even don't work the machine learning for a long time sometimes it's so you know surprises me a bit when my learning algorithm works because a lot of the complexity of your learning algorithm comes from the data rather than necessarily from your writing you know thousands and thousands of lines of code all right so that's um how you implement deep neural networks in the game this will become more concrete when you've done their primary exercise before moving on I want to discuss in the next video want to discuss hyper parameters and parameters it turns out that when you're training deep nets being able to organize your hyper parameters as well will help you be more efficient in developing your networks in the next video let's talk about exactly what that meansin the previous video you saw the basic blocks of implementing a deep neural network a for propagation step for each layer and a corresponding backward propagation step let's see how you can actually implement these steps will start to for propagation recall that what this will do is input a L minus 1 and output a L and the cash ZL and we just said that from implementational point of view maybe we'll cache WL + BL as well just to make the functions call a bit easier in the programming exercise and so the equations for this should already look familiar the way to implement a fourth function is just this equals WL x a l minus 1 plus B L and then a l equals the activation function applied to Z and if you want a vector rise implementation then it's just that times a L minus 1 plus B with the be adding beeping uh python for costing and a l equals he applied element wise to z and remember on the diagram for the forth step right we had this chain of bosses going forward so you initialize that with feeding and a 0 which is equal to X so you know you initialize this really what is the input to the first one right it's really um a zero which is the input features to either for one training example if you're doing one example at a time or a capital zero the entire training set if you are processing the entire training set so that's the input to the first fort function in the chain and then just repeating this allows you to compute forward propagation from left to right next let's talk about the backward propagation step here your goes to input D al and output D al minus 1 and D wo and DB let me just write out the steps you need to compute these things DZ l is equal to da l alamin Weis product with G of L prime Z of L and then computed derivatives DW l equals d ZL times AF l minus 1 i didn't explicitly put that in the cache but it turns out you need this as well and then DB l is equal to DZ l and finally da of L minus 1 there's equal to WL transpose times DZ l ok and I don't want to go through the detailed derivation for this but it turns out that if you take this definition the da and plug it in here then you get the same formula as we had in there previously for how you compute DZ l as a function of the previous DZ ow in fact well if I just plug that in here you end up that DZ L is equal to WL plus 1 transpose DZ l plus 1 times G L prime z FL I know this is a looks like a lot of algebra but actually double check for yourself that this is the equation we had written down for back propagation last week when we were doing in your network with just a single 
hidden layer and as reminder this times this element-wise product but so all you need is those four equations to implement your backward function and then finally I'll just write out the vectorized version so the first line becomes DZ l equals d a o element-wise product with GL prime of z oh maybe no surprise there DW l becomes 1 over m DZ l times a o minus 1 transpose and then DB l becomes 1 over m and Peter Som DZ L then access equals 1 keep dims equals true we talked about the use of an Peter Som in the previous week to compute DB and the finally da L minus 1 is WL transpose times D Z of L so this allows you to input this quantity da over here and output DW l DP l the derivatives you need as well as da L minus 1 right as follows so that's how you implement the backward function so just to summarize um take the input X you might have the first layer maybe has a rather activation function then go to the second layer maybe uses another value activation function goes to the third layer maybe has a sigmoid activation function if you're doing binary classification and this outputs y hat and then using Y hat you can compute the loss and this allows you to start your backward iteration I draw the arrows for us I guess I don't have to change pens too much where you were then have back prop compute the derivatives compute you know DW 3 DB 3 DW 2 DP 2 DW 1 DB 1 and along the way you would be computing I guess the cash would transfer Z 1 Z 2 Z 3 and here you pass backward da - and da one this could compute da zero but we won't use that so you can just discard that right and so this is how you implement for a prop and back prop for a three-layer your network now there's just one last detail that I didn't talk about which is for the forward recursion we would initialize it with the input data X how about the backward recursion well it turns out that D a of L when you're using logistic regression when you're doing binary classification is equal to Y over a plus 1 minus y over 1 minus a so turns out that the derivative of the loss function with respect to the output we're expected Y hat can be shown to be equal to dis if you're familiar with calculus if you look up the loss function L and take derivatives respect to Y hat and respect to a you can show that you get that formula so this is the formula that you should use for da for the final layer capital L and of course if you were to have a vectorized implementation then you initialize the backward recursion not with this there will be a capital A for the layer L which is going to be you know the same thing for the different examples right over a for the first training example plus 1 minus y for the first training example over 1 minus a for the first training example both on top down to the M training example so 1 minus a of them so that's how you to implement the vectorized version that's how you initialize the vectorized version of backpropagation so you've now seen the basic building blocks of both for propagation as well as back propagation now if you implement these equations you will get a correct implementation for prop and back prop to get you the derivatives you need you might be thinking wow those are all equations I'm slightly confused I'm not quite sure I see how this works and if you're feeling that way my advice is when you get to this week's programming you will be able to implement these for yourself and there'll be much more concrete and I know there was a lot of equations and maybe some of equations didn't make complete sense but if 
you work through the calculus and the linear algebra which is not easy so you know feel free to try but that's actually a bundle more difficult derivations in machine learning it turns out the equations roll down at just the calculus equations for computing the derivatives especially in background but once again if this feels a little bit abstract a little bit mysterious to you my advice is when you've done their prime exercise it will feel a bit more concrete to you although I have to say you know even today when I implement a learning algorithm sometimes even I'm surprised when my learning algorithm implementation works and it's because lot of the complexity of machine learning comes from the j-turn rather than from the lines of code so sometimes you feel like you implement a few lines of code not question what it did but there's almost magically works and because of all the magic is actually not in the piece of code you write which is often you know not too long it's not it's not exactly simple but there's not you know ten thousand a hundred thousand lines of code but you feed it so much data that sometimes even don't work the machine learning for a long time sometimes it's so you know surprises me a bit when my learning algorithm works because a lot of the complexity of your learning algorithm comes from the data rather than necessarily from your writing you know thousands and thousands of lines of code all right so that's um how you implement deep neural networks in the game this will become more concrete when you've done their primary exercise before moving on I want to discuss in the next video want to discuss hyper parameters and parameters it turns out that when you're training deep nets being able to organize your hyper parameters as well will help you be more efficient in developing your networks in the next video let's talk about exactly what that means\n"