**Understanding Deep Learning and Optimization**
In deep learning, training a model means solving an optimization problem: find the parameters that minimize a loss function. Here we consider a standard fully connected network with a k×k weight matrix W connecting layer i to layer i+1, and we want to learn W so that, in addition to minimizing the loss, it satisfies an extra requirement on its structure.
**Adding Constraints to the Optimization Problem**
One such requirement is that W be an orthogonal matrix: every pair of rows should be orthogonal unit vectors, and the same should hold for every pair of columns. In matrix form this means W Wᵀ = I (rows orthonormal) and Wᵀ W = I (columns orthonormal), or equivalently W Wᵀ − I = 0 and Wᵀ W − I = 0, where I is the k×k identity matrix and 0 is the k×k zero matrix.
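To make the condition concrete, here is a minimal numpy sketch; the 2×2 rotation matrix below is just an illustrative example of an orthogonal matrix, not part of the original discussion:

```python
import numpy as np

# A 2x2 rotation matrix is orthogonal: its rows and columns are orthonormal.
theta = np.pi / 4
W = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

I = np.eye(2)
print(np.allclose(W @ W.T, I))  # True: every pair of rows is orthonormal
print(np.allclose(W.T @ W, I))  # True: every pair of columns is orthonormal
```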
**Using the Frobenius Norm**
The loss is a scalar, so we cannot place a matrix equation like W Wᵀ − I = 0 directly into the objective. Instead we use the Frobenius norm, which measures the magnitude of a matrix: its square is the sum of the squares of all the matrix's elements, and it equals zero only when every element is zero. The two matrix constraints therefore become two scalar conditions: the squared Frobenius norm of W Wᵀ − I should be zero, and likewise for Wᵀ W − I.
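As a sketch, the squared-Frobenius-norm penalty can be computed directly from W; the function name `orthogonality_penalty` is illustrative and assumes PyTorch tensors:

```python
import torch

def orthogonality_penalty(W: torch.Tensor) -> torch.Tensor:
    """Return ||W W^T - I||_F^2 + ||W^T W - I||_F^2 for a square weight matrix W."""
    k = W.shape[0]
    I = torch.eye(k, device=W.device, dtype=W.dtype)
    row_term = torch.sum((W @ W.T - I) ** 2)  # squared Frobenius norm of W W^T - I
    col_term = torch.sum((W.T @ W - I) ** 2)  # squared Frobenius norm of W^T W - I
    return row_term + col_term
```

Both terms are zero exactly when W is orthogonal, so driving this penalty toward zero enforces the constraint.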
**Hyperparameter Lambda**
To enforce this during training, we add the two squared Frobenius norms to the loss, weighted by a hyperparameter lambda that tunes the strength of the constraint. The larger lambda is, the harder the optimizer is pushed to drive these penalty terms toward zero, i.e., to make W closer to orthogonal, while still minimizing the original loss. This is a simple way to add a constraint to the optimization problem as a soft penalty.
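Putting the pieces together, the modified optimization problem (with L(θ) denoting whatever task loss the model already minimizes, and θ all model parameters, of which W is a part) is:

```latex
\min_{\theta}\; \mathcal{L}(\theta)
  \;+\; \lambda \Big( \lVert W W^{\top} - I \rVert_F^{2}
  \;+\; \lVert W^{\top} W - I \rVert_F^{2} \Big)
```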
**Mixing Linear Algebra and Optimization**
The solution mixes basic linear algebra with optimization. Matrix multiplication and transposition turn the row- and column-orthonormality requirements into the conditions W Wᵀ = I and Wᵀ W = I, and the Frobenius norm turns those matrix conditions into scalar penalty terms that fit into the loss that gradient-based training (backpropagation) already minimizes. Combining these ideas gives a simple recipe for training deep learning models whose weights must satisfy specific constraints.
**Applying to Deep Learning Algorithms**
The approach works with standard deep learning setups regardless of the task loss: cross-entropy for classification, squared loss for regression, and so on. Whatever loss the model already uses, the Frobenius-norm penalty weighted by lambda is simply added to it, so the same backpropagation machinery trains the model while nudging W toward the desired structure.
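For illustration, a PyTorch training-step sketch might look like the following; the architecture, the layer index, the data, and the value of lambda are hypothetical stand-ins, not part of the original discussion:

```python
import torch
import torch.nn as nn

# Hypothetical classifier; the penalty targets the square 64x64 hidden weight matrix.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 10))
W_layer = model[2]                       # the 64x64 layer whose weights we constrain
criterion = nn.CrossEntropyLoss()        # task loss (classification)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lam = 1e-2                               # hyperparameter lambda, tuned like any other

def train_step(x, y):
    optimizer.zero_grad()
    task_loss = criterion(model(x), y)
    W = W_layer.weight
    I = torch.eye(W.shape[0], device=W.device)
    penalty = torch.sum((W @ W.T - I) ** 2) + torch.sum((W.T @ W - I) ** 2)
    loss = task_loss + lam * penalty     # minimize loss while pushing W toward orthogonality
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random data:
x = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))
print(train_step(x, y))
```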
**Conclusion**
In conclusion, adding a constraint such as orthogonality to a deep learning model comes down to mixing basic linear algebra with the optimization problem we already solve. By penalizing the squared Frobenius norms of W Wᵀ − I and Wᵀ W − I with a hyperparameter lambda, we can push the weight matrix between two layers toward the desired structure while still minimizing the loss. The same soft-penalty idea extends to other requirements one might want to place on a model's weights, making it a broadly useful tool.
"WEBVTTKind: captionsLanguage: enhere is a very interesting interview problem from deep learning and this problem involves deep learning and some linear algebra right so first let me explain the problem and i would like you to think about the solution and then i will go ahead and solve it for you so imagine you have a neural network let's assume it's a standard fully connected right so dense network right so you have some inputs and then let's assume there are two layers of neurons here let's assume this is layer i and this is layer i plus 1 obviously there are some layers before li and there are some layers after li plus 1 also okay i'm just i'm just let's assume there are two layers let's say you meet standard feed forward fully connected densely connected neural network so imagine if there are k activation units here and there are k activation units here let's just assume for simplicity right then there is a weight matrix w here which is a k cross k matrix right so imagine imagine this is my x right goes through multiple layers again one of the layers here has this weight matrix w and then there are multiple layers and then there is loss here obviously right so that i mean there are more layers here also we are just showing with respect to one layer only now in this neural network what we want this neural network to achieve is of course you will you want to minimize loss just like any other neural network but in addition to it we want this w matrix which is basically the matrix of all the connection weights between layer i and layer i plus 1 we want this w matrix to have some properties so look at this w this w has let's assume some columns column 1 column 2 column 3 so on so forth and a bunch of rows right since this is a k cross k matrix it will have k rows and k columns what we want this square matrix this is a square matrix right we want the square matrix to be orthogonal which means we want each of these rows so we want any if you pick any two rows here we want these rows any pair of rows to be perpendicular to each other or orthogonal to each other so we want ci to be perpendicular to cj where i is not equals to j and we want each column vector to be a unit vector similarly we want every row if you pick any two row vectors here right if you pick any two row vectors here we want ri to be perpendicular or orthogonal to r j when i is not equals to j obviously when i is not equals to j and we want each of the row vectors to be also unit vectors right so this is the constraint that we want we want this square matrix w which is the connection which is the which is a matrix of all connections between layer i and layer i plus 1. 
we want this to be an orthogonal matrix such that every pair of columns are orthogonal and unit vectors and every pair of rows are orthogonal and unit vectors this is the constraint we want now the question here is now how do you train this neural network of course in this neural network you would want to minimize your loss but in addition to minimizing loss you want to satisfy this constraint or this requirement that this w should be an orthogonal matrix satisfying all these constraints now how do you train such a such a model how do you so when you train a model how do you ensure that w follows these requirements that we are asking of it while also minimizing loss okay so i i hope this question is clear so please pause this video here and try to think of a solution on how you will train this neural network to ensure that you are minimizing the loss as well as ensuring that w satisfies the constraints that we want so i'm going to assume that you've put in some time so let's just use some basic linear algebra and then we'll change the optimization problem that we solve when when we're trying to when we're trying to find the weights of a neural network right so we'll change the optimization problem itself cool so first and foremost what is given to us that first let's look at it from the perspective of the rows of w right we want if we want wi to be perpendicular to w j let's assume w i is the i throw okay this is the notation that i am going to use when i say wi it means the i throw when i say w j it is the j throw right so we want wi to be perpendicular to w j whenever i is not equals to j right what does it what does that mean we what does it mean intuitively from a linear algebraic standpoint it means we want the dot product between w i and w j to be 0 whenever i is not equals to j very simple but at the same time we want each wi to also be a unit vector if you look here we want each row to be perpendicular to other rows but each row we want it to be a unit vector also which means wi dot wi we want it to be equal to 1 right this is the definition which satisfies that w i is a unit vector right so we want wi's to satisfy these two requirements right so if if you try to write this in one formula what you get here is w i dot w j the dot product should be equals to 1 if i equals to j or 0 otherwise right so this is what we want to satisfy now this is the vector notation right so wi is a vector wj is a vector here now if you write it in matrix form let's take w w is the matrix that we construct using w i's as rows right so this is what you have this is your w1 this is your w2 right so on so forth your wk right that's a matrix that you have now if if you do a dot product right if you do a product of w with w transpose this is your w so w transpose will have w 1 as column right what will w transpose look like it will have w1 as the column w2 as the second column so on so forth wk as the as the kth column now when you do this product what do you want so if you want to take this take this requirement into consideration w 1 into w 1 you want it to be 1. 
similarly w 2 into w 2 you want it to be 1 w 3 into w 3 you want it to be 1 and so on and so forth but w 1 into w 2 you want it to be 0 so what you'll construct here if you do a matrix multiplication between w and w transpose what you want to achieve is basically all the diagonal elements to be 1 all the non-diagonal elements to be 0 and such a matrix is called as an identity matrix right all the diagonal elements are 1 all the non-diagonal elements are 0. so what do we want we want w dot w transpose to be equals to the identity matrix of course this identity matrix is a k cross k identity matrix right so this the requirements that are being asked of us for the rows if you want each ri to be perpendicular to r j and if you want each ri to be unit vector that is equal to saying that we want w dot w transpose equals to i this is what we want to achieve again another way to write this is we want w w transpose minus i equals to 0 this is what we want right similarly just like the way we have these requirements for rows we have the requirements for columns also so now now look at this what are columns of w columns of w are nothing but rows of w transpose if you think about it if i just create a new matrix called w transpose right columns of w are nothing but rows of w transpose right so whatever logic we have constructed here same logic we can construct but instead of using w we'll use the matrix w transpose right so now what is w transpose i w transpose i is the i throw of w transpose right which is nothing but the eighth column of w look at this w transpose i here is the ith row of w transpose either of w transpose is nothing but the ith column of w similarly we want w transpose i dot product w transpose j should be equals to 0 when i is not equals to j and w transpose i dot w transpose i should be equals to 1 same logic same logic that we have applied for rows we are now just applying for columns and mathematically the columns of w are same as the rows of w transpose which means again using the same logic that we have applied here if you just apply it on columns what you want now is we want this matrix w transpose multiplied by matrix w to be equal to the identity matrix k cross k so what do we want this this requirement that we have on columns can be written as we want w transpose multiplied by w minus i should be equals to 0 right so look at look at the two constraints we have this is w dot w transpose minus i equals to 0 this is one requirement the other requirement here is w transpose w minus i also should be equals to zero right so when we try to fit the weights right when we try to find when we when we use back propagation as part of deep learning to estimate these weights right by minimizing the loss and by performing the back propagation by using the back propagation algorithm we want w to satisfy this requirement and we want w to also satisfy this now how do we now force your back propagation algorithm or your optimization problem to achieve this it's very simple look at this let's see you theta is all the parameters that you have in your model right so t remember w is just if you notice this if you notice this model w is just the weights between two layers it is basically the weights between these two layers only right all the weights all the weights that you have here but there will be other weights and other parameters that you have in layers before this and layers after that right let's just assume theta is all the parameters that you have so theta is all the parameters that you 
have in your deep learning model obviously w is part of theta right so what are we trying to minimize we are trying to find the optimal theta or in other words we are trying to find the optimal parameters of the model such that it minimizes loss whatever loss we are using this loss could be binary cross entropy if you are doing multi-class classification it could be squared loss if you're doing regression whatever loss you are using for this model to that loss you just have this hyper parameter lambda now here is the fun part what do we want we want w dot w transpose minus i equals to 0 similarly we want w transpose w minus i is should also be equal to 0. now remember that this is a matrix and this is a matrix right we want so here when i say 0 it means a 0 matrix we want this matrix to be completely zeros k rows and k columns of zeros similarly we want this we want this matrix also to be a matrix of zeros when i say zero here it doesn't mean a scalar it means a vector here right it should all be zeros all be zeros now how do i again remember the loss is a scalar value i can't just place w dot w transpose minus i in the optimization problem so what i'll do now is i'll take w dot w transpose minus i i'll take the frobenius long now what is the frobenius norm of this what is the square of the frobenius norm of this what is the frobenius norm of a matrix it's nothing but take each element take the square of the element and sum them so this is basically whatever is the resultant you get suppose you get some matrix here right suppose you get some matrix here take each element take each element square it up and sum it up take the square of this plus sum of plus square of this plus square of this plus square of this plus square of this so on so forth right so that's what a frobenius norm is and again we discussed about frobenius norms when we learned about matrix factorization it's the same concept here right so what do we want to enforce that w dot w transpose minus i should be a zero matrix we will say w dot w transpose minus i the frobenius norm of it again we want this frobenius norm to be equal to zero right similarly we want w transpose w minus i also should be equals to a zero matrix which means the frobenius norm squared of this should also be equals to 0. 
now look at this we have a hyperparameter lambda here if you increase this hyperparameter lambda right if we increase this hyperparameter lambda what happens it will ensure again this is a hyper parameter that will tune right so when we are when we are when we are solving this optimization problem if you have a high value of lambda what happens here this will be forced to zero and this will also be forced to zero so by by playing around with this hyper parameter lambda you can minimize the loss while also enforcing that this is zero or very close to zero and similarly this is very close to zero right and this is one way you can add new constraints again what what are we asking here in the problem definition all we are asking is how do you add a new constraint on this weight matrix between two layers and the simplest way to add a constraint like this is by basically is basically changing the optimization problem that you're trying to minimize right again this is a problem which involves some basic linear algebra again all that that we have done here is basic linear algebra again everything that we've done here is also basic linear algebra so we are mixing the concepts of basic linear algebra and concepts like frobenius norm that we've learnt in matrix factorization to the standard problem of loss minimization that we anyway solve in most machine learning algorithms including a deep learning algorithm righthere is a very interesting interview problem from deep learning and this problem involves deep learning and some linear algebra right so first let me explain the problem and i would like you to think about the solution and then i will go ahead and solve it for you so imagine you have a neural network let's assume it's a standard fully connected right so dense network right so you have some inputs and then let's assume there are two layers of neurons here let's assume this is layer i and this is layer i plus 1 obviously there are some layers before li and there are some layers after li plus 1 also okay i'm just i'm just let's assume there are two layers let's say you meet standard feed forward fully connected densely connected neural network so imagine if there are k activation units here and there are k activation units here let's just assume for simplicity right then there is a weight matrix w here which is a k cross k matrix right so imagine imagine this is my x right goes through multiple layers again one of the layers here has this weight matrix w and then there are multiple layers and then there is loss here obviously right so that i mean there are more layers here also we are just showing with respect to one layer only now in this neural network what we want this neural network to achieve is of course you will you want to minimize loss just like any other neural network but in addition to it we want this w matrix which is basically the matrix of all the connection weights between layer i and layer i plus 1 we want this w matrix to have some properties so look at this w this w has let's assume some columns column 1 column 2 column 3 so on so forth and a bunch of rows right since this is a k cross k matrix it will have k rows and k columns what we want this square matrix this is a square matrix right we want the square matrix to be orthogonal which means we want each of these rows so we want any if you pick any two rows here we want these rows any pair of rows to be perpendicular to each other or orthogonal to each other so we want ci to be perpendicular to cj where i is not equals to 
j and we want each column vector to be a unit vector similarly we want every row if you pick any two row vectors here right if you pick any two row vectors here we want ri to be perpendicular or orthogonal to r j when i is not equals to j obviously when i is not equals to j and we want each of the row vectors to be also unit vectors right so this is the constraint that we want we want this square matrix w which is the connection which is the which is a matrix of all connections between layer i and layer i plus 1. we want this to be an orthogonal matrix such that every pair of columns are orthogonal and unit vectors and every pair of rows are orthogonal and unit vectors this is the constraint we want now the question here is now how do you train this neural network of course in this neural network you would want to minimize your loss but in addition to minimizing loss you want to satisfy this constraint or this requirement that this w should be an orthogonal matrix satisfying all these constraints now how do you train such a such a model how do you so when you train a model how do you ensure that w follows these requirements that we are asking of it while also minimizing loss okay so i i hope this question is clear so please pause this video here and try to think of a solution on how you will train this neural network to ensure that you are minimizing the loss as well as ensuring that w satisfies the constraints that we want so i'm going to assume that you've put in some time so let's just use some basic linear algebra and then we'll change the optimization problem that we solve when when we're trying to when we're trying to find the weights of a neural network right so we'll change the optimization problem itself cool so first and foremost what is given to us that first let's look at it from the perspective of the rows of w right we want if we want wi to be perpendicular to w j let's assume w i is the i throw okay this is the notation that i am going to use when i say wi it means the i throw when i say w j it is the j throw right so we want wi to be perpendicular to w j whenever i is not equals to j right what does it what does that mean we what does it mean intuitively from a linear algebraic standpoint it means we want the dot product between w i and w j to be 0 whenever i is not equals to j very simple but at the same time we want each wi to also be a unit vector if you look here we want each row to be perpendicular to other rows but each row we want it to be a unit vector also which means wi dot wi we want it to be equal to 1 right this is the definition which satisfies that w i is a unit vector right so we want wi's to satisfy these two requirements right so if if you try to write this in one formula what you get here is w i dot w j the dot product should be equals to 1 if i equals to j or 0 otherwise right so this is what we want to satisfy now this is the vector notation right so wi is a vector wj is a vector here now if you write it in matrix form let's take w w is the matrix that we construct using w i's as rows right so this is what you have this is your w1 this is your w2 right so on so forth your wk right that's a matrix that you have now if if you do a dot product right if you do a product of w with w transpose this is your w so w transpose will have w 1 as column right what will w transpose look like it will have w1 as the column w2 as the second column so on so forth wk as the as the kth column now when you do this product what do you want so if you want to take this take 
this requirement into consideration w 1 into w 1 you want it to be 1. similarly w 2 into w 2 you want it to be 1 w 3 into w 3 you want it to be 1 and so on and so forth but w 1 into w 2 you want it to be 0 so what you'll construct here if you do a matrix multiplication between w and w transpose what you want to achieve is basically all the diagonal elements to be 1 all the non-diagonal elements to be 0 and such a matrix is called as an identity matrix right all the diagonal elements are 1 all the non-diagonal elements are 0. so what do we want we want w dot w transpose to be equals to the identity matrix of course this identity matrix is a k cross k identity matrix right so this the requirements that are being asked of us for the rows if you want each ri to be perpendicular to r j and if you want each ri to be unit vector that is equal to saying that we want w dot w transpose equals to i this is what we want to achieve again another way to write this is we want w w transpose minus i equals to 0 this is what we want right similarly just like the way we have these requirements for rows we have the requirements for columns also so now now look at this what are columns of w columns of w are nothing but rows of w transpose if you think about it if i just create a new matrix called w transpose right columns of w are nothing but rows of w transpose right so whatever logic we have constructed here same logic we can construct but instead of using w we'll use the matrix w transpose right so now what is w transpose i w transpose i is the i throw of w transpose right which is nothing but the eighth column of w look at this w transpose i here is the ith row of w transpose either of w transpose is nothing but the ith column of w similarly we want w transpose i dot product w transpose j should be equals to 0 when i is not equals to j and w transpose i dot w transpose i should be equals to 1 same logic same logic that we have applied for rows we are now just applying for columns and mathematically the columns of w are same as the rows of w transpose which means again using the same logic that we have applied here if you just apply it on columns what you want now is we want this matrix w transpose multiplied by matrix w to be equal to the identity matrix k cross k so what do we want this this requirement that we have on columns can be written as we want w transpose multiplied by w minus i should be equals to 0 right so look at look at the two constraints we have this is w dot w transpose minus i equals to 0 this is one requirement the other requirement here is w transpose w minus i also should be equals to zero right so when we try to fit the weights right when we try to find when we when we use back propagation as part of deep learning to estimate these weights right by minimizing the loss and by performing the back propagation by using the back propagation algorithm we want w to satisfy this requirement and we want w to also satisfy this now how do we now force your back propagation algorithm or your optimization problem to achieve this it's very simple look at this let's see you theta is all the parameters that you have in your model right so t remember w is just if you notice this if you notice this model w is just the weights between two layers it is basically the weights between these two layers only right all the weights all the weights that you have here but there will be other weights and other parameters that you have in layers before this and layers after that right let's just assume theta is 
all the parameters that you have so theta is all the parameters that you have in your deep learning model obviously w is part of theta right so what are we trying to minimize we are trying to find the optimal theta or in other words we are trying to find the optimal parameters of the model such that it minimizes loss whatever loss we are using this loss could be binary cross entropy if you are doing multi-class classification it could be squared loss if you're doing regression whatever loss you are using for this model to that loss you just have this hyper parameter lambda now here is the fun part what do we want we want w dot w transpose minus i equals to 0 similarly we want w transpose w minus i is should also be equal to 0. now remember that this is a matrix and this is a matrix right we want so here when i say 0 it means a 0 matrix we want this matrix to be completely zeros k rows and k columns of zeros similarly we want this we want this matrix also to be a matrix of zeros when i say zero here it doesn't mean a scalar it means a vector here right it should all be zeros all be zeros now how do i again remember the loss is a scalar value i can't just place w dot w transpose minus i in the optimization problem so what i'll do now is i'll take w dot w transpose minus i i'll take the frobenius long now what is the frobenius norm of this what is the square of the frobenius norm of this what is the frobenius norm of a matrix it's nothing but take each element take the square of the element and sum them so this is basically whatever is the resultant you get suppose you get some matrix here right suppose you get some matrix here take each element take each element square it up and sum it up take the square of this plus sum of plus square of this plus square of this plus square of this plus square of this so on so forth right so that's what a frobenius norm is and again we discussed about frobenius norms when we learned about matrix factorization it's the same concept here right so what do we want to enforce that w dot w transpose minus i should be a zero matrix we will say w dot w transpose minus i the frobenius norm of it again we want this frobenius norm to be equal to zero right similarly we want w transpose w minus i also should be equals to a zero matrix which means the frobenius norm squared of this should also be equals to 0. 
now look at this we have a hyperparameter lambda here if you increase this hyperparameter lambda right if we increase this hyperparameter lambda what happens it will ensure again this is a hyper parameter that will tune right so when we are when we are when we are solving this optimization problem if you have a high value of lambda what happens here this will be forced to zero and this will also be forced to zero so by by playing around with this hyper parameter lambda you can minimize the loss while also enforcing that this is zero or very close to zero and similarly this is very close to zero right and this is one way you can add new constraints again what what are we asking here in the problem definition all we are asking is how do you add a new constraint on this weight matrix between two layers and the simplest way to add a constraint like this is by basically is basically changing the optimization problem that you're trying to minimize right again this is a problem which involves some basic linear algebra again all that that we have done here is basic linear algebra again everything that we've done here is also basic linear algebra so we are mixing the concepts of basic linear algebra and concepts like frobenius norm that we've learnt in matrix factorization to the standard problem of loss minimization that we anyway solve in most machine learning algorithms including a deep learning algorithm righthere is a very interesting interview problem from deep learning and this problem involves deep learning and some linear algebra right so first let me explain the problem and i would like you to think about the solution and then i will go ahead and solve it for you so imagine you have a neural network let's assume it's a standard fully connected right so dense network right so you have some inputs and then let's assume there are two layers of neurons here let's assume this is layer i and this is layer i plus 1 obviously there are some layers before li and there are some layers after li plus 1 also okay i'm just i'm just let's assume there are two layers let's say you meet standard feed forward fully connected densely connected neural network so imagine if there are k activation units here and there are k activation units here let's just assume for simplicity right then there is a weight matrix w here which is a k cross k matrix right so imagine imagine this is my x right goes through multiple layers again one of the layers here has this weight matrix w and then there are multiple layers and then there is loss here obviously right so that i mean there are more layers here also we are just showing with respect to one layer only now in this neural network what we want this neural network to achieve is of course you will you want to minimize loss just like any other neural network but in addition to it we want this w matrix which is basically the matrix of all the connection weights between layer i and layer i plus 1 we want this w matrix to have some properties so look at this w this w has let's assume some columns column 1 column 2 column 3 so on so forth and a bunch of rows right since this is a k cross k matrix it will have k rows and k columns what we want this square matrix this is a square matrix right we want the square matrix to be orthogonal which means we want each of these rows so we want any if you pick any two rows here we want these rows any pair of rows to be perpendicular to each other or orthogonal to each other so we want ci to be perpendicular to cj where i is not equals to 
j and we want each column vector to be a unit vector similarly we want every row if you pick any two row vectors here right if you pick any two row vectors here we want ri to be perpendicular or orthogonal to r j when i is not equals to j obviously when i is not equals to j and we want each of the row vectors to be also unit vectors right so this is the constraint that we want we want this square matrix w which is the connection which is the which is a matrix of all connections between layer i and layer i plus 1. we want this to be an orthogonal matrix such that every pair of columns are orthogonal and unit vectors and every pair of rows are orthogonal and unit vectors this is the constraint we want now the question here is now how do you train this neural network of course in this neural network you would want to minimize your loss but in addition to minimizing loss you want to satisfy this constraint or this requirement that this w should be an orthogonal matrix satisfying all these constraints now how do you train such a such a model how do you so when you train a model how do you ensure that w follows these requirements that we are asking of it while also minimizing loss okay so i i hope this question is clear so please pause this video here and try to think of a solution on how you will train this neural network to ensure that you are minimizing the loss as well as ensuring that w satisfies the constraints that we want so i'm going to assume that you've put in some time so let's just use some basic linear algebra and then we'll change the optimization problem that we solve when when we're trying to when we're trying to find the weights of a neural network right so we'll change the optimization problem itself cool so first and foremost what is given to us that first let's look at it from the perspective of the rows of w right we want if we want wi to be perpendicular to w j let's assume w i is the i throw okay this is the notation that i am going to use when i say wi it means the i throw when i say w j it is the j throw right so we want wi to be perpendicular to w j whenever i is not equals to j right what does it what does that mean we what does it mean intuitively from a linear algebraic standpoint it means we want the dot product between w i and w j to be 0 whenever i is not equals to j very simple but at the same time we want each wi to also be a unit vector if you look here we want each row to be perpendicular to other rows but each row we want it to be a unit vector also which means wi dot wi we want it to be equal to 1 right this is the definition which satisfies that w i is a unit vector right so we want wi's to satisfy these two requirements right so if if you try to write this in one formula what you get here is w i dot w j the dot product should be equals to 1 if i equals to j or 0 otherwise right so this is what we want to satisfy now this is the vector notation right so wi is a vector wj is a vector here now if you write it in matrix form let's take w w is the matrix that we construct using w i's as rows right so this is what you have this is your w1 this is your w2 right so on so forth your wk right that's a matrix that you have now if if you do a dot product right if you do a product of w with w transpose this is your w so w transpose will have w 1 as column right what will w transpose look like it will have w1 as the column w2 as the second column so on so forth wk as the as the kth column now when you do this product what do you want so if you want to take this take 
this requirement into consideration w 1 into w 1 you want it to be 1. similarly w 2 into w 2 you want it to be 1 w 3 into w 3 you want it to be 1 and so on and so forth but w 1 into w 2 you want it to be 0 so what you'll construct here if you do a matrix multiplication between w and w transpose what you want to achieve is basically all the diagonal elements to be 1 all the non-diagonal elements to be 0 and such a matrix is called as an identity matrix right all the diagonal elements are 1 all the non-diagonal elements are 0. so what do we want we want w dot w transpose to be equals to the identity matrix of course this identity matrix is a k cross k identity matrix right so this the requirements that are being asked of us for the rows if you want each ri to be perpendicular to r j and if you want each ri to be unit vector that is equal to saying that we want w dot w transpose equals to i this is what we want to achieve again another way to write this is we want w w transpose minus i equals to 0 this is what we want right similarly just like the way we have these requirements for rows we have the requirements for columns also so now now look at this what are columns of w columns of w are nothing but rows of w transpose if you think about it if i just create a new matrix called w transpose right columns of w are nothing but rows of w transpose right so whatever logic we have constructed here same logic we can construct but instead of using w we'll use the matrix w transpose right so now what is w transpose i w transpose i is the i throw of w transpose right which is nothing but the eighth column of w look at this w transpose i here is the ith row of w transpose either of w transpose is nothing but the ith column of w similarly we want w transpose i dot product w transpose j should be equals to 0 when i is not equals to j and w transpose i dot w transpose i should be equals to 1 same logic same logic that we have applied for rows we are now just applying for columns and mathematically the columns of w are same as the rows of w transpose which means again using the same logic that we have applied here if you just apply it on columns what you want now is we want this matrix w transpose multiplied by matrix w to be equal to the identity matrix k cross k so what do we want this this requirement that we have on columns can be written as we want w transpose multiplied by w minus i should be equals to 0 right so look at look at the two constraints we have this is w dot w transpose minus i equals to 0 this is one requirement the other requirement here is w transpose w minus i also should be equals to zero right so when we try to fit the weights right when we try to find when we when we use back propagation as part of deep learning to estimate these weights right by minimizing the loss and by performing the back propagation by using the back propagation algorithm we want w to satisfy this requirement and we want w to also satisfy this now how do we now force your back propagation algorithm or your optimization problem to achieve this it's very simple look at this let's see you theta is all the parameters that you have in your model right so t remember w is just if you notice this if you notice this model w is just the weights between two layers it is basically the weights between these two layers only right all the weights all the weights that you have here but there will be other weights and other parameters that you have in layers before this and layers after that right let's just assume theta is 
all the parameters that you have so theta is all the parameters that you have in your deep learning model obviously w is part of theta right so what are we trying to minimize we are trying to find the optimal theta or in other words we are trying to find the optimal parameters of the model such that it minimizes loss whatever loss we are using this loss could be binary cross entropy if you are doing multi-class classification it could be squared loss if you're doing regression whatever loss you are using for this model to that loss you just have this hyper parameter lambda now here is the fun part what do we want we want w dot w transpose minus i equals to 0 similarly we want w transpose w minus i is should also be equal to 0. now remember that this is a matrix and this is a matrix right we want so here when i say 0 it means a 0 matrix we want this matrix to be completely zeros k rows and k columns of zeros similarly we want this we want this matrix also to be a matrix of zeros when i say zero here it doesn't mean a scalar it means a vector here right it should all be zeros all be zeros now how do i again remember the loss is a scalar value i can't just place w dot w transpose minus i in the optimization problem so what i'll do now is i'll take w dot w transpose minus i i'll take the frobenius long now what is the frobenius norm of this what is the square of the frobenius norm of this what is the frobenius norm of a matrix it's nothing but take each element take the square of the element and sum them so this is basically whatever is the resultant you get suppose you get some matrix here right suppose you get some matrix here take each element take each element square it up and sum it up take the square of this plus sum of plus square of this plus square of this plus square of this plus square of this so on so forth right so that's what a frobenius norm is and again we discussed about frobenius norms when we learned about matrix factorization it's the same concept here right so what do we want to enforce that w dot w transpose minus i should be a zero matrix we will say w dot w transpose minus i the frobenius norm of it again we want this frobenius norm to be equal to zero right similarly we want w transpose w minus i also should be equals to a zero matrix which means the frobenius norm squared of this should also be equals to 0. 
now look at this we have a hyperparameter lambda here if you increase this hyperparameter lambda right if we increase this hyperparameter lambda what happens it will ensure again this is a hyper parameter that will tune right so when we are when we are when we are solving this optimization problem if you have a high value of lambda what happens here this will be forced to zero and this will also be forced to zero so by by playing around with this hyper parameter lambda you can minimize the loss while also enforcing that this is zero or very close to zero and similarly this is very close to zero right and this is one way you can add new constraints again what what are we asking here in the problem definition all we are asking is how do you add a new constraint on this weight matrix between two layers and the simplest way to add a constraint like this is by basically is basically changing the optimization problem that you're trying to minimize right again this is a problem which involves some basic linear algebra again all that that we have done here is basic linear algebra again everything that we've done here is also basic linear algebra so we are mixing the concepts of basic linear algebra and concepts like frobenius norm that we've learnt in matrix factorization to the standard problem of loss minimization that we anyway solve in most machine learning algorithms including a deep learning algorithm right\n"