C4W1L05 Strided Convolutions

**Understanding Convolutional Neural Networks: A Deep Dive**

In this article, we will delve into the world of convolutional neural networks (CNNs) and explore how they use convolutional operations to process data. We will start by defining what a convolution is and how it works.

**What is a Convolution?**

A convolution is an operation that slides a small matrix (the filter, or kernel) over a larger matrix and, at each position, performs an element-wise multiplication between the filter and the region of the larger matrix it currently covers. Summing those products yields one element of the output, so the result is a new matrix in which each element summarizes one local region of the input.
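The sliding-and-summing operation just described can be sketched in a few lines of NumPy (stride 1, no padding; the function name `conv2d_valid` is just for this illustration):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and
    sum the element-wise products at each position."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))
print(conv2d_valid(image, kernel).shape)  # (3, 3)
```

Note that a 5 x 5 image convolved with a 3 x 3 filter shrinks to 3 x 3, since the filter only fits in 5 - 3 + 1 = 3 positions along each axis.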

**Notation for Convolution**

The notation used to denote the output size involves the floor function, written as the floor of z, which means taking z and rounding down to the nearest integer. If we convolve an n x n image with an f x f filter, using padding p and stride s, the output has size ⌊(n + 2p − f)/s + 1⌋ x ⌊(n + 2p − f)/s + 1⌋. With no padding and a stride of one, this reduces to the familiar (n − f + 1) x (n − f + 1).
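The output-size rule ⌊(n + 2p − f)/s⌋ + 1 can be packaged as a small helper (the function name `conv_output_size` is illustrative):

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f filter,
    padding p, and stride s: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(5, 3))       # 3: (5 - 3)/1 + 1
print(conv_output_size(7, 3, s=2))  # 3: (7 - 3)/2 + 1
print(conv_output_size(6, 3, s=2))  # 2: floor(2.5)
```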

**Rounding Down**

When performing convolution, the output-size formula does not always yield an integer. In these cases, we round down to the nearest integer, which corresponds to the convention that the filter must lie entirely within the image (or the image plus its padding) for an output value to be computed. For example, convolving a 7 x 7 image with a 3 x 3 filter at stride 2 gives ⌊(7 − 3)/2 + 1⌋ = 3, a 3 x 3 output; a 6 x 6 image with the same filter and stride gives ⌊(6 − 3)/2 + 1⌋ = ⌊2.5⌋ = 2, a 2 x 2 output.

**Cross-Correlation vs. Convolution**

There is a technical comment worth mentioning regarding cross-correlation versus convolution. In signal processing and mathematics texts, convolution is defined as first mirroring the filter both horizontally and vertically, then performing the element-wise multiplication and summing. In the context of deep learning, we typically skip this mirroring step and still call the operation a convolution, even though it is technically a cross-correlation. The flipping is what gives mathematical convolution its associativity property, (A ∗ B) ∗ C = A ∗ (B ∗ C), which is useful in signal processing but does not matter for deep neural networks, so omitting it simplifies the code without changing what the network can learn.
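The distinction is easy to check numerically: cross-correlating with a doubly flipped filter reproduces the textbook convolution, while the unflipped versions generally differ. A small NumPy sketch (function names are illustrative):

```python
import numpy as np

def cross_correlate(image, kernel):
    """Deep-learning-style 'convolution': no kernel flip."""
    f = kernel.shape[0]
    m = image.shape[0] - f + 1
    out = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

def true_convolve(image, kernel):
    """Textbook convolution: mirror the kernel on both axes first."""
    return cross_correlate(image, np.flip(kernel))

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6))
kernel = rng.standard_normal((3, 3))

# The two differ unless the kernel happens to be flip-symmetric.
print(np.allclose(true_convolve(image, kernel),
                  cross_correlate(image, kernel)))  # False
```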

**Convolutional Operations**

To perform a convolutional operation, you follow these steps:

* Slide the filter over the input image, moving it by the stride at each step.

* (Only in the signal-processing definition) flip the filter horizontally and vertically first; deep learning omits this step.

* At each position, perform the element-wise multiplication between the filter and the region of the image it covers.

* Sum the products to produce a single output value.

* Repeat this process for each position to fill in the output.
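The steps above, extended with a stride parameter, can be sketched as follows (a toy implementation for illustration, not an optimized one):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel with the given stride, multiply element-wise,
    and sum.  Positions where the kernel would hang outside the image
    are skipped, which is why the output side length is
    floor((n - f) / stride) + 1."""
    n, f = image.shape[0], kernel.shape[0]
    m = (n - f) // stride + 1
    out = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r+f, c:c+f] * kernel)
    return out

image = np.ones((7, 7))
kernel = np.ones((3, 3))
print(conv2d(image, kernel, stride=2).shape)  # (3, 3)
```

With stride 2, the 7 x 7 input produces a 3 x 3 output, matching the formula (7 − 3)/2 + 1 = 3.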

**Advantages of Convolutional Neural Networks**

Convolutional neural networks have several advantages over other types of neural networks. They can automatically detect and describe local patterns in data, such as edges and textures. This makes them particularly useful for image recognition tasks. Additionally, convolutional neural networks can share weights across different locations in the input image, which reduces the number of parameters needed to train the network.
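The savings from weight sharing are easy to quantify. As an illustration (the layer sizes here are hypothetical, chosen so that six 5 x 5 filters map a 32 x 32 x 3 input to a 28 x 28 x 6 output), compare the parameter counts of a fully connected layer and a convolutional layer producing the same output shape:

```python
# Fully connected: every input unit connects to every output unit,
# plus one bias per output unit.
fc_params = (32 * 32 * 3) * (28 * 28 * 6) + (28 * 28 * 6)

# Convolutional: six 5x5x3 filters shared across all positions,
# plus one bias per filter.
conv_params = 6 * (5 * 5 * 3 + 1)

print(fc_params)    # 14455392
print(conv_params)  # 456
```

The convolutional layer needs roughly five orders of magnitude fewer parameters, precisely because the same small filter is reused at every spatial location.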

In the next video, we will explore how to carry out convolutions with volumes, which will make what you can do with convolutions much more powerful.

**Transcript**

Strided convolutions are another piece of the basic building block of convolutions as used in convolutional neural networks. Let me show you an example. Let's say you want to convolve this seven by seven image with this three by three filter, except that instead of doing it the usual way, we're going to do it with a stride of two. What that means is you take the element-wise product as usual in this upper-left three by three region, and then multiply and add, and that gives you ninety-one. But then, instead of stepping the blue box over by one step, we're going to step it over by two steps. So we're going to make it hop over two steps, like so. Notice how the upper-left-hand corner has gone from this dot to this dot, jumping over one position. Then you do the usual element-wise product and summing, and that gives you, it turns out, one hundred. Now we're going to do that again and make the blue box jump over by two steps. You end up there, and that gives you eighty-three. Now, when you go to the next row, you again take two steps instead of one step, so we're going to move the blue box over there. Notice how we're skipping over one of the positions; that gives you 69. Now you again step over two steps; this gives you 91, and so on, then 127. And then for the final row, 44, 72, and 74.

So in this example, we convolve a 7 by 7 matrix with a 3 by 3 matrix, and we get a 3 by 3 output. The input and output dimensions turn out to be governed by the following formula: if you have an n by n image convolved with an f by f filter, and you use padding p and stride s (in this example s is equal to 2), then you end up with an output that is (n + 2p − f)/s + 1 by (n + 2p − f)/s + 1, because you're now stepping s steps at a time instead of just one step at a time. In our example, we have (7 + 0 − 3)/2 + 1 = 4/2 + 1 = 3, which is why we wound up with this 3 by 3 output.

Now, just one last detail: what if this fraction is not an integer? In that case, we're going to round down. This notation denotes the floor of z; it means taking z and rounding down to the nearest integer. The way this is implemented is that you carry out the blue-box multiplication only if the blue box is fully contained within the image, or the image plus the padding; if any part of the blue box hangs outside, you just do not do that computation. The convention is that your 3 by 3 filter must lie entirely within your image, or the image plus the padding region, before there is a corresponding output generated. If that's the convention, then the right thing to do when computing the output dimension is to round down in case (n + 2p − f)/s is not an integer.

So, just to summarize the dimensions: if you have an n by n matrix or n by n image that you convolve with an f by f matrix or f by f filter, with padding p and stride s, then the output size will have this dimension. It is nice if we can choose all of these numbers so that the result is an integer, although sometimes you don't have to do that, and rounding down is just fine as well. But please feel free to work through a few examples of values of n, f, p, and s for yourself, to convince yourself, if you want, that this formula is correct for the output size.

Now, before moving on, there is a technical comment I want to make about cross-correlation versus convolution. This won't affect what you have to do to implement convolutional neural networks, but depending on whether you read a math textbook or a signal processing textbook, there is one other possible inconsistency in the notation. If you look at a typical math textbook, the way convolution is defined, before doing the element-wise product and summing, there's actually one other step that you would first take. To convolve this 6 by 6 matrix with the 3 by 3 filter, you would first take the 3 by 3 filter and flip it on the horizontal as well as the vertical axis. So this filter with rows 3 4 5, 1 0 2, −1 9 7 would become its mirror image: the 3 goes here, the 4 goes there, the 5 goes there, and so on for the other rows. This is really taking the 3 by 3 filter and mirroring it both on the vertical and the horizontal axis, and it is this flipped matrix that you would then copy over here. To compute the output, you would take 2 times 7, plus 3 times 2, plus 7 times 5, and so on; you multiply out the elements of this flipped matrix in order to compute the upper-left-most element of the 4 by 4 output. Then you take those nine numbers and shift them over by one, shift them over by one, and so on.

The way we've defined the convolution operation in these videos is that we've skipped this mirroring operation. Technically, the operation we've been using for the last few videos is sometimes called cross-correlation instead of convolution, but in the deep learning literature, by convention, we just call this a convolution operation. So, just to summarize: by convention, in machine learning, we usually do not bother with this flipping operation. Technically, this operation is maybe better called cross-correlation, but most of the deep learning literature just calls it the convolution operator, and I'm going to use that convention in these videos as well. If you read a lot of the machine learning literature, you'll find most people just call this the convolution operator without bothering to use these flips. It turns out that in signal processing, or in certain branches of mathematics, doing the flipping in the definition of convolution causes the convolution operator to enjoy the property that (A ∗ B) ∗ C is equal to A ∗ (B ∗ C). This is called associativity in mathematics, and it is nice for some signal processing applications, but for deep neural networks it really doesn't matter, so omitting this double mirroring operation simplifies the code and makes the neural networks work just as well. By convention, most of us just call this convolution, even though mathematicians sometimes prefer to call it cross-correlation. But this should not affect anything you have to implement in the programming exercises, and should not affect your ability to read and understand the deep learning literature.

So you've now seen how to carry out convolutions, and you've seen how to use padding as well as strides for convolutions. But so far, all we've been using are convolutions over matrices, like over a 6 by 6 matrix. In the next video, you'll see how to carry out convolutions over volumes, and this will make what you can do with convolutions suddenly much more powerful. Let's go on to the next video.