The Convolution Operator: A Fundamental Building Block in Computer Vision
One of the fundamental operations used in computer vision is the convolution operator. It combines two matrices by sliding a small matrix (the filter) over a larger one (the image) and, at each position, summing the element-wise products of the filter and the image region beneath it. The convolution operator is a crucial component of many image processing techniques, including edge detection, image filtering, and feature extraction.
To illustrate how the convolution operator works, let's consider an example of convolving a 6x6 image with a 3x3 filter. Suppose we have a grayscale image represented as a 6x6 matrix, where the left half of the image is 10 and the right half is 0. We can represent this image as follows:
```
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
```
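Now, let's apply the convolution operator to this image using a 3x3 vertical edge detection filter. The filter has positive weights on the left, zeros in the middle, and negative weights on the right:
```
1 0 -1
1 0 -1
1 0 -1
```
To perform the convolution operation, we slide the filter over the image and, at each position, multiply each filter element by the image pixel beneath it and add up the nine products. For example, at the upper-left position the filter covers a block of nine 10s: the left column contributes 10 + 10 + 10, the middle column contributes 0, and the right column contributes -10 - 10 - 10, for a total of 0. One step to the right, the filter covers two columns of 10s followed by one column of 0s, so the sum is 10 + 10 + 10 - 0 - 0 - 0 = 30.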
Each element of the output matrix is therefore the sum of the element-wise products of the filter and the image region it currently covers. This process is repeated for every position at which the filter fits entirely inside the image. A 3x3 filter fits at four horizontal and four vertical positions in a 6x6 image, so the output is a 4x4 matrix rather than a matrix with the same dimensions as the original image.
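The sliding-window computation described above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `correlate2d_valid` is hypothetical, no padding and a stride of 1 are assumed, and the example inputs are the 6x6 image whose left half is 10 and right half is 0 together with a vertical edge filter whose columns are 1, 0, and -1.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the
    sum of element-wise products at every position where it fully fits."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    # The filter fits at (img - k + 1) positions along each axis.
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Assumed example inputs: left half 10, right half 0.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
# Assumed vertical edge filter: columns of 1, 0, -1.
vertical_edge = [[1, 0, -1] for _ in range(3)]

result = correlate2d_valid(image, vertical_edge)
for row in result:
    print(row)
```

Each of the four printed rows is `[0, 30, 30, 0]`: the response is zero where the window sees a uniform region and 30 where it straddles the transition from 10 to 0.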
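Now, let's look at how this operation affects the image. A filter detects edges only if its weights respond to a change in intensity. For contrast, consider a 3x3 filter whose only non-zero element is the center element, with all other elements set to zero:
```
0 0 0
0 1 0
0 0 0
```
This filter does not detect edges at all: each output value is simply the pixel under the center of the filter, so the output is a cropped copy of the input. A vertical edge filter, with 1s in its left column, 0s in its middle column, and -1s in its right column, behaves differently. When we apply it to our 6x6 image using convolution, we get:
```
0 30 30 0
0 30 30 0
0 30 30 0
0 30 30 0
```
The resulting 4x4 matrix is zero wherever the filter covers a uniform region and 30 wherever it covers bright pixels on the left and dark pixels on the right. The band of 30s in the middle represents the detection of the vertical edge in the input image. The band looks thick only because the image is so small; on a much larger image, such as 1000x1000, the detected edge would appear as a thin line.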
The convolution operator has several advantages that make it a popular choice in computer vision. One of its key benefits is that it lets us specify which features to extract from an image simply by choosing the filter. This makes it easy to design algorithms for tasks such as edge detection, image filtering, and feature extraction. The same operation, with different filter weights, also implements other image processing techniques, including blurring (smoothing) and sharpening.
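To make the point about swapping filters concrete, here is a small sketch that applies three different filters to the same image with one routine. The helper name `apply_filter` is hypothetical, and the horizontal edge and box blur filters are illustrative choices that do not come from the worked example; `Fraction` is used only so that the blur weights of 1/9 stay exact.

```python
from fractions import Fraction


def apply_filter(image, kernel):
    """Sum of element-wise products at each position (no padding, stride 1)."""
    k_h, k_w = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_h) for j in range(k_w))
            for c in range(len(image[0]) - k_w + 1)
        ]
        for r in range(len(image) - k_h + 1)
    ]


# Assumed example image: left half 10, right half 0.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# Illustrative filters (assumptions, not taken from the text above).
filters = {
    "vertical edge": [[1, 0, -1]] * 3,
    "horizontal edge": [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
    "box blur": [[Fraction(1, 9)] * 3 for _ in range(3)],
}

responses = {name: apply_filter(image, k) for name, k in filters.items()}
for name, out in responses.items():
    # Print only the first output row; every row is identical for this image.
    print(name, [float(v) for v in out[0]])
```

On this image the vertical edge filter responds in the middle columns, the horizontal edge filter gives zero everywhere because there is no change from top to bottom, and the box blur turns the sharp step from 10 to 0 into a gradual ramp.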
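In practice, most programming environments provide library functions for convolution, so you rarely need to write the loops yourself. For example, MATLAB provides conv2() for 2D convolution, and in Python SciPy provides scipy.signal.convolve2d() (NumPy itself only offers one-dimensional convolution through numpy.convolve()). Both of these flip the filter before sliding it, following the mathematical definition of convolution. The unflipped sliding sum of products described in this article corresponds to scipy.signal.correlate2d(), and it is also what deep learning frameworks such as TensorFlow (tf.nn.conv2d) compute.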
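Library routines that follow the mathematical definition of convolution rotate the filter by 180 degrees before sliding it, which changes the sign of the response for an antisymmetric filter such as a vertical edge detector. The sketch below illustrates this in plain Python without calling any library; the function names are hypothetical, and the image and filter are the assumed 6x6 step image and the 1, 0, -1 vertical edge filter.

```python
def correlate2d_valid(image, kernel):
    """Unflipped sliding sum of products (no padding, stride 1)."""
    k_h, k_w = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_h) for j in range(k_w))
            for c in range(len(image[0]) - k_w + 1)
        ]
        for r in range(len(image) - k_h + 1)
    ]


def flip_kernel(kernel):
    """Rotate the kernel by 180 degrees (reverse the rows and the columns)."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d_valid(image, kernel):
    """Mathematical convolution: correlate with the flipped kernel."""
    return correlate2d_valid(image, flip_kernel(kernel))


image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
vertical_edge = [[1, 0, -1] for _ in range(3)]

unflipped = correlate2d_valid(image, vertical_edge)
flipped = convolve2d_valid(image, vertical_edge)
print(unflipped[0])  # response without flipping the filter
print(flipped[0])    # response with the filter flipped first
```

The unflipped version gives 30 at the edge and the flipped version gives -30, so when a library result has the opposite sign from a hand calculation, the flip convention is the first thing to check.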
In conclusion, the convolution operator is a fundamental building block in computer vision that allows us to extract features from images by applying different filters. Its ease of implementation and flexibility make it a popular choice for tasks such as edge detection, image filtering, and feature extraction.

Video transcript

The convolution operation is one of the fundamental building blocks of a convolutional neural network. Using edge detection as the motivating example, in this video you'll see how the convolution operation works.

In previous videos, I talked about how the early layers of a neural network might detect edges, then some later layers might detect parts of objects, and even later layers might detect complete objects, like people's faces in this case. In this video, you'll see how you can detect edges in an image.

Let's take an example. Given a picture like that, for a computer to figure out what the objects in the picture are, the first thing you might do is detect vertical edges in the image. For example, this image has all those vertical lines where the railings are, as well as roughly vertical lines along the outlines of the pedestrians, and those get detected in this vertical edge detector output. You might also want to detect horizontal edges. For example, there's a very strong horizontal line where this railing is, and that also gets detected, roughly here.

So how do you detect edges in an image like this? Let's look at an example. Here is a 6 by 6 grayscale image. Because this is a grayscale image, it is just a 6 by 6 by 1 matrix rather than 6 by 6 by 3, because there are no separate RGB channels. In order to detect edges, let's say vertical edges, in this image, what you can do is construct a 3 by 3 matrix. In the terminology of convolutional neural networks, this is going to be called a filter. I'm going to construct a 3 by 3 filter, or 3 by 3 matrix, that looks like this, reading down its columns: 1, 1, 1, 0, 0, 0, minus 1, minus 1, minus 1. Sometimes research papers will call this a kernel instead of a filter, but I'm going to use the filter terminology in these videos. What you're going to do is take the 6 by 6 image and convolve it, and the convolution operation is denoted by this asterisk, with the 3 by 3 filter.

One slightly unfortunate thing about the notation is that in mathematics the asterisk is the standard symbol for convolution, but in Python it is also used to denote multiplication, or maybe element-wise multiplication. So this asterisk has dual purposes. It's overloaded notation, but I'll try to be clear in these videos when the asterisk refers to convolution.

The output of this convolution operator will be a 4 by 4 matrix, which you can think of as a 4 by 4 image. The way you compute this 4 by 4 output is as follows. To compute the first element, the upper-left element of this 4 by 4 matrix, what you're going to do is take the 3 by 3 filter and paste it on top of the 3 by 3 region of your original input image. So I've written here 1, 1, 1, 0, 0, 0, minus 1, minus 1, minus 1. What you should do is take the element-wise product, so the first one would be 3 times 1, and then the second one, going down here, would be 1 times 1, and then plus 2 times 1, and then add up all of the resulting nine numbers. The middle column gives you 0 times 0 plus 5 times 0 plus 7 times 0, and the rightmost column gives 1 times minus 1 plus 8 times minus 1 plus 2 times minus 1. Adding up these nine numbers gives you negative 5, so I'm going to fill in negative 5 over here. You can add these nine numbers in any order, of course. It's just that I went down the first column, then the second column, then the third.

Next, to figure out what this second element is, you're going to take the blue square and shift it one step to the right, like so, and let me get rid of the green marks here. You're going to do the same element-wise product and then addition, so you have 0 times 1 plus 5 times 1 plus 7 times 1, plus 1 times 0 plus 8 times 0 plus 2 times 0, plus 2 times negative 1 plus 9 times negative 1 plus 5 times negative 1, and if you add up those nine numbers you end up with negative 4.

And so on. If you shift this to the right, do the nine products, and add them up, you get 0, and then over here you should get 8. Just to verify, you have 2 plus 9 plus 5, that's 16. The middle column gives you 0, and the rightmost column, 4 plus 1 plus 3, times negative 1, is minus 8. So that's 16 from the left column minus 8, and that gives you 8, like we have over here.

Next, to get this element in the next row, you take the blue square and shift it one down, so you now have it in that position, and again repeat the element-wise product and addition exercise. If you do that, you should get negative 10 here. If you shift it one to the right, you should get negative 2, and then 2, and then 3, and so on, and then fill in all the rest of the elements of the matrix. To be clear, this minus 16 would be obtained from this lower-right 3 by 3 region.

So a 6 by 6 matrix convolved with a 3 by 3 matrix gives you a 4 by 4 matrix. These are images and filters, but they're really just matrices of various dimensions. The matrix on the left is convenient to interpret as an image, the one in the middle we interpret as a filter, and the one on the right you can interpret as maybe another image. This turns out to be a vertical edge detector, and you'll see why on the next slide.

Before going on, though, just one other comment: if you implement this in a programming language, then in practice most programming languages will have some different function, rather than an asterisk, to denote convolution. For example, in the programming exercise you'll implement a function called conv_forward. If you do this in TensorFlow, there's a function tf.nn.conv2d. And in other deep learning programming frameworks, such as the Keras programming framework, which you'll see later in this course, there's a function called Conv2D that implements convolution, and so on. All the deep learning frameworks that have good support for computer vision will have some function for implementing this convolution operator.

So why is this doing vertical edge detection? Let's look at another example. To illustrate this, we're going to use a simplified image. Here is a simple 6 by 6 image where the left half of the image is 10 and the right half is 0. If you plot this as a picture, it might look like this, where the left half, the 10s, gives you brighter pixel intensity values and the right half gives you darker pixel intensity values. I'm using that shade of gray to denote zeros, although maybe it could also be drawn as black. In this image, there's clearly a very strong vertical edge right down the middle, as it transitions from white to black, or white to a darker color.

When you convolve this with the 3 by 3 filter, and this 3 by 3 filter can be visualized as follows, with lighter, brighter pixels on the left, then mid-tone zeros in the middle, and then darker on the right, what you get is this matrix on the right. Just to check the math if you want: this 0, for example, is obtained by taking the element-wise product with this 3 by 3 block, so you get from the left column 10 plus 10 plus 10, then zeros in the middle, and then minus 10 minus 10 minus 10, which is why you end up with 0 over here. In contrast, that 30 is obtained from this block, where you have 10 plus 10 plus 10 and then minus 0, minus 0, which is why you end up with a 30 over there.

Now, if you plot this rightmost matrix as an image, it will look like that, with this lighter region right in the middle, and that corresponds to having detected the vertical edge down the middle of your 6 by 6 image. In case the dimensions here seem a little bit wrong, in that the detected edge seems really thick, that's only because we're working with very small images in this example. If you were using, say, a 1000 by 1000 image rather than a 6 by 6 image, you'd find that this does a pretty good job of detecting the vertical edges in your image. In this example, the bright region in the middle is just the output image's way of saying that it looks like there's a strong vertical edge right down the middle of the image.

Maybe one intuition to take away from vertical edge detection is that a vertical edge is a 3 by 3 region, since we're using a 3 by 3 filter, where there are bright pixels on the left, you don't care that much what's in the middle, and there are dark pixels on the right. The middle of this 6 by 6 image is really where there are bright pixels on the left and darker pixels on the right, and that's why it thinks there's a vertical edge over there. The convolution operation gives you a convenient way to specify how to find these vertical edges in an image.

So you've now seen how the convolution operator works. In the next video, you'll see how to take this and use it as one of the basic building blocks of a convolutional neural network.