C4W3L04: Convolutional Implementation of Sliding Windows

Implementing Sliding Windows with Convolutional Neural Networks

To implement sliding windows, you first build a classifier for small crops. For example, to recognize a car, you would train a convolutional neural network on 14 by 14 by 3 images, then crop a 14 by 14 region out of the test image and pass it through the network.
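As a quick sanity check on the dimensions, each "valid" (no-padding) layer output follows one formula. A minimal sketch, with the layer sizes taken from the lecture's 14 by 14 by 3 classifier (the `conv_out` helper name is illustrative, not from the lecture):

```python
def conv_out(n, f, stride=1):
    """Spatial output size of a 'valid' convolution or pooling layer."""
    return (n - f) // stride + 1

# Shape chain of the 14x14x3 classifier described in the lecture:
n = conv_out(14, 5)    # 5x5 conv, 16 filters  -> 10 (10x10x16)
n = conv_out(n, 2, 2)  # 2x2 max pool          -> 5  (5x5x16)
n = conv_out(n, 5)     # 5x5 conv, 400 filters -> 1  (1x1x400)
# the two 1x1 convolutions keep this size: 1x1x400, then 1x1x4
```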

The next step is to slide this 14 by 14 window across the test image. With a 16 by 16 by 3 test image and a stride of 2 pixels, the window fits in four positions: upper left, upper right, lower left, and lower right. Running the ConvNet on each crop in turn gives you four labels, one per window position.
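The window positions are easy to enumerate. A small sketch, assuming the lecture's numbers (16 by 16 test image, 14 by 14 window, stride 2):

```python
# Top-left corners of every 14x14 window in a 16x16 image, stride 2.
image_size, window, stride = 16, 14, 2
corners = [(r, c)
           for r in range(0, image_size - window + 1, stride)
           for c in range(0, image_size - window + 1, stride)]
print(corners)  # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

Four positions, hence four independent forward passes in the naive implementation.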

However, there's a catch: even on this small 16 by 16 by 3 image, the computation done by these four forward passes of the network is highly duplicated, because the four crops overlap heavily. The convolutional implementation of sliding windows, introduced in the OverFeat paper by Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun, lets these four forward passes share most of that computation. Instead of running four separate passes through the ConvNet, you run the whole image through the same layers once, with the same parameters, including the same 16 filters of size 5 by 5, and read the four results out of a larger output volume.

The key is to first turn the network's fully connected layers into convolutional layers. The first fully connected layer of 400 units becomes 400 filters of size 5 by 5 by 16 applied to the 5 by 5 by 16 activation volume, producing a 1 by 1 by 400 volume; this is mathematically identical to the fully connected layer, since each filter spans the entire input volume. The next fully connected layer becomes a 1 by 1 convolution with 400 filters, and the softmax output becomes a 1 by 1 convolution with 4 filters, giving a 1 by 1 by 4 volume.

Now run the full 16 by 16 by 3 image through this all-convolutional network: the first layer gives a 12 by 12 by 16 volume, max pooling gives 6 by 6 by 16, the 400 filters of 5 by 5 by 16 give 2 by 2 by 400, and the 1 by 1 convolutions give 2 by 2 by 400 and then 2 by 2 by 4. Each 1 by 1 by 4 slice of that output is exactly the prediction you would have gotten by running the original ConvNet on the corresponding 14 by 14 crop, but the overlapping computation is done only once.
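This fully-connected-to-convolutional conversion can be checked numerically: a dense layer of 400 units applied to a flattened 5 by 5 by 16 volume computes exactly the same numbers as 400 filters of size 5 by 5 by 16 applied at their single valid position. A minimal NumPy sketch, with random weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 16))       # activations entering the old FC layer
w = rng.standard_normal((400, 5, 5, 16))  # 400 filters, each 5x5x16

# Fully connected view: flatten x and multiply by a 400 x (5*5*16) weight matrix.
fc = w.reshape(400, -1) @ x.reshape(-1)

# Convolutional view: each filter spans the whole input volume, so there is
# exactly one valid position and the output is a 1x1x400 volume.
conv = np.array([(f * x).sum() for f in w])

print(np.allclose(fc, conv))  # True: the two layers compute identical numbers
```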

In the case of a 28 by 28 by 3 image, the convolutional implementation becomes even more valuable. Instead of cropping out each 14 by 14 region and classifying it separately, one forward pass through the same network yields an 8 by 8 by 4 output volume, and each 1 by 1 by 4 slice corresponds to one 14 by 14 window on the original image, taken with a stride of 2 (the stride of 2 comes from the 2 by 2 max pooling).
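The 8 by 8 output size follows from the same "valid" arithmetic as before; a short sketch of the shape bookkeeping:

```python
# Spatial size through the all-convolutional network for a 28x28 input:
n = 28
n = n - 5 + 1  # 5x5 conv, 16 filters   -> 24
n = n // 2     # 2x2 max pool, stride 2 -> 12
n = n - 5 + 1  # 5x5 conv, 400 filters  -> 8
# the two 1x1 convolutions keep the size, so the output volume is 8x8x4

# That matches sliding a 14x14 window over the 28x28 image with stride 2:
print(n == (28 - 14) // 2 + 1)  # True
```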

In other words, rather than cropping out each region and classifying it sequentially, you make all the predictions at once with a single forward pass over the whole image, sharing the computation in the regions that the windows have in common. This is what makes the convolutional implementation of sliding windows so much more efficient on large images.
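The sharing works because a "valid" convolution of a crop is just a slice of the "valid" convolution of the whole image. A minimal single-channel NumPy sketch (the `valid_conv` helper is illustrative, not part of the lecture):

```python
import numpy as np

def valid_conv(img, filt):
    """Naive single-channel 'valid' convolution (cross-correlation, as in ConvNets)."""
    f = filt.shape[0]
    out = np.empty((img.shape[0] - f + 1, img.shape[1] - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + f, j:j + f] * filt).sum()
    return out

rng = np.random.default_rng(1)
image = rng.standard_normal((16, 16))
filt = rng.standard_normal((5, 5))

full = valid_conv(image, filt)              # one pass over the whole 16x16 image
crop = valid_conv(image[2:16, 2:16], filt)  # pass over the lower-right 14x14 window

# The crop's feature map is just a slice of the full image's feature map,
# so the full-image pass already contains every window's computation.
print(np.allclose(crop, full[2:, 2:]))  # True
```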

Finally, this algorithm still has one weakness: the positions of the bounding boxes are not going to be very accurate. The predicted location of the car may not be precise enough for certain applications. In the next video, we'll see how to fix this problem.
