**Understanding Deep Learning and Computer Vision**
The field of deep learning and computer vision has made tremendous progress in recent years, enabling machines to interpret and understand visual data from images and videos. One of the key techniques used in this field is called "upsampling": increasing the spatial resolution of an image or feature map while retaining its essential features.
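As a concrete illustration, here is a minimal sketch of bilinear upsampling in PyTorch; the tensor shapes are illustrative assumptions, not values from the original discussion:

```python
import torch
import torch.nn.functional as F

# A batch of one 8x8 feature map with 64 channels (N, C, H, W).
features = torch.randn(1, 64, 8, 8)

# Double the spatial resolution; bilinear interpolation fills in the new
# pixels smoothly instead of simply repeating existing values.
upsampled = F.interpolate(features, scale_factor=2, mode="bilinear",
                          align_corners=False)
print(upsampled.shape)  # torch.Size([1, 64, 16, 16])
```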
**Upsampling: Reversing the Process**
Deep networks repeatedly downsample their input, usually with max pooling, partly to become invariant to where things are in the image and partly to save an enormous amount of memory. By the end we may know that, say, a cat is in the image, but the feature map is only a few pixels across, and simply ballooning it up with a large linear upsampling would give nothing more than a rough blob. The smarter upsampling proposed by Jonathan Long in 2014 essentially reverses the downsampling process. We upsample, which doubles the spatial size, then look back at the corresponding earlier layer of the network, now the same size, and bring in its information; "bring in" here literally means element-wise addition. Convolutional layers learn that mapping, so the network can take nothing from the earlier layer, everything, or anything in between. Finally, we upsample back to the original size and merge in the earliest features using a sum. What we've actually done is a smart way of making the coarse prediction bigger.
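A minimal sketch of one such decoder step, assuming made-up channel counts (256 for the deep features, 64 for the skip connection); a 1x1 convolution plays the role of the learned mapping that decides how much of the earlier layer to take:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

deep = torch.randn(1, 256, 8, 8)    # low resolution, high-level features
skip = torch.randn(1, 64, 16, 16)   # higher resolution, earlier features

# A learned 1x1 convolution projects the skip features to the same channel
# count; its weights can pass through nothing, everything, or anything
# in between.
project = nn.Conv2d(64, 256, kernel_size=1)

# Upsample the deep features to match the skip resolution, then merge by sum.
up = F.interpolate(deep, scale_factor=2, mode="bilinear", align_corners=False)
merged = up + project(skip)
print(merged.shape)  # torch.Size([1, 256, 16, 16])
```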
**Combining Deep and Shallow Features**
The deep features are very sure about what is in the image, but only roughly where it is. The shallow features have much higher pixel resolution: they are much more sure, in some sense, about where things are, but not exactly what they are. Intuitively, the deep layers are saying "this is a cat" while the shallow layers have seen some textured fur; combining the two lets us outline exactly where the cat is.
**Semantic Segmentation**
You can use this idea for all kinds of things. People have used it for segmentation, or more precisely semantic segmentation, where every pixel is labelled with a class depending on what is in that pixel. Traditional segmentation usually meant just background versus foreground; semantic segmentation may mean hundreds of classes.
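In practice, the segmentation head is typically a 1x1 convolution producing one score per class at every pixel, and each pixel is labelled with its highest-scoring class. A minimal sketch, with illustrative channel and class counts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21  # e.g. 20 object classes plus background (illustrative)

# One score per class at every pixel.
classifier = nn.Conv2d(256, num_classes, kernel_size=1)

features = torch.randn(1, 256, 64, 64)   # decoder output (assumed shape)
logits = classifier(features)            # (1, num_classes, 64, 64)

# Label each pixel with its highest-scoring class.
prediction = logits.argmax(dim=1)        # (1, 64, 64) integer class map

# Training uses an ordinary cross-entropy loss applied per pixel.
target = torch.randint(0, num_classes, (1, 64, 64))
loss = F.cross_entropy(logits, target)
```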
**Applications of Semantic Segmentation**
For instance, in a typical office scene the classes might be a person, the table, the computer, the desk, and the window. There is a huge range of applications. At the most basic level, you could try to find just one kind of object in a scene: for people, every pixel is either "person" or "background", and we don't care about anything else.
**Scaling Up to Many Classes**
On the other hand, you could be training on something like ImageNet, with lots and lots of classes. The MS COCO dataset, for example, has many classes, so you might be finding aeroplanes, cars, and so on. People do this for street scene segmentation as well: given a picture of a road, where is the road, where is the pavement, which pixels are buildings or road signs? You can analyse the entire scene, which is obviously quite powerful.
**Localization with Heat Maps**
Instead of segmenting the image, outlining where an object is with a hard yes or no at every pixel, why not draw a heat map of where we think it is? Then we can pinpoint objects: we can say where the two pupils are on a face, or locate someone's face, nose, or forehead, and then fit a model to those points.
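A common way to train such a network (an assumption here, not something spelled out above) is to regress per-keypoint heat maps, where each target is a Gaussian bump centred on the keypoint. A minimal sketch with made-up coordinates:

```python
import numpy as np

def keypoint_heatmap(h, w, cx, cy, sigma=2.0):
    """Render a Gaussian bump centred on a keypoint (e.g. a pupil)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# One heat map per keypoint: two pupils at illustrative coordinates.
left_pupil = keypoint_heatmap(64, 64, cx=24, cy=30)
right_pupil = keypoint_heatmap(64, 64, cx=40, cy=30)
```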
**Applications in Various Fields**
This has implications for things like Kinect-style sensors and interactive games, but also pedestrian tracking and plenty of other situations where it is useful to know what a person is up to. Finally, we use it in plant science to count and localize objects: where is the disease in this image, and can we produce a heat map that shows exactly where it is? Where are the ears of wheat, and can we count the spikelets to estimate how much yield one plot is producing compared to another?
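Once the network outputs such a heat map, counting can be as simple as finding local maxima above a threshold. A minimal sketch; the neighbourhood size and threshold are illustrative choices:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def count_peaks(heatmap, threshold=0.5):
    """Count local maxima above a threshold, e.g. ears of wheat."""
    # A pixel is a detection if it is the maximum of its 3x3 neighbourhood
    # and its score clears the threshold.
    local_max = maximum_filter(heatmap, size=3) == heatmap
    return int(np.sum(local_max & (heatmap > threshold)))
```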
**Encoder-Decoder Architecture**
This architecture is called an encoder-decoder. The encoder converts our spatial information into features describing what is going on in the scene in general: we give up spatial resolution in exchange for learning more about the scene. The decoder then brings the resolution back, pulling in detail from earlier parts of the network along the way.
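Putting the pieces together, here is a tiny end-to-end encoder-decoder sketch in PyTorch. The layer sizes and class count are illustrative assumptions, not the architecture from the original discussion:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoderDecoder(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Encoder: trade spatial resolution for richer features.
        self.enc1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.enc2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        # Decoder: project the skip features so they can be summed in.
        self.skip = nn.Conv2d(32, 64, kernel_size=1)
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        e1 = F.relu(self.enc1(x))                     # full resolution
        e2 = F.relu(self.enc2(F.max_pool2d(e1, 2)))   # half resolution
        up = F.interpolate(e2, scale_factor=2, mode="bilinear",
                           align_corners=False)
        return self.head(up + self.skip(e1))          # decode with skip sum

model = TinyEncoderDecoder()
out = model(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 2, 64, 64])
```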
**GAN Comparison**
In some sense this is a little like a GAN, with its generator and discriminator, except that the two are switched around: the encoder resembles a discriminator, compressing an image down to features, while the decoder resembles a generator, expanding features back out into an image.
"WEBVTTKind: captionsLanguage: enso where we left it was that we've got ourselves now a fully connected network so it makes no assumptions about the size of the input the number of parameters we're going to have it just adapts itself depending on the size of the input which for images you can imagine makes quite a lot of sense they change size quite a lot but in most other ways it acts exactly like a normal deep network we've talked about this before in other videos like the deep dream one but the deeper you go into the network the sort of higher level information we have on what's going on it's objects and animals and things rather than bits of fur and edges and the shallower we are we have much less idea but the shallower we are we also have much higher spatial resolution because we've got basically the input image size because of these max pooling layers mostly every time we down sample what we're doing is we're taking a small group of pixels and just choosing the best of them the maximum and putting that in the output and that just halves the size of the image and halves it again and has it again and you can imagine if you've got an image of 256 by 256 we might repeat this process four five six times until we've got a very small region it's done for a couple of reasons one is that we want to be invariant to where things are in the image which means that if the dog's over to the right we still want to find it even if it's or over to the left right and so we what we don't want it to be affected by that the other the other issue quite frankly is we don't have enough video ram yet we routinely fill up multiple graphics cards each of which has 12 gigabytes on um it depends on the situation you're looking at this is only one dimension i've drawn here but it's actually two dimensional if you halve the x and y dimensions you're actually dividing the amount of memory required for the next layer by four and then by four again and then by four again and so actually you save an absolutely massive amount of ram by spatially down sampling and without it we'd be stuck with very small networks indeed but we've got this problem that yes we've worked out the cats are in the image or something like this but it's very very small right it's only a few pixels by a few pixels we've got a rough idea there's something going on here maybe we could just balloon it up like like a large linear up sampling and just sort of go well that's roughly a cat but it wouldn't be anything interesting so i guess the interesting thing happened in 2014 when jonathan long proposed a kind of a solution to this right which is essentially a smarter up sampling what we do is we we essentially reverse this process basically we have a sort of an up sample here which will maybe double the size and then we look over here and we bring in some of this interesting information as well right and then we up sample again and we go all right so we can this is now the same size as this so we can bring in some of this information and when i say bring in i mean literally add these to these and we can have convolutional layers here to learn that mapping so we can take nothing from here or everything from here it doesn't really matter and finally we up sample back to the original size and we bring this in here using a sum now what we've actually done is a kind of smart way of making this bigger i mean it's kind of you've got to kind of try and get your head around it but these features are very sure what's in the image but only roughly where it is 
these features a much higher pixel resolution they're much more sure in some sense where things are but not exactly what they are right so you could imagine in an intuitive way we're saying this is a cat and down here we've seen some texturey fur let's combine them together to outline exactly where the cat is this is a kind of idea and you can use this for all kinds of things so people have used it for segmentation or we call semantic segmentation which is where you will label each pixel with a class depending on what is in that pixel traditional segmentation usually meant background and foreground now semantic segmentation means maybe hundreds of classes so for instance in the image i'm seeing here it might be you the table the computer the desk the window yeah this kind of thing and there's a huge amount of different applications for that kind of thing so on a basic level you could imagine just trying to find one object specifically in a scene so just for people it's either person or it's background we don't care what else or you could be training this on something like imagenet with lots and lots of classes or i mean there's the ms coco data set for example that has lots and lots of classes and so you're trying to find airplanes and cars and things and people do this on street scene segmentation as well so you could say look given this picture of a road where is the road where is the pavement what's a building where are the road signs and actually analyze the entire scene which which is obviously really really quite powerful the other thing is that you don't have to segment the image instead of segmenting it you can just try and find objects you can say instead of just outlining where an object is yes or no why don't we try and draw a little heat map of where we think it is and then we can pinpoint objects so we can say where the two pupils on a face or can we draw around someone's face or their nose or their forehead so that we can then fit a model to that so aaron was doing this in his network where he was actually predicting the 3d and positional information of a face based just on a picture and you've all had to go with that we've also been using it for human pose estimation so where's the right hand where's the left hand what pose is this person currently doing which obviously you can imagine has lots of implications for things like um connect sensors and sort of interactive games but also you know pedestrian tracking and and loads of other examples of things where it might be useful to know what a person is up to and finally we're using obviously in plant science to try and count objects and localize objects so where's the disease on this image can we produce a heat map that shows exactly where it is where are the ears of wheat in this image can we count the number of spikelets to get an estimate of how much yield this wheat is producing compared to this wheat right and then we can start to run experiments on you know these ones are water stressed does that mean this one's better this kind of thing so this is called an encoder decoder because sometimes what we're doing is we're encoding our spatial information into some kind of features of what's going on in the in the scene in general we remove the spatial resolution in exchange for learning more about the scene and then we bring it back in by finding detail from earlier parts of the network and bringing them in as well that's the decoding stage in some sense this is a little bit like a gan in the sense but this is the generator 
here and this is the discriminator it's just that you would switch them around but let's not go not over complicate things and this one lit up which is maybe pause and maybe this one lit up because here was a few lines in a row and this one is sort of furry texture or something you know and we're getting lower and lower level as we go throughso where we left it was that we've got ourselves now a fully connected network so it makes no assumptions about the size of the input the number of parameters we're going to have it just adapts itself depending on the size of the input which for images you can imagine makes quite a lot of sense they change size quite a lot but in most other ways it acts exactly like a normal deep network we've talked about this before in other videos like the deep dream one but the deeper you go into the network the sort of higher level information we have on what's going on it's objects and animals and things rather than bits of fur and edges and the shallower we are we have much less idea but the shallower we are we also have much higher spatial resolution because we've got basically the input image size because of these max pooling layers mostly every time we down sample what we're doing is we're taking a small group of pixels and just choosing the best of them the maximum and putting that in the output and that just halves the size of the image and halves it again and has it again and you can imagine if you've got an image of 256 by 256 we might repeat this process four five six times until we've got a very small region it's done for a couple of reasons one is that we want to be invariant to where things are in the image which means that if the dog's over to the right we still want to find it even if it's or over to the left right and so we what we don't want it to be affected by that the other the other issue quite frankly is we don't have enough video ram yet we routinely fill up multiple graphics cards each of which has 12 gigabytes on um it depends on the situation you're looking at this is only one dimension i've drawn here but it's actually two dimensional if you halve the x and y dimensions you're actually dividing the amount of memory required for the next layer by four and then by four again and then by four again and so actually you save an absolutely massive amount of ram by spatially down sampling and without it we'd be stuck with very small networks indeed but we've got this problem that yes we've worked out the cats are in the image or something like this but it's very very small right it's only a few pixels by a few pixels we've got a rough idea there's something going on here maybe we could just balloon it up like like a large linear up sampling and just sort of go well that's roughly a cat but it wouldn't be anything interesting so i guess the interesting thing happened in 2014 when jonathan long proposed a kind of a solution to this right which is essentially a smarter up sampling what we do is we we essentially reverse this process basically we have a sort of an up sample here which will maybe double the size and then we look over here and we bring in some of this interesting information as well right and then we up sample again and we go all right so we can this is now the same size as this so we can bring in some of this information and when i say bring in i mean literally add these to these and we can have convolutional layers here to learn that mapping so we can take nothing from here or everything from here it doesn't really matter and 
finally we up sample back to the original size and we bring this in here using a sum now what we've actually done is a kind of smart way of making this bigger i mean it's kind of you've got to kind of try and get your head around it but these features are very sure what's in the image but only roughly where it is these features a much higher pixel resolution they're much more sure in some sense where things are but not exactly what they are right so you could imagine in an intuitive way we're saying this is a cat and down here we've seen some texturey fur let's combine them together to outline exactly where the cat is this is a kind of idea and you can use this for all kinds of things so people have used it for segmentation or we call semantic segmentation which is where you will label each pixel with a class depending on what is in that pixel traditional segmentation usually meant background and foreground now semantic segmentation means maybe hundreds of classes so for instance in the image i'm seeing here it might be you the table the computer the desk the window yeah this kind of thing and there's a huge amount of different applications for that kind of thing so on a basic level you could imagine just trying to find one object specifically in a scene so just for people it's either person or it's background we don't care what else or you could be training this on something like imagenet with lots and lots of classes or i mean there's the ms coco data set for example that has lots and lots of classes and so you're trying to find airplanes and cars and things and people do this on street scene segmentation as well so you could say look given this picture of a road where is the road where is the pavement what's a building where are the road signs and actually analyze the entire scene which which is obviously really really quite powerful the other thing is that you don't have to segment the image instead of segmenting it you can just try and find objects you can say instead of just outlining where an object is yes or no why don't we try and draw a little heat map of where we think it is and then we can pinpoint objects so we can say where the two pupils on a face or can we draw around someone's face or their nose or their forehead so that we can then fit a model to that so aaron was doing this in his network where he was actually predicting the 3d and positional information of a face based just on a picture and you've all had to go with that we've also been using it for human pose estimation so where's the right hand where's the left hand what pose is this person currently doing which obviously you can imagine has lots of implications for things like um connect sensors and sort of interactive games but also you know pedestrian tracking and and loads of other examples of things where it might be useful to know what a person is up to and finally we're using obviously in plant science to try and count objects and localize objects so where's the disease on this image can we produce a heat map that shows exactly where it is where are the ears of wheat in this image can we count the number of spikelets to get an estimate of how much yield this wheat is producing compared to this wheat right and then we can start to run experiments on you know these ones are water stressed does that mean this one's better this kind of thing so this is called an encoder decoder because sometimes what we're doing is we're encoding our spatial information into some kind of features of what's going on in the in the 
scene in general we remove the spatial resolution in exchange for learning more about the scene and then we bring it back in by finding detail from earlier parts of the network and bringing them in as well that's the decoding stage in some sense this is a little bit like a gan in the sense but this is the generator here and this is the discriminator it's just that you would switch them around but let's not go not over complicate things and this one lit up which is maybe pause and maybe this one lit up because here was a few lines in a row and this one is sort of furry texture or something you know and we're getting lower and lower level as we go through\n"