Involution - Inverting the Inherence of Convolution for Visual Recognition (Research Paper Explained)

Involution combines two distinct ideas. First, the kernel applied at each location is constructed on the fly from that location's own features, so the weights are position specific even though the small network that generates them is itself position agnostic, much like fast weights versus slow weights. Second, the generated kernel is shared across channels: a broadcasting scheme applies the same spatial weights to every channel (or to each group of channels). Dynamic kernel generation and channel-wise weight sharing are independent concepts, but the paper combines them to good effect.
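
To make the mechanism concrete, here is a minimal PyTorch sketch of an involution layer, written under my own assumptions (the class name Involution2d, a bottleneck reduction of 4, and 16 channel groups are choices for illustration, not the authors' official implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Involution2d(nn.Module):
    """Sketch of involution: per-pixel kernels generated from the pixel itself,
    shared (broadcast) across groups of channels. Illustrative, not the paper's code."""
    def __init__(self, channels: int, kernel_size: int = 7, groups: int = 16, reduction: int = 4):
        super().__init__()
        self.k, self.groups = kernel_size, groups
        # Two-layer bottleneck (1x1 conv -> ReLU -> 1x1 conv) that maps each pixel's
        # channel vector to a kernel of size groups * k * k for that location.
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, groups * kernel_size * kernel_size, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        k, g = self.k, self.groups
        # 1) Generate a kernel at every location from that location's features alone.
        kernels = self.span(self.act(self.reduce(x))).view(b, g, 1, k * k, h, w)
        # 2) Gather the k x k neighbourhood around every pixel.
        patches = F.unfold(x, k, padding=k // 2).view(b, g, c // g, k * k, h, w)
        # 3) Broadcast the kernel over the channels of each group, multiply, and sum
        #    over the neighbourhood (no reduction across channels).
        out = (kernels * patches).sum(dim=3)
        return out.view(b, c, h, w)

x = torch.randn(2, 64, 32, 32)
y = Involution2d(64)(x)          # drop-in spatial mixing layer, channel count unchanged
print(y.shape)                   # torch.Size([2, 64, 32, 32])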

The channel-sharing part of involution is closely related to a depthwise-separable convolution in which the weights are shared across channels: instead of one kernel per output channel, a single k-by-k kernel is broadcast across all channels (or across each group of channels), and there is no reduction over input channels. This cuts the parameter count substantially and exploits the redundancy present across the channels of a feature map.
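
Taken on its own, the broadcasting idea looks roughly like the following sketch (my own construction, not code from the paper): a depthwise convolution whose single k-by-k kernel is shared by every channel, so the layer stores only k*k weights no matter how many channels the input has.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSharedConv2d(nn.Module):
    """One static k x k kernel shared across all channels (the broadcasting idea in isolation)."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        self.weight = nn.Parameter(torch.randn(1, 1, kernel_size, kernel_size) / kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Replicate the shared kernel for every channel and run a depthwise (grouped) conv;
        # gradients still accumulate into the single shared parameter.
        weight = self.weight.repeat(c, 1, 1, 1)
        return F.conv2d(x, weight, padding=self.k // 2, groups=c)

x = torch.randn(2, 64, 32, 32)
print(ChannelSharedConv2d(3)(x).shape)   # torch.Size([2, 64, 32, 32]), with only 9 learned weights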

The experiments demonstrate the effectiveness of the approach. On ImageNet, the involution-based networks are compared against ResNets, stand-alone self-attention networks, and axial ResNets at similar or lower parameter counts and still come out ahead, and the paper reports top-1 accuracy on par with ResNet-101 while saving about 65% of storage and computation. The reviewer attributes the accuracy gains mainly to the on-the-fly kernel generation and the savings mainly to the weight sharing across channels.
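
A back-of-the-envelope comparison shows where savings of that order can come from; the numbers below are assumptions chosen for illustration (channels, kernel size, groups, bottleneck ratio), not the paper's exact configuration:

# Per-layer weight counts under assumed settings (illustrative only).
C, K, G, r = 256, 7, 16, 4                                   # channels, kernel size, groups, reduction

conv_params = K * K * C * C                                  # standard K x K convolution, C -> C channels
involution_params = C * (C // r) + (C // r) * (G * K * K)    # 1x1 reduce conv + 1x1 span conv (biases ignored)

print(conv_params)          # 3211264
print(involution_params)    # 66560, roughly 2% of the convolution's weights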

The results also show how the gap behaves with model scale: as the networks get larger, the differences between the architectures shrink continuously, all on the same dataset. In smaller compute and data regimes, however, the compute-versus-accuracy trade-off of involution looks very favorable, and that is where many practitioners actually operate.

This raises the question of how the approach behaves in different regimes. The reviewer speculates that sharing weights across channels costs little accuracy on these tasks, and may even help when data is limited, while at very large scale the approach could lose out to more general architectures such as attention. Involution therefore seems most attractive when data or compute is scarce.

The paper also discusses the relation to attention: like attention, involution computes its weights on the fly from the data, but it uses only the center pixel, produces no keys, and has no quadratic interaction or softmax, so framing self-attention as an over-complicated instantiation of involution is, in the reviewer's view, an overstatement. To recover some channel-specific behavior, the kernels are shared per group of channels rather than across all channels, analogous to groups or heads in multi-head attention; this preserves accuracy while keeping the parameter count low and permits larger kernels under the same budget.
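
The contrast with self-attention can be boiled down to a few lines (purely illustrative; the generator network here is a hypothetical stand-in for the bottleneck described above, not the paper's code):

import torch
import torch.nn as nn

def local_attention_weights(center_q: torch.Tensor, neighborhood_k: torch.Tensor) -> torch.Tensor:
    # center_q: (C,), neighborhood_k: (K*K, C) -> quadratic query-key interaction plus softmax.
    return torch.softmax(neighborhood_k @ center_q, dim=0)

def involution_weights(center_feat: torch.Tensor, generator: nn.Module) -> torch.Tensor:
    # The aggregation weights are read off directly from the center pixel: no keys, no softmax.
    return generator(center_feat)

k, c = 3, 16
generator = nn.Sequential(nn.Linear(c, c // 4), nn.ReLU(), nn.Linear(c // 4, k * k))
print(local_attention_weights(torch.randn(c), torch.randn(k * k, c)).shape)  # torch.Size([9])
print(involution_weights(torch.randn(c), generator).shape)                   # torch.Size([9])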

It would also be interesting to use parts of the approach in isolation, for example keeping standard convolution but adding the channel weight-sharing and broadcasting scheme, or testing how well involution networks pre-train and fine-tune. The reviewer suggests this is especially worth trying in low-resource settings.

The code for their experiments and implementation is available online, allowing readers to explore the details of their approach further. This provides a valuable resource for researchers and practitioners interested in exploring new ideas and architectures for neural networks.

In conclusion, the work shows what careful rethinking of a core operator can buy: by generating kernels on the fly from each pixel and broadcasting them across channels, involution matches or beats comparable convolutional and self-attention baselines with fewer parameters and less computation. The approach has clear benefits and some open questions about large-scale behavior, and the released code makes it easy to explore further.

hello there today we're looking at involution inverting the inheritance of convolution for visual recognition by number of researchers of the hong kong university of science and technology bite dance ai lab and peking university in this paper on a high level the researchers try to replace the good old convolution operator in cnns by this new thing called an involution in its essence involution is about halfway between a convolution and a self-attention kind of operation and turns out that with some clever weight sharing scheme you can achieve very good performance compared to cnn's and self-attention networks while keeping the number of parameters and the computational cost relatively low this i think is very much worth trying for anyone who does not operate on extremely large scale problems um yeah so we'll get into that a bit more when we go into the experiments but for now let's go through the paper through what involution is what it does how it's different and yeah so if you like this you know don't hesitate share it out it would help a lot we're on the road to 100k subscribers and with every subscriber i get a subscriber i stole that joke um so they say here in the abstract convolution has been the core ingredient of modern neural networks triggering the surge of deep learning in vision which you know correct uh alexnet resnets etc convolution even though transformers are slowly taking over computer vision convolutions are still very very much uh used and if you're not on a super large scale problem a convolutional neural network is still very probably the best way to go if you have a computer vision problem they say we rethink the inherent principles of standard convolution for vision tasks specifically spatial agnostic and channel specific instead we present a novel atomic operation for deep neural networks by inverting the aforementioned design principles of convolution coined an involution okay and they say we additionally demystify the recent popular self-attention operator and subsume it into our involution family as an over complicated instantiation so the um a lot of statements in this paper are let's they are true uh especially further down a lot of the experiments are really cool but it is a bit of an overstatement uh what they say uh right here so their claim is that in if you have a convolution what you do you do something that's spatial agnostic and channel specific which means that in a in a convolutional neural network when you have an image let's say with a bunch of pixels these are now true pixels not patches and you run a convolutional layer over it you run a convolutional kernel over it you put the center of the kernel at some pixel then so the kernel will be something like a three by three kernel you put that on the center here so it overlaps here you multiply element wise and then you aggregate and you can do that in multiple channels but essentially you do that and then after you've done that you move you move the kernel one let's say to the right you shift it so the center is here you do the same thing again and you shift it you do the same thing again so it's spatial agnostic because it repeats the same computation over and over and over across the image and it doesn't care where the computation is right it does the same computation and that is the selling point of convolutional neural networks they are translation invariant this is it's a form of weight sharing right you share the weights across the locations and therefore you
don't really care where stuff is in the image that the cnn will be able to recognize it just as well and you don't need to learn over and over and over the same principle just because it's in different parts of the image so this is spatial agnostic what does channel specific mean that for that we have to go into the multiple channels uh realm so if if your image has multiple channels let's say i draw a new image right here with a bunch of pixels and it has multiple channels that means you can imagine it sort of as a as a 3d tensor here where each pixel is a column and every column is a vector of a certain dimensionality i mean so the original image has of course three channels which is red green and blue but if you have intermediate representations these channels can grow to sizes of hundreds of channels and the the point of the channels is every entry here is a number and every number can sort of capture one aspect of what's described in that particular pixel right so maybe the first channel is uh is there a corner the second one is there an edge the third one is there is it was it originally a blue pixel uh the fourth one is there probably a cat here and so on so that these are like the different features in in the channels and the convolution operator is channel specific that means if you have the kernel now convolutional kernels aren't as easy as i drew them they're in fact four dimensional tensors so that is um they are four dimensional tensors which makes it a little bit complicated for me to draw honestly however if you you can imagine that you have one kernel like so okay that has the same amount of channels as your image okay so now you can still do the same operation right you can overlay your kernel on a part of the image you can overlay it like so and no that's in the back and then you can do element wise multiplication and then you do an sum you sum it all up right after you do this operation you do a big sum over all the elements of whatever your kernel multiplied with your image and that gives you one number you do an all reduce one number gives you one number and so you do this so this is one kernel but you have another one right here um yeah like this and you do the same thing and that gives you also one number and you have another kernel i think you you get the idea you have another kernel here so you have many of those kernels per layer when you actually if you've never looked at you know how the weights look when you instantiate these layers in a deep learning framework encourage you to do so right a a convolutional layer will have weights that are of the size kernel size by kernel size by input channels by output channels so it's a 4d tensor and this the orange part here is just one of those sub tensors in fact you have as many as you have output channels and that gives you of course when you then go over all of these that gives you the next layer so that becomes in the next layer um so this is the next layer representation right at the point where you overlaid the the kernel in the last thing that will become this column right here okay so you have the orange thing in the first the blue thing in the second channel green thing in the third channel and so on hope this is relatively clear so you have in fact one convolutional kernel per output channel okay so if if you call the orange thing here a convolutional kernel then you have one kernel per output channel and that means it's channel specific okay so this um this is a conscious choice and it makes sense when you 
think about it because the each output channel means something different right if i want if my output channel means is there a cat at this particular location then i might want to aggregate the last layer's representation differently than if my output channel says well is this uh part of the sky or if the is is there a corner here or something like this so i want to aggregate the weights differently that's why i have to have a different set of weights here here and here because they mean different things so it's spatial agnostic because it does the same computation at every location it's channel specific because it does a different computation at each channel even though it does it for all the locations equally all right so now we're prepared to invert that so convo involution promises we invert this what we want to do is something spatial specific and channel agnostic okay so the first the first thing here is the channel agnostic um if you've seen my last video about mlp mixer this is very much the same idea and the idea is just of hey why do we have different things here uh why do we have different computations can't we just you know apply the same principle we apply to the spatial thing uh where we say you know we just slide the same computation over the image and that is generally fine that's weight sharing it's actually good um why don't we just do this here why don't we aggregate the information in the same way for for all the the different channels and yeah so you can do that you can just have one kernel so instead of having a number of output channels many kernel so you the involution will come up with simply one kernel that it shares across all of the that it shares across all of the channels they have a little picture down here and just look at the at the last step right here so here well sorry i crossed that out here this is the kernel that they they have um sorry it's not even it's not even by number of channels it's actually you just flatten this thing right so it's a k by k by one kernel and you simply push that put that over a location in the image and then you share the computation across so the the image here given that this is all in the same colors it means that you just multiply you broadcast that's the word i was looking for you broadcast the operation across the channels and then you aggregate after that so you can see what involution does is broadcast and then not reduce right you don't reduce at the end to a single number but you keep the uh channels the the channels as they are that's why you only need a k by k pi one because you don't have the different computation for each output channel and you don't reduce across the input channels so you get away with with a lot less parameters so i that's even wrong here just a k by k kernel now that's that's one part the other part is why don't we do something that's um spatial specific spatial specific and now remember what spatial agnostic was spatial agnostic was is we slide the same kernel across the image what they're saying in first instance they're saying things like um or they said something don't know where it was in the picture but they say well what we could do is if we have an image right if we have an image big image and we do something spatial specific what that means is we could have a kernel that's just as big as the image right then no more no more sliding across it it's simply you multiply those things together you broadcast it across these uh across these channels of the image and there you go right that's 
that's it also uh something that that mlp mixer does right they they just say whatever we don't do slidey slidey anymore um we simply i mean they they do weight sharing but essentially you're trying to get rid of this sliding over you have different weight for each location and that means that the computation actually differs from where stuff is in the image and we know that that is somewhat important because usually the sky is up and uh you know objects in these natural images that humans take might be more in the middle than anywhere else and text goes from left to right and so it's not all super translation and location invariant so it makes sense to have weights that are different for each position but then they run into a problem they say we we couldn't do that very well because now um now it we we can't just input pictures of different resolutions right that's one problem i think the other problem is that this might not work too well um so they come up with a different thing they say can't we make a compromise and they don't call it a compromise they they call it something different but they they say look can we come up with a scheme where we can retain a kernel that's approximately this size like a small kernel but it is different for each location so we still do the sort of classic convolution way of doing things in that we do these local aggregations across neighboring pixels however the kernel that we use here is different from the kernel that we use here and that's different from the kernel that we use here so how could you make a computation where the kernel is always different you do that by coming up with the kernel in a dynamic way so the the authors here they say okay if let's say we're at this pixel right here we care about this neighborhood how can we come up on the fly with a kernel for this particular pixel and their answer is well let's just generate it from the pixel so this is the full involution diagram we've now arrived at this so they are at this neighborhood which is outlined here in this black um black scaffolding grid thing the center pixel is the red pixel here this one and they say we look at that pixel and all its channels and we use that pixel and only that pixel so not the neighborhood we use that pixel to come up with the kernel so they have a computation here which of course is going to be a a small neural network so this is a two layer uh neural network that comes up with the kernel you see this this is simply a here is just a reshape this so you compute the kernel across the neighborhood from the pixel itself okay and that means that every single pixel here unless it's the exact same uh pixel so the exact same color in the first layer but or the exact same representation in the intermediate layers every single location gets its own kernel for the convolution the computation i've already told you is a small neural network specifically it's sort of a bottleneck neural network so it takes the pixel uh representation as a vector sort of bottlenecks it there is a non-linearity here and then it expands it again to the size of the actual kernel okay and then you use that kernel and you broadcast it instead of having one kernel per input channel and then you multiply and then you don't reduce by across the input channels okay sorry uh yeah i said that's it and and that alleviates you from having to have multiple kernels one for each output channel okay now this is the whole involution pipeline there are i would say there are multiple different concepts here so 
this coming up with the kernel on the fly is one concept and then this broadcasting scheme is an entirely different concept you could do both independently of each other and they do them together um which which i yeah they do ablations further down um but it's sort of two new things in one now the first thing here is very much you might you might think of a tension mechanism as you um as you look at that because it's a form of fast weights right so the weights of the computation they are computed on the fly from the data itself and that is exactly what an attention mechanism does however here you do it in a slightly different way and they say that they have a discussion across about attention right here so they say you know there are there are a bunch of differences so in attention what you'd have is you don't only have you don't only compute your weights from the actual location where you are even in local self-attention you actually compute your weights from more than just the pixel where you are you compute it from the entire region you care about so that's the first thing and then the second thing is you don't in self-attention you have the queries and the keys right so you have your your data your neighborhood let's say and each of those things produces a query and a key right query and i'm going to write the key up here everyone produces a query and a key and then you do this sort of quadratic thing in order to determine what like how you should aggregate your information not in involution in involution you simply don't produce keys you only produce queries if you will or only keys however you want to look at it and then you don't do the quadratic thing rather you immediately interpret this as sort of the weights of aggregation you can write this and they say that you can write this you can interpret this as the positional encodings already being present in these weights because it's now specific to a position whereas in the attention literature you'd have to supply positional encodings so in order for the algorithm to know you know that that this is a different thing sorry that this here is a different thing from this thing here you need to supply it with positional encodings not here because you know the individual channels of this thing immediately refer to different positions right here so the this neural network is very aware what position is where relative to the pixel you're considering so they say this the success of involution explains in part uh why other people had lots of success with leaving away the keys and only using positional encodings together with the query and if i'm not mistaken this is a thing i think you could frame the lambda networks into this category where um at some point like they never do this attention however they they only they rely heavily on positional encodings however you can learn those ahead of time right or or statically all right that's enough of a so this is the connection to attention the connection to attention is the weights are constructed on the fly however here there's no quadratic interaction um there is no soft max and so on just you construct the weights from the pixel in the center uh therefore it's less powerful to frame attention as like well it's a more complicated instantiation of our idea that's a bit that's a bit out there like the authors here they say well attention is just a more complicated thing of our thing and the second thing i i worry a bit about is this is they say well this is um position specific or location 
specific right they started out with saying convolution is spatial agnostic we want to do something spatial specific this here is also spatial agnostic like if if you get the same pixel at different locations in the image this thing will produce the same weights and the computation will be the same in fact you do this entire computation right here that is a spatially agnostic computation it's just so the difference here is the same difference that you have between slow weights and fast weights where you simply construct the weights of the actual computation on the fly however the way you construct these weights it remains position agnostic so that's the first thing and the second thing yeah the weight sharing i feel is a bit of independent thing now i get it that the two work well together but the the broadcasting and weight sharing thing across the channels it's it's almost a separate um much simpler uh mention and it's a bit related to so if you have a if you have a depth uh separated convolution and you simply share the weights across that that's about what it what it boils down to so so what does that give us in fact it gives us a lot in this paper they do experiments and they compare against uh for example so against resnets and other networks with similar number of parameters and i like these experiments here in that you can see they always make sure that they have the lowest number of parameters among the things they compare with right yet they show that they still beat uh these models they still they still are better than the models they compare to so they do that and specifically i guess they compare to res net with the same number of layers standalone resnet this i think is self-attention um i think they here's this axial resnet so that has a little bit less uh parameters interestingly enough but yeah so you can see that this outperforms on these tasks right here so this is imagenet they also have different things such as this segmentation task i think they have a picture down here the segmentation task where they perform better so here i think this is the baseline and you can see the involution network it does a better job at this kind of things which which is believable i think the effect that you see right here the fact that the fact that you're they are better in this number is really cool and it's probably a bit you know due to the fact that they do this on the fly computation of weights which is a more powerful idea than the static weights of a convolution and then the lower number of parameters i think is more a result of their weight sharing scheme right they tout here how that they um is on par with resnet 101 regarding the top one recognition accuracy while saving 65 percent of storage and computation so i think that the saving of computation is more due to the weight sharing mechanism and i think they've they've just here selected tasks and they might be important tasks but i think it was just the case that in these tasks um the whether or not you share the weights probably doesn't matter doesn't hit you as hard or is even beneficial if you don't have enough data and therefore that's why they have less parameters so um this what you can also observe here is that differences they get continuously smaller as you move up the scale of network now this is all on the same data set but it would be interesting to see how this performs on you know really large scale because my my intuition is that you know as you go larger and larger in scale this approach is gonna is gonna top 
out and lose out to the more general architectures like attention and you know whatever mlp's apparently it's a clown world now but in in the regimes in these regimes and i would argue these are the regimes where a lot of practitioners care about these and actually smaller regimes so not many people are in the super high data regime this seems to perform reasonably well right um so you can see right here the the the curves here uh when you compare compute to accuracy is very favorable um as a again especially if you're in like this region here uh if you're in the in the low resource region it might be something that you want to try out it remains to be seen how how well this is pre-trainable and fine-tunable and so on but it's something you might want to try also if you try try to only use sort of parts of it it would be interesting to see you know if we if we still do convolution but we do this sort of weight sharing scheme this broadcasting scheme and yeah they also have a notion of of grouping in in the channels so as i think the i think as the attention mechanism yeah has it so they here they say it however sharing a single kernel across all channels obviously underperforms inaccuracy uh considering channel redundancy of involution kernels as long as it's setting the channels shared in a group to an acceptable range channel agnostic behavior will not only reserve i guess preserve the performance but also reduce the parameter count and computational cost this will also permit the larger kernel size under the same budget so it's sort of the same reasoning as people introducing groups or different heads in multi-head attention yeah so try all of this stuff out i think it's worth it the code is available code is available right here and i'll also put a link to that and that was it from me for this paper i wish you a very pleasant whatever the day of the week is and bye bye
we rethink the inherent principles of standard convolution for vision tasks specifically spatial agnostic and channel specific instead we present a novel atomic operation for deep neural networks by inverting the aforementioned design principles of convolution coined an involution okay and they say we additionally demystify the recent popular self-attention operator and subsume it into our involution family as an over complicated instantiation so the um a lot of statements in this paper are let's they are true uh especially further down a lot of the experiments are really cool but it is a bit of an overstatement uh what they say uh right here so their claim is that in if you have a convolution what you do you do something that's spatial agnostic and channel specific which means that in a in a convolutional neural network when you have an image let's say with a bunch of pixels these are now true pixels not patches and you run a convolutional layer over it you run a convolutional kernel over it you put the center of the kernel at some pixel then so the kernel will be something like a three by three kernel you put that on the center here so it overlaps here you multiply element wise and then you aggregate and you can do that in multiple channels but essentially you do that and then after you've done that you move you move the kernel one let's say to the right you shift it so the center is here you do the same thing again and you shift it you do the same thing again so it's spatial agnostic because it repeats the same computation over and over and over across the image and it doesn't care where the computation is right it does the same computation and that is the selling point of convolutional neural networks they are translation invariant this is it's a form of weight sharing right you share the weights across the locations and therefore you don't really care where stuff is in the image that the cnn will be able to recognize it just as well and you don't need to learn over and over and over the same principle just because it's in different parts of the image so this is spatial agnostic what does channel specific mean that for that we have to go into the multiple channels uh realm so if if your image has multiple channels let's say i draw a new image right here with a bunch of pixels and it has multiple channels that means you can imagine it sort of as a as a 3d tensor here where each pixel is a column and every column is a vector of a certain dimensionality i mean so the original image has of course three channels which is red green and blue but if you have intermediate representations these channels can grow to sizes of hundreds of channels and the the point of the channels is every entry here is a number and every number can sort of capture one aspect of what's described in that particular pixel right so maybe the first channel is uh is there a corner the second one is there an edge the third one is there is it was it originally a blue pixel uh the fourth one is there probably a cat here and so on so that these are like the different features in in the channels and the convolution operator is channel specific that means if you have the kernel now convolutional kernels aren't as easy as i drew them they're in fact four dimensional tensors so that is um they are four dimensional tensors which makes it a little bit complicated for me to draw honestly however if you you can imagine that you have one kernel like so okay that has the same amount of channels as your image okay so now you can still 
do the same operation right you can overlay your kernel on a part of the image you can overlay it like so and no that's in the back and then you can do element wise multiplication and then you do an sum you sum it all up right after you do this operation you do a big sum over all the elements of whatever your kernel multiplied with your image and that gives you one number you do an all reduce one number gives you one number and so you do this so this is one kernel but you have another one right here um yeah like this and you do the same thing and that gives you also one number and you have another kernel i think you you get the idea you have another kernel here so you have many of those kernels per layer when you actually if you've never looked at you know how the weights look when you instantiate these layers in a deep learning framework encourage you to do so right a a convolutional layer will have weights that are of the size kernel size by kernel size by input channels by output channels so it's a 4d tensor and this the orange part here is just one of those sub tensors in fact you have as many as you have output channels and that gives you of course when you then go over all of these that gives you the next layer so that becomes in the next layer um so this is the next layer representation right at the point where you overlaid the the kernel in the last thing that will become this column right here okay so you have the orange thing in the first the blue thing in the second channel green thing in the third channel and so on hope this is relatively clear so you have in fact one convolutional kernel per output channel okay so if if you call the orange thing here a convolutional kernel then you have one kernel per output channel and that means it's channel specific okay so this um this is a conscious choice and it makes sense when you think about it because the each output channel means something different right if i want if my output channel means is there a cat at this particular location then i might want to aggregate the last layer's representation differently than if my output channel says well is this uh part of the sky or if the is is there a corner here or something like this so i want to aggregate the weights differently that's why i have to have a different set of weights here here and here because they mean different things so it's spatial agnostic because it does the same computation at every location it's channel specific because it does a different computation at each channel even though it does it for all the locations equally all right so now we're prepared to invert that so convo involution promises we invert this what we want to do is something spatial specific and channel agnostic okay so the first the first thing here is the channel agnostic um if you've seen my last video about mlp mixer this is very much the same idea and the idea is just of hey why do we have different things here uh why do we have different computations can't we just you know apply the same principle we apply to the spatial thing uh where we say you know we just slide the same computation over the image and that is generally fine that's weight sharing it's actually good um why don't we just do this here why don't we aggregate the information in the same way for for all the the different channels and yeah so you can do that you can just have one kernel so instead of having a number of output channels many kernel so you the involution will come up with simply one kernel that it shares across all of 
the that it shares across all of the channels they have a little picture down here and just look at the at the last step right here so here well sorry i crossed that out here this is the kernel that they they have um sorry it's not even it's not even by number of channels it's actually you just flatten this thing right so it's a k by k by one kernel and you simply push that put that over a location in the image and then you share the computation across so the the image here given that this is all in the same colors it means that you just multiply you broadcast that's the word i was looking for you broadcast the operation across the channels and then you aggregate after that so you can see what involution does is broadcast and then not reduce right you don't reduce at the end to a single number but you keep the uh channels the the channels as they are that's why you only need a k by k pi one because you don't have the different computation for each output channel and you don't reduce across the input channels so you get away with with a lot less parameters so i that's even wrong here just a k by k kernel now that's that's one part the other part is why don't we do something that's um spatial specific spatial specific and now remember what spatial agnostic was spatial agnostic was is we slide the same kernel across the image what they're saying in first instance they're saying things like um or they said something don't know where it was in the picture but they say well what we could do is if we have an image right if we have an image big image and we do something spatial specific what that means is we could have a kernel that's just as big as the image right then no more no more sliding across it it's simply you multiply those things together you broadcast it across these uh across these channels of the image and there you go right that's that's it also uh something that that mlp mixer does right they they just say whatever we don't do slidey slidey anymore um we simply i mean they they do weight sharing but essentially you're trying to get rid of this sliding over you have different weight for each location and that means that the computation actually differs from where stuff is in the image and we know that that is somewhat important because usually the sky is up and uh you know objects in these natural images that humans take might be more in the middle than anywhere else and text goes from left to right and so it's not all super translation and location invariant so it makes sense to have weights that are different for each position but then they run into a problem they say we we couldn't do that very well because now um now it we we can't just input pictures of different resolutions right that's one problem i think the other problem is that this might not work too well um so they come up with a different thing they say can't we make a compromise and they don't call it a compromise they they call it something different but they they say look can we come up with a scheme where we can retain a kernel that's approximately this size like a small kernel but it is different for each location so we still do the sort of classic convolution way of doing things in that we do these local aggregations across neighboring pixels however the kernel that we use here is different from the kernel that we use here and that's different from the kernel that we use here so how could you make a computation where the kernel is always different you do that by coming up with the kernel in a dynamic way so the 
the authors here they say okay if let's say we're at this pixel right here we care about this neighborhood how can we come up on the fly with a kernel for this particular pixel and their answer is well let's just generate it from the pixel so this is the full involution diagram we've now arrived at this so they are at this neighborhood which is outlined here in this black um black scaffolding grid thing the center pixel is the red pixel here this one and they say we look at that pixel and all its channels and we use that pixel and only that pixel so not the neighborhood we use that pixel to come up with the kernel so they have a computation here which of course is going to be a a small neural network so this is a two layer uh neural network that comes up with the kernel you see this this is simply a here is just a reshape this so you compute the kernel across the neighborhood from the pixel itself okay and that means that every single pixel here unless it's the exact same uh pixel so the exact same color in the first layer but or the exact same representation in the intermediate layers every single location gets its own kernel for the convolution the computation i've already told you is a small neural network specifically it's sort of a bottleneck neural network so it takes the pixel uh representation as a vector sort of bottlenecks it there is a non-linearity here and then it expands it again to the size of the actual kernel okay and then you use that kernel and you broadcast it instead of having one kernel per input channel and then you multiply and then you don't reduce by across the input channels okay sorry uh yeah i said that's it and and that alleviates you from having to have multiple kernels one for each output channel okay now this is the whole involution pipeline there are i would say there are multiple different concepts here so this coming up with the kernel on the fly is one concept and then this broadcasting scheme is an entirely different concept you could do both independently of each other and they do them together um which which i yeah they do ablations further down um but it's sort of two new things in one now the first thing here is very much you might you might think of a tension mechanism as you um as you look at that because it's a form of fast weights right so the weights of the computation they are computed on the fly from the data itself and that is exactly what an attention mechanism does however here you do it in a slightly different way and they say that they have a discussion across about attention right here so they say you know there are there are a bunch of differences so in attention what you'd have is you don't only have you don't only compute your weights from the actual location where you are even in local self-attention you actually compute your weights from more than just the pixel where you are you compute it from the entire region you care about so that's the first thing and then the second thing is you don't in self-attention you have the queries and the keys right so you have your your data your neighborhood let's say and each of those things produces a query and a key right query and i'm going to write the key up here everyone produces a query and a key and then you do this sort of quadratic thing in order to determine what like how you should aggregate your information not in involution in involution you simply don't produce keys you only produce queries if you will or only keys however you want to look at it and then you don't do the quadratic 
thing rather you immediately interpret this as sort of the weights of aggregation you can write this and they say that you can write this you can interpret this as the positional encodings already being present in these weights because it's now specific to a position whereas in the attention literature you'd have to supply positional encodings so in order for the algorithm to know you know that that this is a different thing sorry that this here is a different thing from this thing here you need to supply it with positional encodings not here because you know the individual channels of this thing immediately refer to different positions right here so the this neural network is very aware what position is where relative to the pixel you're considering so they say this the success of involution explains in part uh why other people had lots of success with leaving away the keys and only using positional encodings together with the query and if i'm not mistaken this is a thing i think you could frame the lambda networks into this category where um at some point like they never do this attention however they they only they rely heavily on positional encodings however you can learn those ahead of time right or or statically all right that's enough of a so this is the connection to attention the connection to attention is the weights are constructed on the fly however here there's no quadratic interaction um there is no soft max and so on just you construct the weights from the pixel in the center uh therefore it's less powerful to frame attention as like well it's a more complicated instantiation of our idea that's a bit that's a bit out there like the authors here they say well attention is just a more complicated thing of our thing and the second thing i i worry a bit about is this is they say well this is um position specific or location specific right they started out with saying convolution is spatial agnostic we want to do something spatial specific this here is also spatial agnostic like if if you get the same pixel at different locations in the image this thing will produce the same weights and the computation will be the same in fact you do this entire computation right here that is a spatially agnostic computation it's just so the difference here is the same difference that you have between slow weights and fast weights where you simply construct the weights of the actual computation on the fly however the way you construct these weights it remains position agnostic so that's the first thing and the second thing yeah the weight sharing i feel is a bit of independent thing now i get it that the two work well together but the the broadcasting and weight sharing thing across the channels it's it's almost a separate um much simpler uh mention and it's a bit related to so if you have a if you have a depth uh separated convolution and you simply share the weights across that that's about what it what it boils down to so so what does that give us in fact it gives us a lot in this paper they do experiments and they compare against uh for example so against resnets and other networks with similar number of parameters and i like these experiments here in that you can see they always make sure that they have the lowest number of parameters among the things they compare with right yet they show that they still beat uh these models they still they still are better than the models they compare to so they do that and specifically i guess they compare to res net with the same number of layers 
standalone resnet this i think is self-attention um i think they here's this axial resnet so that has a little bit less uh parameters interestingly enough but yeah so you can see that this outperforms on these tasks right here so this is imagenet they also have different things such as this segmentation task i think they have a picture down here the segmentation task where they perform better so here i think this is the baseline and you can see the involution network it does a better job at this kind of things which which is believable i think the effect that you see right here the fact that the fact that you're they are better in this number is really cool and it's probably a bit you know due to the fact that they do this on the fly computation of weights which is a more powerful idea than the static weights of a convolution and then the lower number of parameters i think is more a result of their weight sharing scheme right they tout here how that they um is on par with resnet 101 regarding the top one recognition accuracy while saving 65 percent of storage and computation so i think that the saving of computation is more due to the weight sharing mechanism and i think they've they've just here selected tasks and they might be important tasks but i think it was just the case that in these tasks um the whether or not you share the weights probably doesn't matter doesn't hit you as hard or is even beneficial if you don't have enough data and therefore that's why they have less parameters so um this what you can also observe here is that differences they get continuously smaller as you move up the scale of network now this is all on the same data set but it would be interesting to see how this performs on you know really large scale because my my intuition is that you know as you go larger and larger in scale this approach is gonna is gonna top out and lose out to the more general architectures like attention and you know whatever mlp's apparently it's a clown world now but in in the regimes in these regimes and i would argue these are the regimes where a lot of practitioners care about these and actually smaller regimes so not many people are in the super high data regime this seems to perform reasonably well right um so you can see right here the the the curves here uh when you compare compute to accuracy is very favorable um as a again especially if you're in like this region here uh if you're in the in the low resource region it might be something that you want to try out it remains to be seen how how well this is pre-trainable and fine-tunable and so on but it's something you might want to try also if you try try to only use sort of parts of it it would be interesting to see you know if we if we still do convolution but we do this sort of weight sharing scheme this broadcasting scheme and yeah they also have a notion of of grouping in in the channels so as i think the i think as the attention mechanism yeah has it so they here they say it however sharing a single kernel across all channels obviously underperforms inaccuracy uh considering channel redundancy of involution kernels as long as it's setting the channels shared in a group to an acceptable range channel agnostic behavior will not only reserve i guess preserve the performance but also reduce the parameter count and computational cost this will also permit the larger kernel size under the same budget so it's sort of the same reasoning as people introducing groups or different heads in multi-head attention yeah so try all of 
this stuff out i think it's worth it the code is available code is available right here and i'll also put a link to that and that was it from me for this paper i wish you a very pleasant whatever the day of the week is and bye byehello there today we're looking at involution inverting the inheritance of convolution for visual recognition by number of researchers of the hong kong university of science and technology bite dance ai lab and peking university in this paper on a high level the researchers try to replace the good old convolution operator in cnns by this new thing called an involution in its essence involution is about halfway between a convolution and a self-attention kind of operation and turns out that with some clever weight sharing scheme you can achieve very good performance compared to cnn's and self-attention networks while keeping the number of parameters and the computational cost relatively low this i think is very much worth trying for anyone who does not operate on extremely large scale problems um yeah so we'll get into that a bit more when we go into the experiments but for now let's go through the paper through what involution is what it does how it's different and yeah so if you like this you know don't hesitate share it out it would help a lot we're on the road to 100k subscribers and with every subscriber i get a subscriber i stole that joke um so they say here in the abstract convolution has been the core ingredient of modern neural networks triggering the surge of deep learning in vision which you know correct uh alexnet resnets etc convolution even though transformers are slowly taking over computer vision convolutions are still very very much uh used and if you're not on a super large scale problem a convolutional neural network is still very probably the best way to go if you have a computer vision problem they say we rethink the inherent principles of standard convolution for vision tasks specifically spatial agnostic and channel specific instead we present a novel atomic operation for deep neural networks by inverting the aforementioned design principles of convolution coined an involution okay and they say we additionally demystify the recent popular self-attention operator and subsume it into our involution family as an over complicated instantiation so the um a lot of statements in this paper are let's they are true uh especially further down a lot of the experiments are really cool but it is a bit of an overstatement uh what they say uh right here so their claim is that in if you have a convolution what you do you do something that's spatial agnostic and channel specific which means that in a in a convolutional neural network when you have an image let's say with a bunch of pixels these are now true pixels not patches and you run a convolutional layer over it you run a convolutional kernel over it you put the center of the kernel at some pixel then so the kernel will be something like a three by three kernel you put that on the center here so it overlaps here you multiply element wise and then you aggregate and you can do that in multiple channels but essentially you do that and then after you've done that you move you move the kernel one let's say to the right you shift it so the center is here you do the same thing again and you shift it you do the same thing again so it's spatial agnostic because it repeats the same computation over and over and over across the image and it doesn't care where the computation is right it does the same computation 
and that is the selling point of convolutional neural networks they are translation invariant this is it's a form of weight sharing right you share the weights across the locations and therefore you don't really care where stuff is in the image that the cnn will be able to recognize it just as well and you don't need to learn over and over and over the same principle just because it's in different parts of the image so this is spatial agnostic what does channel specific mean that for that we have to go into the multiple channels uh realm so if if your image has multiple channels let's say i draw a new image right here with a bunch of pixels and it has multiple channels that means you can imagine it sort of as a as a 3d tensor here where each pixel is a column and every column is a vector of a certain dimensionality i mean so the original image has of course three channels which is red green and blue but if you have intermediate representations these channels can grow to sizes of hundreds of channels and the the point of the channels is every entry here is a number and every number can sort of capture one aspect of what's described in that particular pixel right so maybe the first channel is uh is there a corner the second one is there an edge the third one is there is it was it originally a blue pixel uh the fourth one is there probably a cat here and so on so that these are like the different features in in the channels and the convolution operator is channel specific that means if you have the kernel now convolutional kernels aren't as easy as i drew them they're in fact four dimensional tensors so that is um they are four dimensional tensors which makes it a little bit complicated for me to draw honestly however if you you can imagine that you have one kernel like so okay that has the same amount of channels as your image okay so now you can still do the same operation right you can overlay your kernel on a part of the image you can overlay it like so and no that's in the back and then you can do element wise multiplication and then you do an sum you sum it all up right after you do this operation you do a big sum over all the elements of whatever your kernel multiplied with your image and that gives you one number you do an all reduce one number gives you one number and so you do this so this is one kernel but you have another one right here um yeah like this and you do the same thing and that gives you also one number and you have another kernel i think you you get the idea you have another kernel here so you have many of those kernels per layer when you actually if you've never looked at you know how the weights look when you instantiate these layers in a deep learning framework encourage you to do so right a a convolutional layer will have weights that are of the size kernel size by kernel size by input channels by output channels so it's a 4d tensor and this the orange part here is just one of those sub tensors in fact you have as many as you have output channels and that gives you of course when you then go over all of these that gives you the next layer so that becomes in the next layer um so this is the next layer representation right at the point where you overlaid the the kernel in the last thing that will become this column right here okay so you have the orange thing in the first the blue thing in the second channel green thing in the third channel and so on hope this is relatively clear so you have in fact one convolutional kernel per output channel okay so if if you call 
So if you call each of those sub-tensors a convolutional kernel, you have one kernel per output channel, and that is what makes the operation channel-specific. This is a conscious choice, and it makes sense when you think about it, because each output channel means something different. If my output channel asks "is there a cat at this particular location", I might want to aggregate the last layer's representation differently than if my output channel asks "is this part of the sky" or "is there a corner here". So I want a different set of weights for each output channel, because they mean different things. In summary: a convolution is spatial-agnostic because it does the same computation at every location, and channel-specific because it does a different computation for each output channel, even though it does so at all locations equally.

Now we're prepared to invert that. Involution promises to invert this: we want something spatial-specific and channel-agnostic. Let's start with the channel-agnostic part. If you've seen my last video about MLP-Mixer, this is very much the same idea: why do we have a different computation per channel? Can't we apply the same principle we apply to the spatial dimensions, where we just slide the same computation over the image, which is weight sharing and generally a good thing? Why not aggregate the information in the same way for all the different channels? And you can do that: instead of having as many kernels as output channels, the involution comes up with a single kernel that it shares across all of the channels. They have a picture of this in the paper; just look at the last step there. The kernel isn't even sized by the number of channels; it's just a K-by-K-by-1 kernel. You put that over a location in the image and you share the computation across the channels. Given that the figure draws everything in the same color, it means you broadcast (that's the word I was looking for) the operation across the channels, and then you aggregate after that. So what involution does is broadcast and then not reduce: you don't reduce at the end to a single number, you keep the channels as they are. That's why a K-by-K-by-1 kernel is enough: you don't have a different computation for each output channel, and you don't reduce across the input channels, so you get away with a lot fewer parameters. Really it's just a K-by-K kernel.
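Here is a minimal sketch of that broadcasting scheme in isolation (my own code; the unfold trick is just one way to implement it, not necessarily how the paper's code does it): one K-by-K kernel, shared across all channels, with no reduction over the channel dimension.

```python
import torch
import torch.nn.functional as F

B, C, H, W, K = 1, 64, 8, 8, 3
x = torch.randn(B, C, H, W)
kernel = torch.randn(K, K)                    # just K x K -- no channel dimensions

patches = F.unfold(x, K, padding=K // 2)      # B, C*K*K, H*W  (all 3x3 neighborhoods)
patches = patches.view(B, C, K * K, H, W)
weights = kernel.view(1, 1, K * K, 1, 1)      # broadcast over batch and channels

# Sum only over the K x K window; the channel dimension is left untouched,
# so the output keeps as many channels as the input.
out = (patches * weights).sum(dim=2)
print(out.shape)                              # torch.Size([1, 64, 8, 8])
```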
That's one part. The other part is: why don't we do something that's spatial-specific? Remember, spatial-agnostic meant we slide the same kernel across the image. In a first instance the authors say something like: if we have a big image and we want to be spatial-specific, we could have a kernel that's just as big as the image. Then there's no more sliding across it; you simply multiply the two together, broadcast across the channels of the image, and there you go. That's also something that MLP-Mixer does: they essentially say, whatever, we don't do the slidey-slidey anymore (they do have weight sharing, but essentially they get rid of the sliding). You have a different weight for each location, which means the computation actually differs depending on where stuff is in the image. And we know that matters somewhat, because usually the sky is up, objects in natural images that humans take tend to be more in the middle than anywhere else, text goes from left to right, and so on. It's not all perfectly translation and location invariant, so it makes sense to have weights that are different for each position.

But then they run into a problem: you can't really do that, because then you can't just input pictures of different resolutions. That's one problem; the other, I think, is that it might simply not work too well. So they come up with a compromise (they don't call it a compromise, they call it something different): can we come up with a scheme where we retain a kernel of roughly the classic size, a small kernel, but make it different for each location? We still do the classic convolution way of doing things, local aggregation across neighboring pixels, however the kernel we use here is different from the kernel we use there, and that one is different from the kernel we use over there.

How can you make a computation where the kernel is always different? By coming up with the kernel in a dynamic way. The authors say: let's say we're at this pixel right here and we care about this neighborhood; how can we come up, on the fly, with a kernel for this particular pixel? Their answer is: let's just generate it from the pixel itself. This is the full involution diagram we've now arrived at. They take the neighborhood, outlined in the figure by the black scaffolding grid, with the center pixel marked in red. They take that pixel with all its channels, and use that pixel, and only that pixel (not the neighborhood), to come up with the kernel. The computation that produces the kernel is a small neural network, a two-layer network followed by a reshape: you compute the kernel for the neighborhood from the pixel itself. That means every single location gets its own kernel for the convolution, unless two locations have the exact same pixel, that is, the exact same color in the first layer or the exact same representation in intermediate layers. Specifically, the small neural network is a bottleneck network: it takes the pixel representation as a vector, bottlenecks it, applies a non-linearity, and then expands it again to the size of the actual kernel. Then you use that kernel, and you broadcast it instead of having one kernel per input channel, and then you multiply and don't reduce across the input channels. That's it.
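Putting the two ideas together, here is a simplified sketch of an involution layer as I understand it from the paper. This is my own re-implementation, so details such as normalization, striding, and the exact generator architecture may differ from the official code; the `groups` argument reflects the channel grouping the paper discusses later, where the kernel is shared within a group of channels rather than across every single one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Involution2d(nn.Module):
    """Sketch of involution: per-location kernels generated from the center pixel."""

    def __init__(self, channels, kernel_size=3, groups=4, reduction=4):
        super().__init__()
        self.k, self.groups = kernel_size, groups
        # Bottleneck kernel generator: reduce, non-linearity, expand to K*K per group.
        self.reduce = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.span = nn.Conv2d(channels // reduction,
                              kernel_size * kernel_size * groups, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        B, C, H, W = x.shape
        # 1. Generate a K*K kernel for every location (and every channel group)
        #    from that location's pixel vector alone.
        w = self.span(self.relu(self.reduce(x)))                  # B, K*K*G, H, W
        w = w.view(B, self.groups, 1, self.k * self.k, H, W)
        # 2. Gather the K x K neighborhood around every location.
        p = F.unfold(x, self.k, padding=self.k // 2)              # B, C*K*K, H*W
        p = p.view(B, self.groups, C // self.groups, self.k * self.k, H, W)
        # 3. Broadcast each generated kernel across the channels of its group and
        #    sum over the spatial window only (no reduction across channels).
        return (w * p).sum(dim=3).view(B, C, H, W)

x = torch.randn(2, 64, 16, 16)
print(Involution2d(64)(x).shape)   # torch.Size([2, 64, 16, 16])
```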
Not reducing across the channels alleviates you from having to have multiple kernels, one for each output channel. And that is the whole involution pipeline. I would say there are multiple different concepts in here: coming up with the kernel on the fly is one concept, and the broadcasting scheme is an entirely different concept. You could do each independently of the other, and they do them together (they do have ablations further down), but it's sort of two new things in one.

The first part, generating the kernel on the fly, might remind you of an attention mechanism, because it's a form of fast weights: the weights of the computation are computed on the fly from the data itself, and that is exactly what an attention mechanism does. However, here it's done in a slightly different way, and the paper has a discussion about attention along these lines. There are a bunch of differences. In attention, even in local self-attention, you don't compute your weights only from the location where you are; you compute them from the entire region you care about. That's the first difference. The second is that in self-attention you have queries and keys: every element of your neighborhood produces a query and a key, and then you do this quadratic thing in order to determine how you should aggregate your information. Not so in involution: you simply don't produce keys, you only produce queries, if you will (or only keys, however you want to look at it), and you don't do the quadratic thing; rather, you immediately interpret the generated vector as the weights of aggregation.

The authors also say that you can interpret this as the positional encodings already being present in these weights, because each weight is specific to a position. In the attention literature, you have to supply positional encodings so that the algorithm knows that this pixel here is a different thing from that pixel there. Not here: the individual entries of the generated kernel immediately refer to different positions relative to the pixel you're considering, so the network is very aware of which position is where. They say this in part explains why other people have had a lot of success with leaving away the keys and only using positional encodings together with the query, and if I'm not mistaken, I think you could frame LambdaNetworks into this category: they never do the full attention, they rely heavily on positional encodings, although those can be learned ahead of time, or statically.

So that's the connection to attention: the weights are constructed on the fly; however, there's no quadratic interaction, no softmax and so on, you just construct the weights from the pixel in the center, which makes it less powerful.
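To make the contrast explicit, here is a rough sketch (my own illustration with made-up, random weight matrices standing in for learned projections; not code from the paper) of how the aggregation weights for one pixel and its K-by-K neighborhood are obtained in local self-attention versus involution.

```python
import torch

C, KK = 64, 9                            # channel dim, K*K neighborhood size
center = torch.randn(C)                  # the pixel we are centered on
neighborhood = torch.randn(KK, C)        # its flattened K x K neighborhood

# Local self-attention: neighbors produce keys, the center produces a query, and
# the aggregation weights come from their quadratic interaction plus a softmax.
Wq, Wk = torch.randn(C, C), torch.randn(C, C)
query = Wq @ center                                           # C
keys = neighborhood @ Wk.T                                    # KK x C
attn_weights = torch.softmax(keys @ query / C ** 0.5, dim=0)  # KK weights

# Involution: the weights come from the center pixel alone via a small generator --
# no keys, no query-key product, no softmax. Because weight i always refers to
# neighborhood position i, the position information is baked in.
W_gen = torch.randn(KK, C)               # stands in for the bottleneck generator
invo_weights = W_gen @ center            # KK weights, one per neighborhood position

out_attn = attn_weights @ neighborhood   # both then aggregate the neighborhood
out_invo = invo_weights @ neighborhood
```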
To frame attention as just a more complicated instantiation of their idea, as the authors do here, is a bit out there in my opinion. The second thing I worry a bit about is the claim that this is position-specific or location-specific. They started out saying convolution is spatial-agnostic and that they want to do something spatial-specific, but this operation is also spatial-agnostic: if the same pixel appears at different locations in the image, this thing will produce the same weights, and the computation will be the same. In fact, the entire kernel-generating computation is a spatially agnostic computation. The difference is really the same difference you have between slow weights and fast weights, where you construct the weights of the actual computation on the fly; however, the way you construct those weights remains position-agnostic.

And the second point: the weight sharing, I feel, is a bit of an independent thing. I get that the two work well together, but the broadcasting and weight sharing across the channels is almost a separate, much simpler contribution. It's related to taking a depthwise-separable convolution and simply sharing the weights across the channels; that's about what it boils down to (I'll show a tiny sketch of that in a moment).

So what does that give us? In fact it gives us a lot. In this paper they do experiments and compare against, for example, ResNets and other networks with a similar number of parameters, and I like these experiments in that they always make sure they have the lowest number of parameters among the models they compare with, yet they show that they still beat these models. Specifically, I think they compare to a ResNet with the same number of layers, a stand-alone ResNet (which I think is the self-attention one), and an axial ResNet, which interestingly has slightly fewer parameters, and you can see that their network outperforms these on the tasks shown. This is ImageNet; they also have other tasks, such as a segmentation task, where they perform better: there's a picture further down comparing against the baseline, and you can see the involution network does a better job at this kind of thing, which is believable.

I think the fact that they are better in these numbers is really cool, and it's probably due to the on-the-fly computation of weights, which is a more powerful idea than the static weights of a convolution. The lower number of parameters, I think, is more a result of their weight sharing scheme. They tout how their network is on par with ResNet-101 regarding top-1 recognition accuracy while saving 65 percent of storage and computation, and I think that saving is mostly due to the weight sharing mechanism. They've selected tasks here, and they might be important tasks, but I think in these tasks whether or not you share the weights just doesn't hit you as hard, or is even beneficial if you don't have enough data, and that's why they get away with fewer parameters. What you can also observe is that the differences get continuously smaller as you move up the scale of the network, and this is all on the same dataset.
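Coming back to that remark about the broadcasting scheme boiling down to a depthwise convolution with weights shared across channels, here is a tiny sketch of the idea (my own illustration, not the paper's ablation code).

```python
import torch
import torch.nn.functional as F

B, C, H, W, K = 1, 64, 32, 32, 3
x = torch.randn(B, C, H, W)

shared = torch.randn(1, 1, K, K)             # a single K x K kernel
weight = shared.repeat(C, 1, 1, 1)           # the same kernel copied for every channel

# groups=C makes this a depthwise convolution: each channel is filtered on its own,
# and since every slice of `weight` is the same kernel, the weights are effectively
# shared (broadcast) across channels -- the weight sharing idea in isolation.
out = F.conv2d(x, weight, padding=K // 2, groups=C)
print(out.shape)                             # torch.Size([1, 64, 32, 32])
```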
It would be interesting to see how this performs at really large scale, because my intuition is that as you go larger and larger, this approach is going to top out and lose out to the more general architectures like attention and, apparently, MLPs; it's a clown world now. But in these regimes, and I would argue these are the regimes a lot of practitioners care about, these and actually smaller ones (not many people are in the super high data regime), this seems to perform reasonably well. You can see that the curves comparing compute to accuracy are very favorable, again especially if you're in the low-resource region; it might be something you want to try out. It remains to be seen how well this is pre-trainable and fine-tunable and so on, but it's something worth trying. It would also be interesting to see what happens if you only use parts of it, for example if you still do convolution but apply this weight sharing, this broadcasting scheme.

They also have a notion of grouping in the channels, much as the attention mechanism has. They say that sharing a single kernel across all channels obviously underperforms in accuracy, considering the channel redundancy of involution kernels, but as long as the number of channels shared in a group is set to an acceptable range, the channel-agnostic behavior will not only reserve (I guess they mean preserve) the performance but also reduce the parameter count and computational cost, and it also permits a larger kernel size under the same budget. So it's the same reasoning as people introducing groups or different heads in multi-head attention.

So try all of this stuff out, I think it's worth it. The code is available right here, and I'll also put a link to it. And that was it from me for this paper. I wish you a very pleasant whatever the day of the week is, and bye bye.