The Dimpled Manifold Model of Adversarial Examples in Machine Learning (Research Paper Explained)

### Article: Understanding Adversarial Examples Through Manifold Analysis

---

#### **Introduction**

Adversarial examples have long been a puzzle in machine learning, particularly for deep neural networks (DNNs). These are inputs that have been intentionally perturbed to cause misclassification while the change remains nearly imperceptible to humans. For instance, an image correctly classified as a plane can be pushed into the "bird" class with tiny pixel changes, and the target class can in fact be chosen almost arbitrarily. Understanding why these adversarial examples exist and how they work has been a focal point of research in recent years.

In this article, we explore two contrasting perspectives on adversarial examples: the "dimpled manifold model" proposed by Adi Shamir, Odelia Melamed, and Oriel BenShmuel, and the "stretchy features" view advanced by critics of the paper, which builds on the robust versus non-robust features line of work from Aleksander Madry's group. Both approaches aim to explain why neural networks are so sensitive to adversarial perturbations while humans are not.

---

#### **The Dimpled Manifold Hypothesis**

The dimpled manifold hypothesis, introduced by Shamir and his co-authors, suggests that natural images lie on a low-dimensional manifold inside the high-dimensional input space, and that the decision boundary of a trained network closely follows this manifold except for small "dimples" that bend it just enough to put each training example on the correct side. Because the boundary hugs the manifold everywhere, a small step perpendicular to the manifold is enough to cross it, which is why tiny adversarial perturbations suffice.

To support this hypothesis, the authors ran experiments (notably on ImageNet) in which they trained an autoencoder to compress images into a low-dimensional latent representation and then linearized around it, giving them a way to project attack gradients onto or off the image manifold while measuring the resulting perturbation norms. Their findings were striking: attacks forced to stay on the manifold required significantly larger perturbations (up to six times) than unconstrained attacks, whereas off-manifold attacks were about as small as unconstrained ones. They read this as evidence that the decision boundary closely follows the data manifold, so that adversarial examples are found by leaving the manifold rather than by moving along it.
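As a rough illustration of the projection step, the sketch below assumes we already have an orthonormal basis `U` (shape n × k, with k ≪ n) for the linearized image manifold at the current image, for example obtained by linearizing a trained autoencoder's decoder around the image's latent code. The names and shapes are illustrative, not the paper's code.

```python
import torch

def project_perturbation(delta, U, on_manifold=True):
    """Project a flattened perturbation onto (or off) a linearized image manifold.

    delta: (n,) perturbation in input space.
    U:     (n, k) orthonormal basis for the manifold's tangent space at the
           current image (k << n), e.g. from a trained autoencoder linearized
           around the image's latent code.
    """
    on_part = U @ (U.T @ delta)                 # component inside the tangent space
    return on_part if on_manifold else delta - on_part

# Toy usage with a random basis, purely to show the shapes involved.
n, k = 3072, 50
U, _ = torch.linalg.qr(torch.randn(n, k))       # random orthonormal (n, k) basis
delta = torch.randn(n)
d_on = project_perturbation(delta, U, on_manifold=True)
d_off = project_perturbation(delta, U, on_manifold=False)
print(d_on.norm().item(), d_off.norm().item())
```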

---

#### **The Stretchy Features Hypothesis**

The counterargument explored here is what the critique calls the "stretchy features" model, which grows out of the robust versus non-robust features work from Madry's group. In this view, networks learn many features that genuinely help classification, but these features have wildly different sensitivities in input space: some can be moved a long way in feature space by a tiny pixel change, while others require large pixel changes to move at all. Adversarial perturbations simply push hard along the highly sensitive ("stretched") directions.

For example, consider a network trained to classify cats and dogs. Both the "shape" feature and the "fur" (texture) feature are useful, but they behave very differently in pixel space: changing an animal's shape requires altering many pixels, while its fur texture can be shifted from cat-like to dog-like with tiny pixel changes. An L2-bounded perturbation therefore barely moves the shape feature but can move the fur feature a long way, and because the final layer weighs features linearly, the overpowered fur feature flips the prediction even though the shape is untouched. In this view, adversarial examples exploit these stretched features rather than a decision boundary dimpled around the data manifold.
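A hand-built toy model (not from the paper) makes the intuition concrete: two linear features, one with a gain of 100 ("fur") and one with a gain of 1 ("shape"), weighed equally by the classifier. All numbers are made up for illustration.

```python
import torch

# Two input pixels -> two features. The "fur" feature has gain 100, the
# "shape" feature has gain 1: both are useful, but one is "stretchy".
A = torch.tensor([[100.0, 0.0],    # fur / texture feature
                  [  0.0, 1.0]])   # shape feature
w = torch.tensor([1.0, 1.0])       # final layer weighs both features equally

def score(x):
    return w @ (A @ x)             # > 0 -> "dog", < 0 -> "cat"

x = torch.tensor([-0.05, -2.0])    # a "cat": score = -5 - 2 = -7
delta = torch.tensor([0.08, 0.0])  # tiny L2 perturbation along the fur pixel
print(score(x).item(), score(x + delta).item())   # -7.0 -> +1.0: label flips,
                                                  # shape pixel untouched
```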

---

#### **Synthetic vs. Real-World Experiments**

Shamir and his co-authors support the dimpled manifold model with both synthetic experiments and experiments on real image data such as ImageNet. In the latter, they trained an autoencoder on natural images, linearized its low-dimensional representation, and measured perturbation norms for attacks constrained to the manifold, attacks constrained to stay off it, and unconstrained attacks. On-manifold perturbations required norms up to six times larger than unconstrained attacks, while off-manifold norms were essentially the same as unconstrained ones, which the authors take as evidence that the decision boundary closely follows the data manifold.
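A minimal sketch of this measurement protocol is given below, assuming a trained classifier `model`, a single-example batch `(x, y)`, and a basis `U` as in the earlier projection sketch. It uses a plain L2 PGD-style loop that stops as soon as the predicted label flips and records the perturbation norm at that point; this is not the paper's exact setup (the original experiments reportedly used AdverTorch's L2 PGD).

```python
import torch
import torch.nn.functional as F

def l2_pgd_norm(model, x, y, U=None, on_manifold=True, step=0.05, max_iter=500):
    """Run an L2 PGD-style attack and return the perturbation norm at the
    first iteration where the predicted label flips.

    If U (an (n, k) orthonormal basis) is given, the gradient is projected
    onto the linearized manifold (on_manifold=True) or onto its orthogonal
    complement (False) before each step; U=None gives the unconstrained attack.
    """
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(max_iter):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0].flatten()
        if U is not None:
            on_part = U @ (U.T @ grad)
            grad = on_part if on_manifold else grad - on_part
        with torch.no_grad():
            delta += step * grad.view_as(delta) / (grad.norm() + 1e-12)
            if model(x + delta).argmax(dim=1) != y:   # greedy stop at the boundary
                return delta.norm().item()
    return float("nan")                               # attack did not succeed

# Usage sketch (model, x, y, U assumed to exist):
# norm_free = l2_pgd_norm(model, x, y)                        # unconstrained
# norm_on   = l2_pgd_norm(model, x, y, U, on_manifold=True)   # on-manifold
# norm_off  = l2_pgd_norm(model, x, y, U, on_manifold=False)  # off-manifold
```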

However, critics argue that this comparison does not isolate the claimed cause. Constraining an attack to a k-dimensional subspace with k far smaller than the input dimension removes most of the directions the attack could use, so larger perturbation norms are to be expected regardless of whether that subspace is the image manifold. A reproduction of the ImageNet experiment (using the well-known panda image) raised exactly this concern, which the random-subspace control described below makes explicit. This suggests that the norm gap, by itself, does not establish the dimpled manifold hypothesis.

---

#### **The Role of Feature Utilization**

A key point of contention between the two hypotheses lies in their differing views on feature utilization. The dimpled manifold hypothesis attributes adversarial sensitivity to the alignment of the decision boundary with the data manifold, while the stretchy features hypothesis attributes it to how networks weigh useful features whose input-space sensitivities differ by orders of magnitude.

A telling control experiment projects the attack gradients onto a random low-dimensional subspace of the same dimension as the autoencoder's latent space, rather than onto the image manifold. The perturbation norms inflate comparably to the on-manifold constraint, which raises an Occam's razor problem for the paper: the observed norm gap appears to be an artifact of restricting the attack to any low-dimensional subspace, not evidence that the decision boundary aligns with the manifold.
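A sketch of this control, reusing the hypothetical `l2_pgd_norm` helper from above; the latent dimension `k` and the input size are illustrative, and only the meaning of the subspace changes.

```python
import torch

# Swap the autoencoder-derived basis for a random orthonormal basis of the
# same dimension and rerun the constrained attack.
n, k = 3 * 224 * 224, 50                          # illustrative ImageNet-sized input
U_random, _ = torch.linalg.qr(torch.randn(n, k))

# norm_manifold = l2_pgd_norm(model, x, y, U_autoencoder, on_manifold=True)
# norm_random   = l2_pgd_norm(model, x, y, U_random,      on_manifold=True)
# Comparable norms would mean the inflation comes from constraining the attack
# to *any* low-dimensional subspace, not from the image manifold specifically.
```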

---

#### **Conclusion**

The debate between the dimpled manifold hypothesis and the stretchy features hypothesis highlights how unsettled the explanation of adversarial examples remains. Shamir and his co-authors offer a clean geometric picture and experiments consistent with it, but critics argue that other factors, in particular how networks weigh differently scaled features, explain the same phenomena at least as well, and that the supporting experiments admit simpler interpretations.

Ultimately, both perspectives contribute valuable insights into the nature of neural network decision boundaries and their vulnerabilities. As research continues, it is likely that a more comprehensive understanding will emerge, one that integrates elements of both hypotheses.

---

This article provides a detailed exploration of two competing explanations for adversarial examples, offering readers a deeper understanding of the challenges and nuances involved in building robust machine learning models.

"WEBVTTKind: captionsLanguage: enhello there today we're going to look at the dimpled manifold model of adversarial examples in machine learning by adi shamir odalia melamed and oriel ben shmuel this paper on a high level proposes a new way of looking at the phenomenon of adversarial examples in machine learning specifically in deep learning and they proposed this model called the dimpled manifold model essentially arguing that classifiers put their decision boundaries right next to the manifold of data while only slightly sort of curving it around the data like this now the data manifold being low dimensional this results in a situation where you can cross the decision boundary really easily if you simply go perpendicular to the data manifold which also is perpendicular to the decision boundary and if because it's just such a small dimple there uh the decision boundary is pretty close and that's how you end up with adversarial examples that are super easy to get so it's not a new attack a new defense anything like this it's simply a mental framework of explaining why adversarial examples exist on a high level they have some conceptual thought experiments uh they have um some explanations and some real world experiments now i personally don't think that this is entirely it's not necessarily incorrect but i don't think that this is really useful to think in this way and i'm gonna explain why in general my opinion of this is it doesn't really add anything and i think it explains less than the models we already had um yeah so that's that's my opinion i'm gonna get to it specifically also the experiments they propose i think that there is a big occam's razor failure right there but as i said we're gonna get to all of this and go through the paper and i want you to make up your own mind even though i'm going to try to bias you so uh yeah this is this is not a neutral channel in case you haven't noticed all right so if you you know like content or if you dislike it tell me in the comments tell me what you think of the paper whether it makes sense whether it doesn't make sense and so on i'd be very interested to see what you have to say uh yeah i read the comments so please they say the extreme fragility of deep neural networks when presented with tiny perturbations okay this starts out how every single adversarial example's paper always starts out saying okay deep neural networks are extremely fragile there's this phenomenon of adversarial examples now if you don't know what adversarial examples are really briefly essentially what this is it's a phenomenon where you take an image like the thing here on the left uh the neural network thinks it's a plane with a very high probability and you change it to this thing right here which you as a human can't even tell it's different however the neural network will think that this is now a bird with very high probability and the this is the change that you made uh it's magnified for you to see it kind of looks like random noise but it's a very particular noise that makes the neural network think it's something different and this is just it's tiny in the in its norm right so you don't see a difference now bird here is kind of close to plane but you can change this into anything literally anything you want you can change this into banana or uh i don't know dog or any class you want using these techniques so it's not about being close it's really kind of a separate phenomenon so that's adversarial examples and many frameworks have been proposed in order to 
explain these adversarial examples and they make a they make a nice overview right here um many had been proposed over the last 18 last eight years that dnns are too non-linear that they're too linear that they were trained with insufficient number of training examples that are just rare cases where they err that images contain robust and non-robust features etc they say however none of these vague qualitative ideas seem to provide a simple intuitive explanations for the existence and bizarre properties of adversarial examples so that is pretty harsh criticism specifically the first ones are kind of yeah but specifically this last one that images contain robust and non-robust features which is sort of the leading hypothesis right now of why adversarial examples exist and what they are and them here saying none of these can none of these vague qualitative ideas seem to provide a simple intuitive explanation for the existence like let's see whether or not they're gonna do better okay so um also in the abstract they go on and they say okay they introduced this new conceptual framework which they call the dimpled manifold model which provides a simple explanation for why adversarial examples exist why their perturbations have such tiny norms why these perturbations look like random noise and why a network which was adversarially trained with incorrectly labeled images can still correctly classify test images now this last part if you're not familiar with the literature it might come to you a bit random this why network which was adversarially trained with incorrectly labeled images can still correctly classify test images this is a famous experiment from the group of alexander madri where also this this hypothesis this one the robust and non-robust feature comes from and any any attempt at explaining adversarial examples after this paper has to explain why that experiment makes sense because it's kind of a non-intuitive experiment and we're gonna get to that as well but just so you know that's why they write it in the abstract now i personally think they don't have a good like this model here doesn't have a good explanation for why that works uh they're sort of hand-wavy trying in any case so they say in in the last part of the paper we describe the results of numerous experiments which strongly support this new model and in particular our assertion that adversarial perturbations are roughly perpendicular to the low dimensional manifold which contains all the training examples okay also remember this experiment they strongly support what in particular the assertion that adversarial perturbations are roughly perpendicular to the low dimensional manifold which contains all the training examples now remember this that the experiments are supposed to support this particular claim because also that is going to be important down the road okay so let's get into the dimpled manifold model what is it what do these authors propose and i'm going to try as best as i can to say what the authors are saying in the paper so they claim that there is an old mental image of adversarial examples and the old mental image is um is here it's uh they say we think the old mental image is based on the highly misleading 2d image on the left side of figure 1 and that's this thing right here so the old mental image is that there's a there is a data space right this here if you think of pic of images as data points this would be the pixel space right so this is images with two pixels right now in this conceptual framework 
but you have to sort of think yourself into higher dimension so they claim the old mental images the following you have sort of the data distributed somehow in this space the data being the all the set of natural images or images you consider which is kind of these these subspace these subgroups right here there are a bunch of images right there and there and also there and there so these are images of two different classes the red class and the blue class now they're distributed like this and what is a classifier supposed to do a classifier is supposed to put a decision boundary between them and that's what they draw in here so this would be sort of a reasonable decision boundary between the two classes right so now what do you do if you want to create an adversarial examples well necessarily you have to start at an image of a class this one maybe and you have to cross the decision boundary right you want to fool the classifier ergo necessarily by definition you have to cross the decision boundary so what do you do the the easiest way to do this is to sort of go straight towards the decision boundary which is approximately in this direction right here and then once you cross the decision boundary you are done you're on the other side you have created an adversarial example provided of course that the image still kind of looks like the original image okay so they say this has this has many many problems here they say the in this mental this mental image adversarial examples are created by moving the given images along the green arrows towards some kind of centroid of the nearest training images with the opposite label in which they mean this this thing right here so we would move the images towards the other class towards the images of the other class and they say as stated for example by ian goodfellow in his lecture at this time i'm going to cut this in right here i've i've said that the same perturbation can fool many different models or the same perturbation can be applied to many different clean examples i've also said that the subspace of adversarial perturbations is only about 50 dimensional even if the input dimension is 3000 dimensional so how is it that these subspaces intersect the reason is that the choice of the subspace directions is not completely random it's generally going to be something like pointing from one class centroid to another class centroid and if you look at that vector and visualize it as an image it might not be meaningful to a human just because humans aren't very good at imagining what class centroids look like and we're really bad at imagining differences between centroids but there is more or less this systematic effect that causes different models to learn similar linear functions just because they're trying to solve the same task okay so it really appears like goodfellow says this thing right here however they claim now they claim this doesn't make sense so they claim that you should think about adversarial examples in a different way and this is their dimpled manifold hypothesis so what is their dimpled manifold hypothesis they say what you have to do is you have to think about the data manifold in the higher dimensional space that they have the higher dimensional input space so in this case they consider instead of here this 2d landscape they consider the 3d landscape so this would be the pixel space right now we consider three pixel images and the data is embedded in a low dimensional manifold in this higher space so because if you think about all 
combinations of pixels that are possible so not all of them are natural images in fact only very few of the possible combinations of pixels are natural images or images that make you know sense to you as a human or are images that you could potentially generate by going out with a camera so the data you're considering lives on a very low dimensional manifold in this big space and you have to explicitly think about that now the data is the data manifold here is represented in this in this sheet in the middle and on this manifold you're going to have your different classes of of data here the blue or one class and the red are the other class what this paper claims is that what classifiers do what neural networks do when they classify the training data here is they go and they lay their decision boundary instead of so in the old model you would have thought maybe something like this happened where you put your decision boundary sort of in the middle between the two classes right crossing the manifold right here so you sort of put it in the middle between the two classes and then when you have to create an adversarial example again what you would do is you would maybe start here what you have to do is you would go straight towards the decision boundary right here okay crossing the decision boundary and then on the other side you'd have an adversarial example in this new model what they claim is the decision boundary actually doesn't look like this right here okay the decision boundary actually is very much aligned with the manifold of data as you can see right here so this mesh that they show is the decision boundary now and their claim is that that usually just aligns with the manifold of data however um around the actual data around the training samples what the classifier will do is it will create these what these dimples okay and these dimples are just tiny well dimples tiny perturbations in the decision manifold such that the data is on the correct side of the decision manifold sorry of the decision boundary right so the blue points here are under or one side of the decision boundary and the red points are on the other side of the decision boundary and for the rest the decision boundary just aligns with the data the data manifold now if you want to make an adversarial example now what you have to do again you start from an image and again you walk straight towards the decision boundary however now you don't have to go um like this you so what you can do is you can go simply perpendicular to the data manifold and you will cross the decision boundary very quickly because the dimple you're in is is kind of shallow and they give a reason why the dimples are shallow because they claim this is um results from training these models and that explains some things so the difference is the difference is we started out from this to make an adversarial example we have to go towards the decision boundary okay if we sort of transfer this image into higher dimensions it looks like this in the middle again in order to make an adversarial example we have to go towards the decision boundary now in the old mental image going perpendicular to the decision boundary means walking on the data manifold because we walk from this group of data towards this group of data okay you can see right here that we're walking on the data manifold when we walk perpendicular to the decision boundary whereas in the new model walking perpendicular to the decision boundary coincides with also walking perpendicular to the data 
manifold so this is the the difference right here that they that they claim so this they say there's um we call this conceptual framework the dimpled manifold model and note that it makes three testable claims about the kinds of decision boundaries created by trained deep neural networks first natural images are located in a k-dimensional manifold where k is much smaller than n second deep neural network decision boundaries pass very close to this image manifold and third the gradient of the classification's confidence level has a large norm and points roughly perpendicular to the image manifold all right so these are these are the claims that they're going to make to be tested and to be supported by experiments i guess so i hope i've represented enough what the authors claim right here i hope they would agree that i've represented this this accurately so now where is the problem with this in my opinion the problem isn't necessarily with what they claim right here um it's it's you know i don't necessarily disagree with this mental image i don't necessarily disagree with these claims in fact that the data is on low dimensional manifold this we've this is kind of commonly agreed upon assumption right as i said not all the possible pixels uh combinations make good natural images and that the fact that it is then a manifold is a commonly held assumption decision boundaries pass very close to the image manifold well the fact that we can generate adversarial examples right already means that decision boundaries pass very close to the image manifold so this also is not news this this has been like in everybody's conceptual framework for the last five years at least and then third the gradient of the classification's confidence level has a large norm and points roughly perpendicular to the image manifold and this claim right here i'm pretty pretty sure there so this is not a trivial claim um which yes okay this is not something that was like said around much however i'm going to claim that their model is not the only model by far that makes this happen or any something like this specifically when we go look at the experiments i'm going to show you that um this doesn't necessarily support their claims it doesn't disprove them right but it also doesn't necessarily support them just because they show that okay so the other problem i have with this is that this this thing they build up as ooh this is this is the old mental image this is how people thought about adversarial examples until now i look i jus i disagree like this it's a bit of a it's a bit of a straw man almost i feel like this no one no one thought no one that is sort of in the literature of adversarial examples thought or thinks that this is an appropriate model for what is happening like we know that these distances here are very small right the distance until you cross the decision boundary and we know also like if this were true you should just be able to go to the decision boundary and then go the same distance right and then at some point you would actually arrive at a sample of a different class so you could you could actually transform images into the other class by simply going into the adversarial direction which is precisely what we don't see right we see the image still largely looks the same what gets added looks like a bit of noise okay so no no one was having this mental image because clearly this mental image is is not appropriate for adversarial examples as well as saying look if you think of this in sort of higher 
dimensions um and i realize i've drawn this decision boundary but this is what they describe in the text um then i i don't i don't see that this is the correct way of like there are many different kinds of decision boundaries that are compatible with um with the decision boundary right here by the way this decision boundary i drew doesn't even separate the classes all the classes correctly what i'm saying is that also if you consider the decision boundary that for example looks like um out of colors looks like this that also crosses here however it's sort of kind of flat like this but it's still a linear decision boundary right um like this okay so this is above and the other part is below if you think of this if you project this down it looks the same in 2d and in 3d it also explains that decision boundaries are very close to the data samples it's a bit different though than this dimpled manifold hypothesis right if you i think the at least in my estimation what's happening is much more that you have just a bunch of these kind of linear uh decision boundaries flying around right here partitioning up the space and so on and this might result in a similar situation as here but it has quite different predictions in form of what it does than what it does right here here it's sort of a flat manifold dimpling around the data whereas here it's kind of the classifier separating the space into many regions always trying to sort of distinguish one class from the other and yeah so might end up bit the same but i don't think they give a fair shot at what we know so far like we that this model is not a a model that people hold in general especially the one on the left i can make an attempt at making a mental model that people hold so far maybe it's just me but i have a feeling this is a bit more so the model uh that i call let's call it something because they call there something right i call mine the squishy the stretchy feature model okay let's contrast this with the stretchy feature model so what i want to do is i have two features and this is a coordinate system in feature space okay so there's two features this in feature space i mean sort of the the last representation before the classification layer in feature space the two classes look like this so there is the red class and there is the blue class and you can see right here there are two features and for some reason the network can classify along these two features maybe because there are other classes other data points so we can't put a decision boundary like this between the two we can classify along the two features okay so you can see there are two features right here feature one and feature two and both features are actually pretty good features for keeping these two data points apart okay now there are empty spaces as you can see right here uh which we're gonna get to in a second but you can you can use both features and ideally a classifier would actually use both features it would say you know if feature one is high it's there probably a red class if feature two is low it's probably the red class and the combination makes even more of the red class however since we are in a deep neural network which is has transformations it transforms the data along the way if you look at the same situation in input space so in the actual pixel space it looks different and this is due to not necessarily the non-linearity of things but actually it is due to the the linear transformation it's actually the problem of adversarial examples at least in my 
estimation appears to happen in the linear layers if you think of for example like eigenvectors of matrices and the largest eigenvalues determine how far you can go in a particular direction by having a sort of a standard input delta and the same happens here by the way this is why spectral norm regularization tends to work at least a little bit against adversarial examples so what i mean is if you look at the scale of these features right they are like one two three four five of these features one two three four five if you look in the input space some of the features are going to have roughly the same scale right here and these features are going to be features that you have to change the input a lot in order to change the feature a lot what do i mean by this this is something like the shape of an of an image okay if you think of a cat uh the general shape of a cat you know it has it has two ears pointy it has a head and and so on that's the general shape of a cat um sorry that is actually the left right feature right this is the the left right feature is the shape and i have to change the input a lot in order to affect the feature right so they're roughly on the same scale of what i have to change to change the feature however the other the other feature in the input space has a much different scale than it has on in the feature space and this might be something like the fur structure of a cat so the first structure of a cat like is i can change the pixels a tiny bit and i'm going to change the first structure by a lot i can change the first structure of a cat to the first structure of a dog by just changing uh the by just changing the pixels a little however it will be different and now it will be the first structure of a dog so how does this change now in input space in input space it's going to look something like this where one feature dimension is going to look rather the same and the other feature direction is going to be very very stretched okay now remember both of these features are good features they both can be used to classify the images so you can see changing the shape requires a lot of pixels changing the first structure however requires just a little pixel now if i take some image and i draw an l2 ball around it which is what we usually do when we create an adversarial example we say only we only allow small to perturbations you can see that in in this direction it's a very you know you don't get very far in feature space but if you go the same distance in the in the input space into this direction in the feature space you're going to walk a lot you're going to walk like way far and this is just by definition there are going to be many features that you can use to classify images and they're going to be good features they're not going to be errors or aberrations like the first structure is a good feature to classify a cat they're going to be many features in there and some of them are going to be of large magnitude and some of them are going to be of small magnitude and this is just what happens okay so i call this the the the stretchy feature model and this is sort of a direct result of this paper that they cite by alexander madrid's group which we're going to get to in a second all right but keep those two in mind and we're going to see how which one explains the phenomena better and which one doesn't okay so they say why deep neural networks are likely to create dimpled manifolds as decision boundaries and the the idea here is that okay we have to now explain why this 
even happens so if you consider the data manifold in green right here and here we have just one dimensional data and you can see it's not linearly separable right so we have to have sort of a curve decision boundary around this um and why would this result in a dimpled manifold so they say look if you start off your your deep neural network training your maybe your decision boundary is going to be somewhere like here okay not very effective what's going to happen is let's say what you want what you want is you want to have the blue data you wanna have the blue data above and the red data below the decision boundary so right now the red data is is oh that's the other way around the red above and the blue below so right now the blue are fine like the blue don't complain you do get a gradient out of the red examples pushing the entire decision boundary down there's no resistance right the blue ones they're they're fine so you're gonna push down this is your next decision boundary okay same situation you're gonna push the entire decision boundary down now you're here now you're too far so you're gonna push the entire decision boundary up because now the red ones are fine the blue ones complain and this results you being sort of right on top of the data four once okay and then both gradients kick in so now the red data are gonna push such the decision boundary down the blue data are going to push the decision boundary up which is going to result in this sort of dimples around the data otherwise the decision boundary coinciding with the data okay this is their explanation for why the this why this works i hope this makes a little bit of sense now um yeah so they claim that that this is happening uh contrast this with the mental model of having a bunch of linear half spaces which would result in something like you know a decision boundary being through here a decision boundary being through here a decision boundary being through here and through here through here um which would also explain what we see but this is their claim why this decision boundary looks the way it is to me it's um it's a bit it's a bit weird right like here why should the decision boundary align with the data manifold maybe it doesn't maybe they don't they don't claim that i should not complain about this but for example in between the data why does it do that they give some examples right here the decision boundary it should be rather simple right it doesn't like to curve a lot they say the new model can help to understand why the training phase of a given network typically converges to the same global optimal placement of the decision boundary regardless of its random initialization they're going to make a claim right here why this happens to demonstrate this point consider the old model in which you sprinkle at random locations in the two-dimensional square alert as the large number of classes uh depicted in figure three sorry um i was confused for a second i am no longer so they're talking about this figure right here they say look in the old model you have if you want to pass sort of simple decision boundaries through this um you have to sort of pass them like some of the gray ones we see right here and they are not going to be so good okay so our goal is to pass a decision boundary of bounded complexity and this bounded complexity comes up again and again they claim of course their decision boundary is very smooth and very simple which will best separate the red and blue clusters they say there is a large number of way 
to do ways to do this like the green lines and most of them will be about equally bad in particular any decision to pass one side or the other of some cluster can make it harder to accommodate other clusters elsewhere along the line consequently there will likely be many local minima of roughly the same quality in the dimpled manifold model however there is likely to be a single globally best decision boundary shape since there is no conflict between our ability to go above one cluster and below a different cluster when they do not intersect so their idea here is that rather putting the decision boundaries like this what they want to do is you look at this in three dimensions and then they just kind of put a sheet over top of it and above the blue ones and they're below the red ones in all of the three dimensions right so you go above the blue ones and below the red ones rather than this these gray things like here which are not very optimal now this one i'm not really sure what to make of this because first of all uh they say it typically converges to the same global optimal placement of the decision boundary regardless of random initialization we we know that this is not true right um i've specifically made videos on research by stanislaw ford who shows that if you randomly initialize a network differently what will happen is you will reach the same accuracy but it will it will make mistakes on different samples of the test set right and there's actually a structure to how these decision boundaries are going to be different depending on your random initialization which actually would support what they claim is the old view right here second of all i have no trouble making a decision boundary here that separates red and blue right i can go something like this like this come here okay you get here right i have no trouble separating red and blue i guess this should go here um so there this this kind of this kind of bounded complexity does a lot of work here them saying oh the decision boundary should be simple and so on and that's why they insist that these decision boundaries should be somehow straight but then a lot but i disagree that their decision boundaries are so simple if you have to curve around every data sample and otherwise follow the image manifold like that seems to be like a rather complex decision boundary honestly um because it's it's it's kind of a generative model of the data right if you follow the the data manifold so i i disagree that theirs is so much simpler right just because it doesn't bend that much and here it like bends a lot that's also something they say like you you don't want to bend decision boundary so much that hardens training and third of all why do they give their model the benefit of the third dimension right so they claim like oh look the old model doesn't work because if you have to place division boundary between the data points um you're going to end up with a bad decision boundary however in order for their model to work they need the third dimension they need to pass like under and over the data in the third dimension whereas if you actually go into the third dimension you know every single lecture you have on kernelized svms and whatnot they show you like if you go in higher dimensions these things are actually separable like you would make if you have like rbf kernels these would become a cluster these would become a cluster and so on this is sort of the first lecture on going into higher dimensions in order to linearly classify stuff so it's 
not like their method can explain anything more than any other method if you give it this third dimension and the fact that they don't give the old model the third dimension but they give themselves the third dimension in order to explain it is a little bit i'm not i don't know it's like me um yeah so i i don't think this is any argument for for their model it just simply shows that if you have a lower dimensional manifold of data and you classify it in a higher dimension there are ways to do that right and if you like if you have relu networks and linear classifiers it's going to look like more chunky it's going to kind of divide the space into these kind of relu cells where you classify the data all of this is compatible with what they're saying not just their dimpled manifold hypothesis all right so this is yeah i don't i don't see the big explanation here so they claim what can they explain with their model explaining the mysteries of adversarial examples okay there are five things they claim they can explain with this first of all the mixture mystery right how can it be that a tiny distance away from any cat image there is also an image of a guacamole and vice versa um okay if these and if these classes are intertwined in such a fractal way how can a neural network correctly distinguish between them our answer is that all the real cat and guacamole images reside in on the tiny image manifold but below the real cut images there is a whole half space of pseudo guacamole images which are not natural images of guacamole and above the guacamole images there is a whole half space of pseudo cat images so their idea here is that okay you have this one dimensional data manifold here are the cats here the guacamoles if you have your dimpled manifold curving sort of around the data right here uh you know all of this is technically guacamole so if you go from the cat to here you reach a non-natural guacamole image just by the fact so the explanation here is that um the explanation is that this this the decision boundary lines up with the data manifold except around the data where it creates a small dimple and therefore you can cross the dimple into the other region okay you this is very it's the same effect as this model right here you know i can draw this dimpled manifold i can draw it right here right if i classify the image i can draw this dimpled manifold i get the same effect however this model here explains much more it actually explained like here there is no reason if you think about a multi-class setting right if you think of this in two classes fine but if you think of this in a multi-class setting there is no reason why this region right here should be guacamole it can be any other class right if the if the idea is the decision boundary follows the data manifold and then just dimples around the data to make the data correct declassify the only constraint here is is that these are cats it says nothing about sorry it says nothing about y on the other side there is guacamole instead of anything else and that does not coincide with what we know about adversarial examples like this region here is a consistent region what so first of all first of all my bigger problem is why does this even generalize why does the dimpled manifold hypothesis even generalize right like if it follows the if it follows the data manifold largely except around the the training data um why does it exactly generalize well to test data you have to like argue that the test data is here quite close because otherwise it 
would be it would get very confused on test data which would be somewhere else on the manifold right so but we know that generally neural networks classify data that's on the manifold of natural images quite well they generalize quite well um however this model is sort of an anti-generalization model but okay maybe you can claim that their test images are close enough to the training images such that this works but for example we know that if that um this this is a consistent region what do i mean by this we know for example we can make universal adversarial perturbations which means that we can find directions that no matter from which image or from which class we start from they will always result in guacamole this is not explained by the dimpled manifold there is no reason why these regions on the other side should be of a consistent label in a multi-class setting we also know that adversarial perturbations are transferable which means that we can make an adversarial perturbation in one um classifier and then in a different classifier even if it's trained with a different data set actually we can we can apply the same adversarial perturbation and it will most likely still be of the same like the adversarial perturbation going towards the same class there is no reason in the dimpled manifold hypothesis that explains these phenomena if you think of this of the stretchy feature model this is real easy right if i create an adversarial example um i go across the decision boundary right here what do i do i change the fur without changing the shape now i change the fur by so much that you know now there is a conflict right in feature space i go up here now there is a conflict it has the fur of a dog but the shape of a cat still now i there is a conflict but neural networks in the final layer are linear which means they just weigh the different features now i just pump that fur to be so dog-ish right that it overpowers the shape feature of the cat neural networks are biased towards sort of structure anyway over shape already so i just i just hammer that fur and now the neural network thinks it's it's a dog and a different neural network trained on the same data will also think it's a dog because it will also have learned to classify images by shape and fur therefore um therefore it will it will be vulnerable to the same attack right this is super easy to explain in this model there is no reason why this should happen in the dimpled manifold model unless you amend it by some more hand-wavy things they say the direction mystery when we use an adversarial attack to modify a cat into guacamole why doesn't the perturbation look green and mushy okay so they say well in the old model you would have to walk along the image manifold from here towards the guacamole images and that should mean that your image should sort of change to look like a guacamole in our mo in the dimple manifold model you go off the manifold perpendicular and that explains why the adversarial perturbation looks like a little bit like just random noise again no one thought this in the old model in fact we have a pretty good explanation why it still looks the same and that's because humans are much more receptive to this thing right here to the shape whereas neural networks also or much more consider this thing right here the fur so they consider fur and shape in different proportions than the humans uh do and so that's we already sort of knew this and it's in fact a better explanation um the uniformity mystery you know why the 
decision boundary is ever present um so they claim because the there's this dimple right here even you know the most far away cat image here has a close crossing to the decision boundary so there is no cat images that are kind of closer to the decision boundary but this is i think this is just a property of a high dimensional classifier i think that here our 2d view of the world betrays us and um yeah especially if we can go really far in feature space with a tiny perturbation and input space this is not not a mystery not even a mystery the vanishing gap mystery um okay which is about adversarially training i i think uh which we're gonna skip here and then there is the accuracy robustness trade-off mystery so this is if you do if you train a model adversarially which means that here look here i have my cat okay i train i have a data set of cats and dogs i train my neural network on it it's vulnerable what can i do what i can do is i can create adversarial images this is a cat right i can create adversarial images by making this into a dog okay so this is a dog because i changed the first structure a little bit this is an adversarial example now i add this so this is comes from the data set now i add this to the data set but i tell it this is a cat too right this is a cat and this is a cat if i do this with my neural network the neural network will become robust to adversarial examples to a degree not fully but to a degree this is the best method we have so far of defending against adversarial examples called adversarial training now what you do when you do this is you train the network to to sort of classify the yeah classify to incorporate the adversarialness into its decision making process and this results usually in a degradation of the generalization performance of the network so as it becomes more robust it becomes less accurate on real data right you gain accuracy on adversarial data you decrease the accuracy in real data which makes sense intuitively but it is a strong effect which is not the same as you know i simply teach my model to do yet another class um it is quite it is actually a a trade-off now they try to explain this um right here when we train a network we keep the images stationary and move to decision boundary by creating dimples when we create adversarial examples we keep the decision boundary stationary and move the images to the other side by allowing a large perpendicular derivative we make the training easier since we do not have to sharply bend decision boundary against around the training examples so this is when you train normally when you train without adversarial examples they say there is a large perpendicular derivative which um in the like the what they mean is that the data samples sort of push these dimples out that that's the large perpendicular derivative the perpendicularity is to the image manifold and that makes it easy because you don't have to bend the decision boundary a lot so you can kind of remain here and you have to kind of create these dimples again their argument is you don't want to bend this boundary a lot which makes training easy um however such a large derivative also creates very close adversarial examples yeah this is their claim that now the decision boundary is pretty close because you don't bend the decision boundary by too much around the data because you do dimples any attempts to robustify a network by limiting all its directional derivatives will make the network harder to train and thus less accurate i'm not super sure how 
to interpret this so i might be doing this wrong right here but if you create adversarial example what you do is you essentially have this data point and you create an adversarial example this statement is a well these are of the same class so now that this now the the decision boundary has to sort of bend harder okay which makes it uh more hard to train and at some point it so it's harder to train and that's why you have less accuracy and at some point it says well actually i don't want to bend that much i'd rather make a mistake here and just bend around both of these data points and now you have a wrong classification so that's sort of their explanation of why this happens which i find a bit hand wave you have to argue like ease of training bending the decision boundary and so on in this model right here super easy okay what happens if i create cats that have cat fur and doctor and i tell the network these both are cats well essentially i tell them i tell the network look there are two features right here the fur and the cat and you know the fur just just disregard it just don't do that don't regard the fur as a feature because it's useless now because i now have cats with cat fur and cat with dog fur so the network can't use that to classify anymore and that explains why it gets less accurate because i take away one useful feature okay so you know now the network has less useful features and that's why it gets worse this it's it's a pretty simple explanation in the stretchy feature model it has there's a lot of work to make this happen in the dimpled manifold model so lastly they try to explain and they became an interesting mystery um in this this paper that i have cited throughout and what that is is that it's kind of the same experiment as here where we create adversarial examples and we add them to the training set except for two things first of all we don't have the original so our new data set is not going to contain the original images it's only going to contain the adversarial examples second it is going to contain the adversarial example image but the label isn't going to be the correct label quote-unquote correct from where we created but the label is actually going to be the adversarial label the wrong label okay so we're going to tell the network this is a dog please learn that this is a dog right it's a cat with dog fur and the old training images are nowhere in the data set we just do a data set with these wrongly labeled images now when we go and we apply this so we train we use this we train a network right to classify cats and dogs and now we once we've trained this network we go we take one of these samples of the original data set we classify it it's going to give us a correct classification right so it will recognize that this here is a cat even though we told it that this here is a dog now how does it do this uh it does this by looking at the fur you know we've we've doubled down on the fur here right so this is like we we really made that fur feature super strong in these adversarial examples so it's going to look at the cat fur and even though none of the cats had the shape like this we sort of we sort of supercharged that fur feature again in this model not a problem essentially what we've done is we've created two data classes you know one up here and one down here that have the first supercharged and now it's just going to mainly look at that first structure and that is a useful feature right so this this what's called the features not bugs paper adversarial 
examples are features not bugs or other way around not bugs they are features um has demonstrated with this experiment this notion that there are adversarial examples result from useful generalizing features in the data set that are simply of by definition the features that are not large enough for humans to see what they call non-robust features how do they explain this they say the original people tried to explain this highly surprising role by distinguishing between robust and non-robust features in any given image where some of them are preserved by the adversarial change and some are not however it is not clear what makes some of the features more robust than others definition just definition like like if you have features and you order them by their size like by their how much you have to change the pixels that some features are going to be larger than other features and then some features going to be below that cutoff where you define adversarial examples but just this is definition makes them such that some of more robust it's not it's not clear our new model provides a very simple alternative explanation which does not necessarily contradict the original one okay at least this which is summarized in figure 4. to simplify the description we will use 2d vertical cut through the input space and consider only the decision boundary that separates between cats and anything else okay so they have this example right here they say look we have a decision boundary that distinguishes cats see from non-cats and the green one here is the image manifold and the gray is the decision boundary okay so now what we do is we create adversarial examples in frame two right here you can see we make the cuts into non-cuts and we make the b the bats into bats aren't very popular lately the badgers into into cats so we make the badgers into cats and we make the cats into the whatever d as ducks okay and now we relabel those and that gives us a new data manifold so the new data manifold is this data manifold right here and we have also new labels and now they claim the resulting decision boundary in figure four as you can see right here this is the resulting decision boundary the gray one it is it is very similar to the decision boundary in the first frame and therefore we shouldn't be surprised that this new decision boundary that results from this perturbed data results in the same decision boundary as the original one okay however um like why like why so their whole they have two notions notion one is that the decision boundary follows the data manifold closely except it sort of bends around the data a little and you can see this right here like this decision boundary kind of follows the data yet it just happens to be on the correct side of the data points um at any given moment which okay okay however they also make the claim in different parts of their paper that bending the decision boundary and so on is not good you'd rather want to have a simple decision boundary so to me there's no reason why the decision boundary couldn't just look like this it would correctly classify this new data set right however it would not correctly classify um it would not correctly classify the let's say the c that was right where was it right here or right here these data points would not correctly classify so you see that this until now they've always had this data manifold to be sort of super duper straight and smooth and that's how they can also say well following the data manifold and not bending too much and so on 
those are not in conflict with each other but now that they are in conflict with each other you have to give gonna give up one or the other and only in one of them do actually does this experiment here still make sense in the other one it doesn't and but if you give up the oh bending too much is bad then you know you lose a bunch of explanations that you have up here so yeah like it's one in my mind it's one or the other and there's there's still no reason i think no good reason why this like the decision boundary should align super closely with the data points like if there if there is nothing here right if this is perpendicular really to the data manifold like why would the decision boundary align so closely with the data manifold in that point i don't know okay so um they ask why are dnns so sensitive and humans so insensitive to adversarial perturbations essentially their argument here is that humans project the input data onto the image manifold which is a contested claim right i don't i don't think that is a uh i think that is not not a widely accepted i mean it's it's certainly possible uh but also i'm not sure i'm not sure that humans do project they have like an internal manifold of natural images and project onto that every time they analyze an image and also um also the yeah how do you project right like how like both of these features are useful okay so both of the features are useful if you project an adversarial example like why do you project it onto the shape dimension and not onto the fur dimension right why there's no explanation right here we know that sort of humans are more receptive to shapes and so on but just projecting won't get you there so now they're going to into experiments and i want to highlight one particular experiment right here they have synthetic experiments they have other experiments i want to highlight this experiment right here remember they said their experiments were going to give you know strong support that um in this experiment right here what they want to claim is that okay you have the data manifold here if you or if you have a data point and you make an adversarial example the question is um do adversarial examples go along the image manifold or do adversarial examples go sort of perpendicular to the image manifold they their claim again is that the this here would give support to the old view of adversarial examples and this here would support the dimpled manifold view because of course the decision boundary would be sort of following the data manifold curving around the data and then following the image manifold again so here would be sort of the other data point going below that a little bit all right so that is the view right here now what they're going to try to show you is that if you want to create an adversarial example on the manifold you have to walk much longer for much longer until you find an adversarial example than if you go off the manifold if you go yeah and they're also going to show you that if you're not constrained if you can go anywhere you want with an adversarial example then that will be very similar to when you force the adversarial example to go off the manifold and this gives a bit of proof that you know if two things behave equally they're you know probably equal so what they're going to do is they're going to try to make an adversarial attack first of all a regular one this one you're gonna say okay we're gonna make an adversarial attack let's measure how far we have to go to cross the decision boundary second 
Second, the same attack, but forced to stay on the manifold of natural images: measure that. Third, the same attack forced to go off the data manifold. Then compare the norms of the three perturbations. What they want to find, of course, is that the unconstrained and the off-manifold attacks have similar, much smaller norms than the on-manifold attack, as evidence that adversarial attacks go perpendicular to the data manifold and therefore do not have to go far.

How do they force the attack onto the manifold? With an autoencoder. An autoencoder is a neural network with a bottleneck layer that is trained to reconstruct its own input: the input is n-dimensional, the bottleneck is k-dimensional with k much smaller than n, and if the reconstructions are good, those k dimensions have captured the representation of the data. They train such an autoencoder, take the low-dimensional representation, linearize around it, and that gives them a way to project onto the image manifold: either move only within this low-dimensional subspace, or always project back onto it.
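The paper builds this projection from its linearized autoencoder; I do not have their code, so the following is only a minimal sketch of the mechanics, assuming you are already handed an orthonormal basis `basis` (an n-by-k matrix) for the linearized manifold at the current image. The function names, the step size, and the greedy stopping rule are my own scaffolding, not the authors' implementation.

```python
import torch

def project_grad(grad, basis, mode="on"):
    """Project a flattened attack gradient onto (or off) the span of `basis`.

    grad:  (n,) tensor, gradient of the attack loss in input space
    basis: (n, k) tensor with orthonormal columns spanning the linearized
           image manifold at the current point (assumed given)
    mode:  "on"  -> keep only the component inside the manifold
           "off" -> keep only the component perpendicular to it
    """
    coeffs = basis.T @ grad            # coordinates of grad in the manifold
    on_manifold = basis @ coeffs       # component lying in the tangent space
    return on_manifold if mode == "on" else grad - on_manifold

def greedy_l2_pgd(model, x, y, basis, mode, step=0.05, max_steps=500):
    """Greedy L2 PGD: project the gradient, normalize it, take a small step,
    and stop as soon as the classifier's decision flips; return the L2 norm
    of the accumulated perturbation."""
    x_adv = x.clone().detach()
    for _ in range(max_steps):
        x_adv.requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(x_adv[None]), y[None])
        grad = torch.autograd.grad(loss, x_adv)[0].flatten()
        grad = project_grad(grad, basis, mode)     # project BEFORE normalizing
        grad = grad / (grad.norm() + 1e-12)
        x_adv = (x_adv.flatten() + step * grad).view_as(x).detach()
        if model(x_adv[None]).argmax(dim=1).item() != y.item():
            break
    return (x_adv - x).norm().item()
```

The "project before normalizing" placement mirrors how the procedure is described below, so constraining the direction does not change the effective step size.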
One caveat before the numbers: how you train the autoencoder seems very relevant to what this "image manifold" ends up looking like. If you train it with an L2 reconstruction loss, you are already making claims about which features matter. But let's set that aside and grant that they have an accurate way of projecting onto the manifold of natural data. Here is what they find on ImageNet. The no-constraint PGD attack has some norm, around 0.14. The off-manifold PGD attack, where they deliberately project the attack off the manifold (project onto the manifold, subtract that component, so the attack has nothing to do with the image manifold), comes in at 0.152, slightly larger but essentially the same size. The on-manifold PGD attack gives a number roughly six times bigger. Their claim: you have to go up to six times farther on the manifold than off it, and that gives credence to their model.

Now, they describe their experiment in enough detail to reproduce it: they name the library, AdverTorch, so I used AdverTorch too; they used L2 PGD, so did I; and they state the size k of the low-dimensional representation and the input dimension n. So I reproduced the experiment. This is the panda image from ImageNet, with an ImageNet classifier, and the attack is greedy: it stops as soon as it crosses the decision boundary, and then the norm is measured. The perturbed image becomes a soccer ball, and the norm of the original, unconstrained adversarial perturbation is 0.7772.

Here is the one thing I did differently: I do not project onto the image manifold. I project onto an arbitrary k-dimensional subspace, with the same k, which is 3500, a very small number compared to the input dimension. What gets projected is the gradient of the adversarial attack, the one used to update the image; the paper lays the algorithm out clearly. So I take a random set of pixel coordinates, declare k of them to be "the manifold" and the rest to be "not the manifold". This has nothing to do with the image manifold; it is simply a random k-dimensional subspace of pixel space. Projecting onto it means zeroing out all the other gradient coordinates; projecting off it means zeroing out those k coordinates. The projection is applied before the gradient is normalized, exactly as in their algorithm, so there is no issue with the step size, and everything else proceeds as usual. The result: projecting onto this random "manifold", the norm jumps from 0.77 to 6.5, about eight times larger; projecting off it, the norm is 0.7773 instead of 0.7772. So, and maybe I have done something wrong and completely misunderstand what is going on, what they have found appears to be simply an effect of projecting onto any lower-dimensional space, yet they present it as support for their hypothesis. I have no idea where the data manifold is; I projected onto a random subspace and got the same results. They have other experiments, with other kinds of perturbations, but this is the one I could try quickly. To me, Occam's razor cuts hard here: many hypotheses are consistent with these results, and it is easy to read such results as support for your own hypothesis when other explanations fit just as well.
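For contrast, here is roughly what my substitute projection looks like: a random, axis-aligned k-dimensional subspace of pixel space instead of the autoencoder manifold. Only k = 3500 and the placement of the projection before normalization come from the description above; the index bookkeeping and names are mine.

```python
import torch

n = 3 * 224 * 224        # input dimension of the ImageNet classifier
k = 3500                 # same k as the paper's low-dimensional representation

# Pick a random set of k pixel coordinates and declare them "the manifold".
# This has nothing to do with the image manifold; it is just an arbitrary
# axis-aligned k-dimensional subspace of pixel space.
perm = torch.randperm(n)
manifold_idx, off_manifold_idx = perm[:k], perm[k:]

def project_random_subspace(grad, mode="on"):
    """Zero out the gradient coordinates outside (mode="on") or inside
    (mode="off") the random subspace; used in place of `project_grad`
    in the greedy L2 PGD loop sketched earlier, before normalization."""
    g = grad.flatten().clone()
    if mode == "on":
        g[off_manifold_idx] = 0.0    # keep only the k "manifold" coordinates
    else:
        g[manifold_idx] = 0.0        # keep only the remaining n - k coordinates
    return g.view_as(grad)
```

Swapping this in for the manifold projection is the kind of change that produced the pattern described above: the on-subspace attack needs a far larger norm, while the off-subspace attack is indistinguishable from the unconstrained one.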
Oh, I almost forgot Goodfellow's claim, which the authors file under the old, supposedly incorrect way of thinking: the claim that when you make an adversarial example you move toward the centroid of a different class. In the 2D picture that looks like the drawing on the left. But think about it in the stretchy-feature picture. Say you start at a cat and move toward the centroid of the other class in input space. Because the features live at very different scales, what happens in feature space is the blue arrow: you travel a very long way. (I actually drew this the wrong way around at first; the input-space picture should be roughly square and the feature-space picture super stretched.) The centroid, seen in feature space, sits far away along the stretched direction, and you cross the boundary in that one feature, the fur feature. So I think Goodfellow's claim is still correct: you do move toward the centroid of another class, but because you do so in input space, the result in feature space is a dramatic shift in some features and a small shift in others. You move toward the centroid equally in all pixel directions, but not equally in all feature directions. The claim is perfectly compatible with the stretchy-feature explanation, and, although I cannot read his mind, I suspect that is what he meant, not the caricature in which the whole image morphs into the other class.

That was the interjection; back to the conclusion. As I said, make up your own mind. Go through the paper: it is a good paper, it is well written, it has a lot of experiments and quite a large appendix with further results. And it is not necessarily incompatible with what came before; I don't disagree with their main claims. I just think the framework is not as useful as they claim and is insufficient on its own. I think we already knew much of this, and our existing mental models explain the phenomena a little better, for example the stretchy feature model, which now has a fancy name but is really just a bringing-together of what I think we already know about adversarial examples. Safe to say something will come along to challenge that too, and that will be exciting. Thanks so much for being here and listening, and I'll see you next time. Bye bye.

Hello there. Today we're going to look at "The Dimpled Manifold Model of Adversarial Examples in Machine Learning" by Adi Shamir, Odelia Melamed, and Oriel BenShmuel. On a high level, this paper proposes a new way of looking at the phenomenon of adversarial examples in machine learning, specifically in deep learning, called the dimpled manifold model. It argues that classifiers place their decision boundary right next to the low-dimensional manifold of natural data, curving it only slightly around the data points. Because the data manifold is low-dimensional, you can cross the decision boundary very easily by moving perpendicular to the data manifold, which is also perpendicular to the decision boundary, and because the dimple you sit in is so small, the boundary is very close. That, in their telling, is why adversarial examples are so easy to find. It is not a new attack or a new defense; it is a mental framework for explaining why adversarial examples exist.
They support it with some conceptual thought experiments, some explanations, and some real-world experiments. Personally, I don't think the view is necessarily incorrect, but I don't think it is useful to think this way, and I'm going to explain why. My overall opinion is that it doesn't really add anything and that it explains less than the models we already had. In particular, I think there is a big Occam's-razor failure in the experiments they propose; we'll get to all of this as we go through the paper. I want you to make up your own mind, even though I'm going to try to bias you: this is not a neutral channel, in case you hadn't noticed. If you like or dislike the content, tell me in the comments what you think of the paper, whether it makes sense or not; I do read the comments.

The paper opens with "the extreme fragility of deep neural networks when presented with tiny perturbations", which is how every adversarial-examples paper starts. If you don't know what adversarial examples are: it is a phenomenon where you take an image, say the one on the left, which the network classifies as a plane with very high probability, and change it into an image that you as a human cannot distinguish from the original, yet the network now classifies it as a bird with very high probability. The change, magnified so you can see it, looks like random noise, but it is a very particular noise, and it is tiny in norm. Bird happens to be close to plane, but that is not the point: using these techniques you can change the prediction into literally anything you want, banana, dog, any class. It is not about classes being close; it is a separate phenomenon.
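As a reminder of the mechanics, here is a generic, minimal sketch of such a targeted attack (not the paper's code; the epsilon value and the single gradient step are illustrative, and real attacks usually iterate, e.g. with PGD): you simply nudge every pixel in the direction that makes your chosen target class more likely.

```python
import torch
import torch.nn.functional as F

def targeted_fgsm(model, x, target_class, eps=0.01):
    """One-step targeted attack: move the image a tiny amount in the
    direction that increases the probability of `target_class`."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv[None]), torch.tensor([target_class]))
    loss.backward()
    # Step AGAINST the loss gradient to make the target class more likely.
    x_adv = x_adv - eps * x_adv.grad.sign()
    return x_adv.detach().clamp(0.0, 1.0)
```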
Many frameworks have been proposed to explain adversarial examples, and the paper gives a nice overview: over the last eight years it has been proposed that DNNs are too non-linear, that they are too linear, that they were trained with an insufficient number of training examples, that adversarial examples are just rare cases where the networks err, that images contain robust and non-robust features, and so on. The authors conclude that "none of these vague qualitative ideas seem to provide a simple intuitive explanation for the existence and bizarre properties of adversarial examples." That is pretty harsh criticism, particularly of the last item, robust versus non-robust features, which is more or less the leading hypothesis right now for why adversarial examples exist and what they are. Let's see whether they do better.

In the abstract they introduce their new conceptual framework, the dimpled manifold model, which they say provides a simple explanation for why adversarial examples exist, why their perturbations have such tiny norms, why these perturbations look like random noise, and why a network that was adversarially trained with incorrectly labeled images can still correctly classify test images. That last part may sound random if you don't know the literature: it refers to a famous experiment from Aleksander Madry's group, the same group the robust/non-robust feature hypothesis comes from, and any explanation of adversarial examples proposed after that paper has to account for why that experiment works, because it is quite unintuitive. We will get to it; personally I don't think this model explains it well, and the authors are somewhat hand-wavy about it. They also write that the last part of the paper describes "the results of numerous experiments which strongly support this new model, and in particular our assertion that adversarial perturbations are roughly perpendicular to the low-dimensional manifold which contains all the training examples." Keep that in mind: the experiments are supposed to support that particular assertion, which becomes important later.

So what is the dimpled manifold model? I'll try to represent what the authors say as faithfully as I can. They claim there is an old mental image of adversarial examples, based on what they call the highly misleading 2D picture on the left of their figure 1. In that picture there is a data space; think of images as points in pixel space, so the 2D plot is the space of two-pixel images, and you have to imagine it in much higher dimension. The natural images you care about form a few clusters in this space, belonging to two classes, red and blue, and a classifier puts some reasonable decision boundary between them. To create an adversarial example you necessarily start at an image of one class and you have to cross the decision boundary; you want to fool the classifier, so by definition you must cross it. The easiest way is to head straight for the decision boundary.
Once you cross it, you are done: you are on the other side and you have created an adversarial example, provided of course that the image still looks roughly like the original. The authors say this mental image has many problems. In it, adversarial examples are created by moving the given images along the green arrows toward some kind of centroid of the nearest training images with the opposite label, that is, toward the images of the other class, "as stated for example by Ian Goodfellow in his lecture". At this point I cut in the relevant part of that lecture, where Goodfellow says roughly the following: the same perturbation can fool many different models, and the same perturbation can be applied to many different clean examples; the subspace of adversarial perturbations is only about 50-dimensional even if the input is 3000-dimensional; and the reason these subspaces intersect is that the choice of directions is not completely random, it is generally something like pointing from one class centroid to another class centroid. If you visualize that vector as an image it might not be meaningful to a human, because humans are bad at imagining what class centroids look like and even worse at imagining differences between centroids, but there is a more or less systematic effect that causes different models to learn similar linear functions, simply because they are trying to solve the same task. So Goodfellow really does say something like this.

The authors claim this picture doesn't make sense, and that you should instead think about the data manifold inside the high-dimensional input space. Instead of the 2D picture they now consider a 3D input space, the space of three-pixel images, with the data embedded in a low-dimensional manifold within it. Of all possible pixel combinations, only very few are natural images, images that make sense to you as a human or that you could produce by going out with a camera, so the data you consider lives on a very low-dimensional manifold in this big space, and you have to think about that explicitly. The data manifold is the sheet in the middle of their figure, with the blue class and the red class lying on it. The paper's claim is about what neural networks do when they classify this training data. In the old model, you would have thought the classifier puts its decision boundary somewhere in the middle between the two classes, crossing the manifold, and to create an adversarial example you would again walk from a data point straight toward that boundary and across it.
In the new model, they claim, the decision boundary does not look like that. Instead it is very much aligned with the data manifold: the mesh they draw is the decision boundary, and it mostly just follows the manifold of data. Only around the actual training samples does the classifier create what they call dimples, tiny indentations of the decision boundary, just deep enough that each data point ends up on the correct side, blue points on one side, red points on the other, while everywhere else the boundary hugs the data manifold. If you now want an adversarial example, you again walk from an image straight toward the decision boundary, but you no longer have to travel along the data: you simply go perpendicular to the data manifold, and you cross the boundary very quickly, because the dimple you sit in is shallow. (They later give a training-based argument for why the dimples are shallow.) So the difference is this: in the old mental image, walking perpendicular to the decision boundary means walking along the data manifold, from one cluster of data toward the other; in the new model, walking perpendicular to the decision boundary coincides with walking perpendicular to the data manifold.

They call this conceptual framework the dimpled manifold model and note that it makes three testable claims about the decision boundaries of trained deep neural networks: first, natural images are located on a k-dimensional manifold where k is much smaller than the input dimension n; second, deep neural network decision boundaries pass very close to this image manifold; and third, the gradient of the classification's confidence level has a large norm and points roughly perpendicular to the image manifold. Those are the claims to be tested and supported by the experiments. I hope I have represented this accurately; I think the authors would agree with the summary so far. Now, where is the problem? In my opinion, not necessarily in the claims themselves. I don't particularly disagree with the mental image, nor with the claims.
That the data lies on a low-dimensional manifold is a commonly held assumption; as I said, not every combination of pixels makes a plausible natural image. That decision boundaries pass very close to the image manifold is not news either: the very fact that we can generate adversarial examples already implies it, and it has been part of everyone's conceptual framework for at least the last five years. The third claim, that the gradient of the classification confidence has a large norm and points roughly perpendicular to the image manifold, is not trivial, and it is not something that has been said much. But I am going to argue that their model is far from the only model under which this would hold, and when we get to the experiments I will show that they do not necessarily support the claim; they do not disprove it, but they do not single it out either.

The other problem I have is with the thing they build up as the old mental image, the way people supposedly thought about adversarial examples until now. I disagree; it is a bit of a straw man. Nobody in the adversarial-examples literature thinks that 2D picture is an appropriate model of what is happening. We know the distances to the decision boundary are very small. And if that picture were true, you could simply continue in the adversarial direction for the same distance again and arrive at an actual sample of the other class; you could transform images into the other class just by following the adversarial direction. That is precisely what we do not see: the adversarial image still largely looks the same, with what looks like a bit of noise added. So nobody held that mental image, because it clearly does not fit the phenomenon. As for the higher-dimensional version, there are many kinds of decision boundaries compatible with the picture they describe. (The boundary I sketch in the video does not even separate all the classes correctly, but that is beside the point.) For instance, take a decision boundary that also passes close to the data but is essentially flat, a roughly linear boundary with one class above and the other below. Projected down it looks the same in 2D, and in 3D it also explains why decision boundaries are very close to the data samples, yet it is quite different from the dimpled manifold picture. In my estimation, what actually happens is much more like a bunch of such roughly linear decision boundaries flying around, partitioning up the space.
That can end up looking similar, but it makes quite different predictions: on one side a flat manifold with dimples around the data, on the other a classifier chopping the space into many regions, always trying to separate one class from another. So I don't think they give a fair shot to what we already know; the picture on the left, in particular, is not a model people actually hold. Let me attempt to describe the mental model I think people do hold, or maybe it is just mine. Since they gave theirs a name, I'll call mine the stretchy feature model, and we can contrast the two.

Take a coordinate system in feature space, by which I mean roughly the last representation before the classification layer. In feature space the two classes look like two well-separated groups, red and blue, and there are two features, feature one and feature two, both of which are perfectly good features for telling the classes apart. (There are also empty regions, which we will get to.) Ideally a classifier would use both: if feature one is high it is probably the red class, if feature two is low it is probably the red class, and the combination is even more indicative. However, a deep neural network transforms the data along the way, so the same situation looks different in input space, in actual pixel space. This is not even primarily about non-linearity; in my estimation the problem of adversarial examples appears largely already in the linear layers. Think of the eigenvectors of a matrix: the largest eigenvalues determine how far a standard-sized input change can move you in particular directions. (This, by the way, is why spectral-norm regularization tends to help at least a little against adversarial examples.) So look at the scales of these features. In input space, some features keep roughly the scale they have in feature space; these are features you have to change the input a lot to affect, something like the overall shape of the image. Think of a cat: the general shape, the pointy ears, the head. That is the shape feature in my drawing, and I have to change the input a great deal to move it.
Its scale in input space is therefore roughly its scale in feature space. The other feature has a very different scale in input space than in feature space; think of the fur texture of the cat. I can change the pixels a tiny bit and change the fur structure a lot; I can turn cat fur into dog fur while barely touching the image. So in input space the picture is distorted: one feature direction looks much the same, while the other is stretched enormously. Remember, both of these are good features, both can be used to classify the images, but changing the shape requires a lot of pixel change while changing the fur requires very little. Now take an image and draw an L2 ball around it, which is what we usually do when we create adversarial examples: we allow only small perturbations. Inside that ball you barely move in feature space along the shape direction, but the same input-space distance along the fur direction carries you a very long way in feature space. And this happens essentially by definition: there will be many features usable for classification, good generalizing features rather than errors or aberrations (fur really is a good feature for recognizing a cat), and some of them will have large magnitude in this sense and some small. I call this the stretchy feature model, and it is essentially a direct consequence of the paper from Aleksander Madry's group that I will come back to. Keep both models in mind, and let's see which one explains the phenomena better.
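To make the stretchy-feature intuition concrete, here is a tiny toy calculation of my own (made-up numbers, not from the paper): a linear classifier over one small-scale feature and one heavily stretched feature, where the same L2 budget flips the decision only when it is spent along the stretched direction.

```python
import numpy as np

# Toy linear "cat vs. not-cat" classifier in input space.
# w_shape is the shape feature (you must change the input a lot to move it),
# w_fur is the fur feature (tiny input changes move it a lot).
w = np.array([0.05, 5.0])        # [shape weight, fur weight], illustrative numbers
x = np.array([10.0, -0.08])      # a cat: strongly cat-shaped, slightly cat-ish fur

print(np.sign(w @ x))                                  # +1.0 -> classified as cat

eps = 0.2                                              # small L2 budget
print(np.sign(w @ (x + eps * np.array([-1.0, 0.0]))))  # budget on shape: still +1.0
print(np.sign(w @ (x + eps * np.array([0.0, -1.0]))))  # same budget on fur: now -1.0
```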
Next, the paper argues why deep neural networks are likely to create dimpled manifolds as decision boundaries; they have to explain why this would happen in the first place. Consider the data manifold in green, one-dimensional data that is not linearly separable, so the decision boundary has to curve around it. Their argument: at the start of training the decision boundary sits somewhere unhelpful, with, say, both classes on the same side. You want the blue data above the boundary and the red data below. Initially only one class complains: the red examples all push the boundary in one direction while the blue ones offer no resistance, so the whole boundary moves; then you overshoot, the other class starts complaining, and the boundary moves back. Eventually it sits right on top of the data, both gradients kick in at once, the red data pushes it one way and the blue data the other, and the result is dimples around the data points with the boundary otherwise coinciding with the data manifold. That is their explanation, and I hope it makes some sense. Contrast it with the mental model of a bunch of linear half-spaces, which would give you several roughly straight boundaries cutting through the space; that would also explain what we observe. To me their story is a bit odd: why should the decision boundary align with the data manifold in between the data, where there is nothing? Maybe they do not actually claim that and I should not complain, but the figures suggest it.

They also hold that the decision boundary should be rather simple, that it should not curve a lot, and they claim that "the new model can help to understand why the training phase of a given network typically converges to the same global optimal placement of the decision boundary regardless of its random initialization." To demonstrate the point they consider the old model with a large number of class clusters sprinkled at random locations in a two-dimensional square, their figure 3. The goal is to pass a decision boundary of bounded complexity through it (this "bounded complexity" comes up again and again), like the gray lines in the figure, and they argue there are many ways to do this, most of them about equally bad: any decision to pass on one side or the other of some cluster can make it harder to accommodate other clusters elsewhere, so there will likely be many local minima of roughly the same quality. In the dimpled manifold model, by contrast, there is likely to be a single globally best decision boundary shape, since there is no conflict between going above one cluster and below a different one when they do not intersect. In other words, rather than threading boundaries between the clusters in 2D, you look at the problem in three dimensions and simply lay a sheet over the whole thing, passing above all the blue clusters and below all the red ones. I am not sure what to make of this, for several reasons.
First, they say training typically converges to the same globally optimal placement of the decision boundary regardless of random initialization, and we know that this is not true. I have covered work by Stanislav Fort and colleagues showing that networks with different random initializations reach the same accuracy but make their mistakes on different test samples, and that there is structure in how the resulting decision boundaries differ depending on the initialization, which, if anything, supports what they call the old view. Second, I have no trouble drawing a decision boundary through their figure-3 example that separates red from blue; it just has to wiggle. The notion of bounded complexity is doing a lot of work here, with the insistence that decision boundaries should be simple and fairly straight, but I also dispute that their own decision boundary is simple: a surface that follows the image manifold everywhere and curves around every single data sample is a rather complex boundary, practically a generative model of the data. So I disagree that theirs is much simpler merely because it does not bend sharply; and they themselves argue elsewhere that bending the decision boundary a lot is bad because it makes training harder. Third, why do they grant their own model the benefit of the third dimension? Their complaint about the old model is that you have to thread boundaries between the data points in the plane, but their model only works because it is allowed to pass under and over the data in an extra dimension. If you grant the old view an extra dimension too, then, as every lecture on kernelized SVMs will tell you, things that are not separable in low dimensions become separable in higher ones; that is the very first lesson on lifting data to higher dimensions for linear classification. So given the extra dimension, their picture explains nothing that other pictures cannot. With ReLU networks and linear classifiers the result just looks chunkier, the space divided into ReLU cells within which the data is classified, and all of that is equally compatible with the observations. I do not see the big explanatory win.

The paper then lists the mysteries of adversarial examples it claims to explain; there are five. First, the mixture mystery: how can it be that a tiny distance away from any cat image there is also an image of guacamole, and vice versa?
If the classes are intertwined in such a fractal way, how can a neural network distinguish them at all? Their answer: all the real cat and guacamole images reside on the tiny image manifold, but below the real cat images lies a whole half-space of pseudo-guacamole images that are not natural images of guacamole, and above the guacamole images a whole half-space of pseudo-cat images. In other words, the decision boundary lines up with the data manifold except for a small dimple around each data point, so you can step across the dimple into the other region: starting from a cat and moving off the manifold, you reach a non-natural guacamole image. But this is the same effect you get in the stretchy feature model; I can draw the dimpled picture for that model just as well, and the stretchy feature model explains more. With two classes, fine; but in a multi-class setting there is no reason, under the dimpled manifold hypothesis, why the region just off the manifold below the cats should be guacamole rather than any other class. The only constraint the dimples impose is that the cats are classified as cats; nothing says why the other side is consistently guacamole. That clashes with what we know empirically, because that off-manifold region is a consistent region.

My bigger problem is generalization: why would a dimpled-manifold classifier generalize at all? If the boundary follows the data manifold except right around the training data, you have to argue that test data lies very close to the training data, otherwise the model should be badly confused elsewhere on the manifold; yet we know networks classify data on the manifold of natural images quite well. As stated, this is almost an anti-generalization model. Maybe you can argue the test images are close enough to the training images, but it is not obvious. More concretely, we know we can construct universal adversarial perturbations, single directions that, no matter which image or which class you start from, reliably produce, say, guacamole; the dimpled manifold gives no reason why the regions on the far side of the boundary should carry a consistent label across the whole manifold in a multi-class setting. We also know that adversarial perturbations are transferable: a perturbation crafted against one classifier, applied to a different classifier, even one trained on a different data set, will most likely still push the prediction toward the same class.
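Transferability is easy to check empirically. Here is a generic evaluation sketch (entirely my own scaffolding; the model names, the attack signature, and the data loader are placeholders, nothing from the paper): craft targeted adversarial examples against one model and count how often an independently trained model falls for the same target class.

```python
import torch

def transfer_success_rate(source_model, target_model, attack, loader, target_class):
    """attack(source_model, x, y) is assumed to return targeted adversarial
    examples crafted against `source_model` (e.g. a targeted PGD)."""
    hits, total = 0, 0
    for x, y in loader:
        x_adv = attack(source_model, x, y)
        with torch.no_grad():
            preds = target_model(x_adv).argmax(dim=1)
        hits += (preds == target_class).sum().item()
        total += x.shape[0]
    return hits / total
```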
Nothing in the dimpled manifold hypothesis explains these phenomena. In the stretchy feature model they are easy. When I create an adversarial example, I cross the decision boundary by changing the fur without changing the shape. That creates a conflict in feature space: the image now has the fur of a dog but still the shape of a cat. The final layer of a neural network is linear, it simply weighs the features, so I pump the fur feature so far in the dog direction that it overpowers the shape feature (and neural networks are biased toward texture over shape anyway). A different network trained on the same data will also have learned to classify by shape and fur, so it is vulnerable to the same attack. That is trivial to explain in this model; there is no reason it should happen in the dimpled manifold model unless you amend it with more hand-waving.

Second, the direction mystery: when we use an adversarial attack to turn a cat into guacamole, why doesn't the perturbation look green and mushy? Their answer: in the old model you would walk along the image manifold toward the guacamole images, so the image should gradually look like guacamole, whereas in the dimpled manifold model you leave the manifold perpendicularly, which is why the perturbation looks like random noise. Again, nobody believed the old picture, and we already have a good explanation for why the image still looks the same: humans are far more receptive to shape, while neural networks weigh the texture, the fur, much more heavily than humans do. We knew this, and it is, I would argue, the better explanation. Third, the uniformity mystery, the fact that the decision boundary is ever-present, so that even the cat image farthest from the boundary in their picture still has a nearby crossing. They attribute this to the dimples; I think it is simply a property of high-dimensional classifiers, where our 2D intuition betrays us, especially when a tiny perturbation in input space can move you a long way in feature space. Hardly a mystery. Fourth, the vanishing gap mystery, which concerns adversarial training and which I will skip.

Fifth, the accuracy-robustness trade-off mystery. This one is about what happens when you train a model adversarially. Say I have a data set of cats and dogs and I train my network on it; the network is vulnerable. What I can do is create adversarial images: I take a cat and change the fur structure a little so that the network calls it a dog. That is an adversarial example.
I then add this adversarial image to the training set, but I label it as a cat, which it is. If I do this systematically, the network becomes robust to adversarial examples, to a degree; not fully, but to a degree. This is the best defense we currently have, and it is called adversarial training: you train the network to incorporate the adversarialness into its decision-making. The catch is that this usually degrades generalization: as the model becomes more robust, it becomes less accurate on clean data. You gain accuracy on adversarial data and lose accuracy on real data. Intuitively that may seem plausible, but it is a strong effect, and it is not the same as merely teaching the model one additional class; it really is a trade-off.
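For reference, here is what that recipe looks like in code: a minimal, standard PGD-based adversarial-training sketch of the kind introduced by Madry et al., not anything specific to this paper, and with illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, step=0.007, iters=10):
    """Standard L-infinity PGD used to craft training-time adversarial examples."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(iters):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + step * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

def adversarial_training_epoch(model, loader, optimizer):
    """One epoch of adversarial training: perturb each batch, but keep the
    ORIGINAL labels, i.e. 'a cat with dog fur is still a cat'."""
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```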
Now, the paper's explanation of this trade-off: when we train the network, we keep the images stationary and move the decision boundary by creating dimples; when we create adversarial examples, we keep the decision boundary stationary and move the images to the other side. By allowing a large perpendicular derivative we make training easier, since we do not have to sharply bend the decision boundary around the training examples. As I read it, in normal training the data samples push the dimples outward, that is the large derivative perpendicular to the image manifold, and this makes training easy because the boundary does not have to bend much; however, such a large derivative also creates very close adversarial examples, and, they say, any attempt to robustify a network by limiting all its directional derivatives will make the network harder to train and thus less accurate. I may be interpreting this wrongly, but the story seems to be: once you add an adversarial example that carries the same label as its origin, the decision boundary has to bend much more sharply, which makes training harder and accuracy lower, and at some point the network would rather make a mistake and bend smoothly around both points, giving a wrong classification. That feels hand-wavy to me; you have to argue about ease of training and amounts of bending. In the stretchy feature model it is simple: if I create cats with cat fur and cats with dog fur and tell the network that both are cats, I am effectively telling it to stop using the fur feature, because it is no longer informative. I have taken away a useful feature, so naturally the network becomes less accurate. A one-line explanation in the stretchy feature model, and a lot of work in the dimpled manifold model.

Lastly, they try to explain the genuinely interesting experiment from the paper I have been citing throughout, from Madry's group. It is like the adversarial-training setup above, with two differences. First, the new data set does not contain the original images at all, only the adversarial examples. Second, each adversarial example is labeled not with its original, correct label, but with the adversarial target label, the wrong label: we take a cat with dog fur and tell the network "this is a dog", and none of the original training images are present. We then train a cats-versus-dogs classifier on this data, and when we evaluate it on the original test set, it classifies correctly: it recognizes an ordinary cat as a cat, even though the only cat-shaped images it ever saw were labeled dog. How? By looking at the fur. In constructing the adversarial examples we doubled down on the fur, we supercharged that feature, so the network learns to rely mainly on fur structure, and fur is a genuinely useful, generalizing feature that transfers to the real data. In the stretchy feature model this is no problem at all. This is the experiment from the "features, not bugs" paper, "Adversarial Examples Are Not Bugs, They Are Features", whose account of non-robust features, and the dimpled-manifold paper's alternative reading of it around their figure 4, I went through above.
original one okay at least this which is summarized in figure 4. to simplify the description we will use 2d vertical cut through the input space and consider only the decision boundary that separates between cats and anything else okay so they have this example right here they say look we have a decision boundary that distinguishes cats see from non-cats and the green one here is the image manifold and the gray is the decision boundary okay so now what we do is we create adversarial examples in frame two right here you can see we make the cuts into non-cuts and we make the b the bats into bats aren't very popular lately the badgers into into cats so we make the badgers into cats and we make the cats into the whatever d as ducks okay and now we relabel those and that gives us a new data manifold so the new data manifold is this data manifold right here and we have also new labels and now they claim the resulting decision boundary in figure four as you can see right here this is the resulting decision boundary the gray one it is it is very similar to the decision boundary in the first frame and therefore we shouldn't be surprised that this new decision boundary that results from this perturbed data results in the same decision boundary as the original one okay however um like why like why so their whole they have two notions notion one is that the decision boundary follows the data manifold closely except it sort of bends around the data a little and you can see this right here like this decision boundary kind of follows the data yet it just happens to be on the correct side of the data points um at any given moment which okay okay however they also make the claim in different parts of their paper that bending the decision boundary and so on is not good you'd rather want to have a simple decision boundary so to me there's no reason why the decision boundary couldn't just look like this it would correctly classify this new data set right however it would not correctly classify um it would not correctly classify the let's say the c that was right where was it right here or right here these data points would not correctly classify so you see that this until now they've always had this data manifold to be sort of super duper straight and smooth and that's how they can also say well following the data manifold and not bending too much and so on those are not in conflict with each other but now that they are in conflict with each other you have to give gonna give up one or the other and only in one of them do actually does this experiment here still make sense in the other one it doesn't and but if you give up the oh bending too much is bad then you know you lose a bunch of explanations that you have up here so yeah like it's one in my mind it's one or the other and there's there's still no reason i think no good reason why this like the decision boundary should align super closely with the data points like if there if there is nothing here right if this is perpendicular really to the data manifold like why would the decision boundary align so closely with the data manifold in that point i don't know okay so um they ask why are dnns so sensitive and humans so insensitive to adversarial perturbations essentially their argument here is that humans project the input data onto the image manifold which is a contested claim right i don't i don't think that is a uh i think that is not not a widely accepted i mean it's it's certainly possible uh but also i'm not sure i'm not sure that humans do project 
I am not convinced that humans maintain an internal manifold of natural images and project onto it every time they analyze an image. And even if they did, how would the projection work? Both features are useful. If you project an adversarial example, why would you project it onto the shape dimension and not onto the fur dimension? There is no explanation for that here. We know that humans are more receptive to shape, but "just project onto the manifold" does not get you there by itself.

Now to the experiments; they have synthetic experiments and others, but I want to highlight one in particular, keeping in mind that they promised strong support for their model. The question in this experiment is: given a data point on the data manifold, does an adversarial example move along the image manifold, or does it move essentially perpendicular to it? Their framing is that movement along the manifold would support the old view of adversarial examples, while movement perpendicular to it would support the dimpled manifold view, because in that view the decision boundary follows the data manifold, curves around the data, and then follows the manifold again, with the other class sitting just below.

What they set out to show is, first, that if you force an adversarial example to stay on the manifold, you have to walk much farther before you cross the decision boundary than if you go off the manifold; and second, that an unconstrained attack, free to go anywhere, behaves very similarly to an attack forced off the manifold. The implicit argument is that if two things behave the same, they are probably the same. Concretely, they run three attacks: a regular, unconstrained adversarial attack, measuring how far it has to go to cross the decision boundary; the same attack forced to stay on the manifold of natural images; and the same attack forced to stay off the data manifold. For each they measure the norm of the resulting perturbation, expecting the unconstrained and off-manifold attacks to have similar, much smaller norms than the on-manifold attack, as evidence that adversarial attacks go perpendicular to the data manifold and do not have to go very far.

How do they force the attack onto the manifold in the first place? They train an autoencoder: a neural network with a bottleneck layer that is trained to reconstruct its input, so that input and output match as closely as possible while the middle holds a very low-dimensional representation; an n-dimensional input is squeezed into a k-dimensional code, with k much smaller than n.
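To make the bottleneck idea concrete, here is a minimal autoencoder sketch in PyTorch; the architecture and the sizes of n and k are purely illustrative and not the ones used in the paper.

```python
import torch
import torch.nn as nn

# Illustrative sizes only: n-dimensional images, k-dimensional bottleneck, k << n.
n, k = 3 * 64 * 64, 512

class BottleneckAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(n, 2048), nn.ReLU(), nn.Linear(2048, k)
        )
        self.decoder = nn.Sequential(
            nn.Linear(k, 2048), nn.ReLU(), nn.Linear(2048, n)
        )

    def forward(self, x):
        z = self.encoder(x)                  # low-dimensional code
        x_hat = self.decoder(z).view_as(x)   # reconstruction of the input
        return x_hat, z

# Training minimizes a reconstruction loss such as nn.MSELoss()(x_hat, x); note
# that this choice already encodes an assumption about which features matter.
```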
If the autoencoder can reconstruct the images accurately, then the low-dimensional code has captured the data in those few dimensions. So they train an autoencoder, take the low-dimensional representation, and linearize around it; that gives them a way to project onto the image manifold, either by only moving within this low-dimensional manifold or by always projecting onto it first. One caveat: how you train the autoencoder matters a lot for what this "image manifold" ends up looking like, and for these experiments I think that is very relevant; training it with an L2 reconstruction loss already makes claims about which features are important and which are not. But let us disregard that and assume they have an accurate way of projecting onto the manifold of natural data.

Here is what they find on ImageNet. The unconstrained PGD attack has some norm, around 0.14. The off-manifold PGD attack, where they deliberately project the attack off the manifold (project onto the manifold, subtract that component, and keep only what has nothing to do with the image manifold), comes out at 0.152, slightly larger but essentially the same size. The on-manifold PGD attack, by contrast, is a much bigger number, roughly six times bigger. Their claim is that you have to go up to six times farther on the manifold than off it, and that this gives credence to their view.

So here is what I did. They describe their experiment in enough detail to reproduce it: they state which library they use, AdverTorch, so I used AdverTorch as well; they use L2 PGD, so did I; and they give the size of the low-dimensional representation, the k, and the input dimension n. They run the attack greedily, stopping as soon as it crosses the decision boundary, and then measure the norm of the perturbation. I did the same thing with the panda image from ImageNet and an ImageNet classifier: the unconstrained attack turns the panda into a "soccer ball," and the norm of that perturbation is 0.7772.
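The following is a sketch of that greedy attack loop as I have described it, written as plain PyTorch under my own assumptions rather than the paper's code or the exact AdverTorch call: take normalized L2 gradient steps, stop at the first misclassification, and report the L2 norm of the final perturbation.

```python
import torch
import torch.nn.functional as F

def greedy_l2_pgd(model, x, y, step_size=0.01, max_steps=500):
    """Untargeted L2 PGD on a single image (batch size 1): take fixed-length
    gradient steps and stop at the first misclassification."""
    x_adv = x.clone().detach()
    for _ in range(max_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        (grad,) = torch.autograd.grad(loss, x_adv)
        # Normalize so every step has the same L2 length (a projection of the
        # gradient, when one is used, is applied right before this step).
        grad = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
        x_adv = (x_adv + step_size * grad).clamp(0, 1).detach()
        if model(x_adv).argmax(dim=1).item() != y.item():
            break  # greedy: stop as soon as the decision boundary is crossed
    return x_adv, (x_adv - x).flatten().norm().item()
```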
The one difference in my reproduction is that I do not project onto the image manifold at all. Where they project onto the manifold, I simply project onto an arbitrary k-dimensional subspace of pixel space, with k = 3,500, a very small number compared to the input dimension. What gets projected, in their algorithm and in mine, is the gradient of the adversarial attack, the gradient you use to update the image; their algorithm lays this out clearly. So what I do is take a random set of pixel coordinates in the gradient and declare the first k of them "the manifold" and the rest "not the manifold." This has nothing to do with the image manifold; it is simply a random k-dimensional coordinate subspace of pixel space. To project onto it, I take all the other coordinates of the gradient and set them to zero; after that the gradient is normalized and the attack proceeds as usual. Note that the projection is applied before the gradient is normalized, so there is no issue with the step size: you just project and continue. I do the same thing for projecting off the "manifold," where I instead set the k chosen coordinates to zero. A sketch of this projection step is given right below.
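Here is a sketch of that projection step under the same assumptions as the loop above: fix a random set of k pixel coordinates once, then either zero out everything else (stay inside the random subspace) or zero out those k coordinates (stay outside it), applied to the gradient before it is normalized.

```python
import torch

def make_random_subspace(num_pixels, k, device="cpu"):
    """Pick k random pixel coordinates once; this spans an arbitrary
    k-dimensional coordinate subspace of pixel space, nothing to do
    with the image manifold."""
    return torch.randperm(num_pixels, device=device)[:k]

def project_grad(grad, idx, mode):
    """Zero out gradient coordinates before the gradient is normalized.
    mode='on' : keep only the k chosen coordinates (stay inside the subspace)
    mode='off': zero the k chosen coordinates (stay outside of it)"""
    flat = grad.flatten(1)
    mask = torch.zeros_like(flat)
    mask[:, idx] = 1.0
    flat = flat * mask if mode == "on" else flat * (1.0 - mask)
    return flat.view_as(grad)

# Usage: in the greedy loop sketched earlier, call
#   grad = project_grad(grad, idx, mode)
# right before the normalization step, then compare the resulting perturbation
# norms ("on" vs. "off" vs. unconstrained).
```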
Now let us look at what happens. If I project the attack onto this random subspace, the perturbation norm goes from 0.77 to 6.5, about eight times larger. If I project it off the random subspace, the norm is 0.7773 instead of 0.7772. In other words, unless I have done something wrong and completely misunderstand what is going on, what they have found is simply an effect of projecting onto any lower-dimensional space, yet they claim it as support for their hypothesis. I clearly have no clue where the data manifold is; I projected onto a random subspace and got the same pattern of results. They have other experiments, with other types of perturbations and so on, that try to convince you; this is just the one I could try quickly, and again, maybe I have done it wrong. But to me Occam's razor cuts strongly here: many hypotheses are consistent with the same results and the same phenomena, and it is easy to read results as support for your hypothesis when other explanations are equally available.

I almost forgot Goodfellow's claim, which the paper files under the sort of old thinking it considers incorrect: that when you create an adversarial example, you move toward the centroid of a different class. In the paper's imagination this looks like the picture on the left. But think about it in the stretchy-feature picture. Say you start at a data point and move toward the centroid of the other class in input space. What happens in feature space, because of the stretchy features, because of the different scales? Pretty much the blue arrow: in feature space you travel a long way. (I should have drawn the input-space picture as roughly square and the feature-space picture as extremely stretched; the centroid that sits nearby in input space ends up way off in feature space.) The space gets stretched out, and you cross the decision boundary along that one feature, the fur feature. So I think the claim is still correct: you do move toward the centroid of another class, but because you do so in input space, the result in feature space is a dramatic shift in some features and a not-so-dramatic shift in others. In input space you move toward the centroid equally in all pixel directions; in feature space you do not move equally in all feature directions. Goodfellow's claim remains valid and is perfectly compatible with the stretchy feature explanation; I cannot read his mind, but I am fairly sure that is what he meant, and not necessarily the picture in which the entire image morphs into the other class.

That was the interjection; back to the conclusion. As I said, make up your own mind. Go through the paper: it is a good paper, well written, with a lot of experiments and a sizable appendix with more results. It is not necessarily incompatible with what we already believe, and I do not disagree with its main claims; I just think it is not as useful as it claims to be, and somewhat insufficient. I think we already knew a lot of this, and our current mental models explain these things a little better. The stretchy feature model, which now has a fancy name, is not mine; it is just a bringing-together of what I think we know about adversarial examples. It is safe to say something will come along that challenges it, and that will be exciting. All right, thanks so much for being here and listening, and I'll see you next time. Bye bye.