Radioactive data - tracing through training (Paper Explained)

Summary: Marking a Dataset to Detect Whether a Model Was Trained on It

Modern neural networks are trained on large collections of data, and the people who create that data often have no way of telling whether it ended up in someone else's training set. The paper "Radioactive data: tracing through training" by Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid and Hervé Jégou proposes a remedy: imperceptibly mark your images so that, after the fact, you can test whether a given model was trained on them. The authors call this radioactive marking, and the idea touches on both adversarial examples and differential privacy.

The marking procedure works as follows. The data owner first trains their own marker network on unmarked data to obtain a feature extractor. For each class they draw a random unit vector in that feature space, the carrier direction. A fraction of the images, which can be as small as one or two percent, is then perturbed so that its features shift toward the carrier of its class. The perturbation is found by backpropagation through the feature extractor, much like crafting an adversarial example, with regularizers that keep the change imperceptible in pixel space and small in feature space; the authors even backpropagate through differentiable data augmentation.
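
To make that step concrete, here is a minimal PyTorch sketch of the marking idea, assuming a frozen feature extractor `phi` from the marker's own network. The function name, step count, regularizer weights and clamping range are illustrative choices, not the paper's exact recipe.

```python
import torch

def mark_image(phi, x, u, steps=90, lr=0.01, lam_pix=1.0, lam_feat=1.0):
    """Shift an image's features toward its class carrier direction u.

    phi: frozen feature extractor mapping images to d-dimensional features
    x:   image tensor of shape (1, C, H, W)
    u:   unit-norm carrier direction of shape (d,) for x's class
    """
    phi.eval()
    feat_orig = phi(x).detach()
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        feat = phi(x + delta)
        align = (feat * u).sum()                         # feature alignment with the carrier
        loss = (-align
                + lam_pix * delta.norm()                 # keep the pixel change small
                + lam_feat * (feat - feat_orig).norm())  # and the feature drift small
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-8 / 255, 8 / 255)              # keep the mark imperceptible
    return (x + delta).detach()
```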

When a third party later trains a classifier on the released data, the marked examples act as an extra, artificial feature for their class. Because this fake feature is a genuinely useful predictor on the marked portion of the training set, the trained network learns to pay some attention to it, while the unmarked majority of the data forces it to keep relying on the real features as well.

Detection then reduces to a statistical test. In the white-box setting, one takes the rows of the suspect model's last-layer weight matrix, one classification vector per class, and measures their cosine similarity with the corresponding carrier directions. If the model never saw the marked data, carrier and classifier vector are independent directions in a high-dimensional space, so their cosine similarity concentrates around zero and follows a known beta-type distribution; if the model was trained on marked data, the similarity is shifted significantly toward the carrier, which yields a p-value for the claim that the data was used.
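
A sketch of that white-box test is below. It relies on the standard fact that for a fixed unit vector and an independent uniformly random unit vector in d dimensions, the cosine similarity c satisfies (c + 1)/2 ~ Beta((d - 1)/2, (d - 1)/2); the paper expresses the same null distribution through the incomplete beta function and additionally combines the per-class scores into a single test, which is omitted here. Variable names are illustrative.

```python
import numpy as np
from scipy.stats import beta

def whitebox_pvalues(W, carriers):
    """One-sided p-values for the white-box cosine-similarity test.

    W:        (num_classes, d) last-layer weight matrix of the suspect model,
              expressed in (or aligned to) the marker's feature space
    carriers: (num_classes, d) random carrier direction used for each class
    """
    d = W.shape[1]
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    Cn = carriers / np.linalg.norm(carriers, axis=1, keepdims=True)
    cos = np.sum(Wn * Cn, axis=1)             # one cosine similarity per class
    a = (d - 1) / 2.0
    pvals = beta.sf((cos + 1.0) / 2.0, a, a)  # P(similarity >= observed | random direction)
    return cos, pvals
```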

The authors argue that the approach has several attractive properties. The mark is visually imperceptible and survives a normal training pipeline; only a small fraction of the data needs to be marked; detection comes with an explicit statistical significance level rather than a heuristic score; and the accuracy cost for models trained on marked data is small, so the marking is hard to notice simply by watching validation accuracy.

There are also limitations. The test compares directions in feature space, but a network trained by someone else has no reason to use the same coordinate system: retraining, different architectures and the inherent symmetries of neural networks all rotate the feature space, so the two spaces first have to be aligned (see the subspace alignment step below). The paper also says little about deliberate defenses; the reviewer suspects that simple pre-processing such as a Gaussian blur could weaken or remove the mark, which puts the method in the same cat-and-mouse territory as adversarial examples.

The strength of the mark also has to be tuned carefully. Because the fake feature is a perfect predictor on the marked examples, a network could in principle latch onto it and ignore the real features, which would hurt generalization badly; keeping the mark low-signal and marking only a small fraction of the data avoids this, at the cost of a weaker detection signal, and a little generalization is still given up because the test data never contains the fake feature.

The intended application is data provenance: a dataset owner can release radioactively marked images and later check, with a quantified confidence level, whether a published model was trained on them, even when the marked images make up only a small share of that model's training set.

The paper contains further experiments and ablations, including transfer of the mark between architectures, and the authors position it as an establishing work that does the most basic version of the idea; the reviewer expects that more sophisticated and harder-to-detect marking schemes will follow.

A necessary ingredient is subspace alignment. The marker's feature extractor and the suspect model were trained separately, so before comparing directions the method estimates a linear transformation between the two feature spaces, for example from the features that both networks produce on a common set of images. This is reasonable because well-trained networks, even with different architectures, tend to learn similar features up to roughly such a linear transformation, a view supported by work on transferable adversarial examples.
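
One simple way to estimate such a map is ordinary least squares over a shared set of probe images, sketched below; the paper's exact alignment procedure may differ, and the names are illustrative.

```python
import numpy as np

def fit_alignment(feats_suspect, feats_marker):
    """Estimate a linear map M with feats_suspect @ M ≈ feats_marker.

    feats_suspect: (n, d_s) features of n probe images under the suspect model
    feats_marker:  (n, d_m) features of the same images under the marker's model
    """
    M, *_ = np.linalg.lstsq(feats_suspect, feats_marker, rcond=None)
    # A carrier u in the marker's feature space maps to M @ u in the suspect
    # model's space; the cosine test is then run between M @ u and the
    # suspect classifier's weight vector for that class.
    return M
```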

Experimentally, the authors report that on ImageNet with a ResNet-18 the carrier directions are detectable with high statistical significance even when only a few percent of the training data is marked, while the accuracy drop of models trained on marked data stays small; as the marked fraction grows the classifier aligns more with the carriers, but it continues to rely mostly on the true semantic features.

The video also sketches a possible next step: instead of shifting last-layer features along a per-class carrier, one could craft inputs that force the network to correlate two intermediate features that are normally independent. Such a mark would be much harder to spot by inspecting the final weight matrix and would instead be verified at test time by feeding crafted probe data and checking whether the internal responses are correlated, although this requires a more sophisticated alignment between networks, since the corresponding feature might sit in a different layer.
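
The test side of that idea could look roughly like the sketch below, which assumes you already know which unit in which layer of the suspect model corresponds to each marked feature (in practice that alignment would itself have to be learned); this is the reviewer's speculation, not something evaluated in the paper.

```python
import torch

@torch.no_grad()
def activation_correlation(model, layer_a, layer_b, unit_a, unit_b, probe_loader):
    """Check whether two chosen hidden units co-activate on crafted probe data."""
    acts = {}
    ha = layer_a.register_forward_hook(lambda m, i, o: acts.__setitem__("a", o))
    hb = layer_b.register_forward_hook(lambda m, i, o: acts.__setitem__("b", o))
    xs, ys = [], []
    for x, _ in probe_loader:
        model(x)
        xs.append(acts["a"].flatten(1)[:, unit_a])
        ys.append(acts["b"].flatten(1)[:, unit_b])
    ha.remove()
    hb.remove()
    a, b = torch.cat(xs), torch.cat(ys)
    corr = torch.corrcoef(torch.stack([a, b]))[0, 1]
    # High correlation on the crafted probes (but not on regular data)
    # would suggest the network was trained on the correlated mark.
    return corr.item()
```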

For evaluation, the paper considers both a white-box test, which inspects the suspect model's last-layer weights directly as described above, and a black-box test, which compares the model's loss on marked versus unmarked data and takes a significantly lower loss on the marked set as evidence that it was trained on. The reviewer is skeptical of this black-box formulation, since having access to per-example loss values is already a strong assumption for a model one supposedly cannot open.
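
A sketch of that loss comparison is below, assuming the suspect model's per-example losses can somehow be obtained; the loader and function names are illustrative, and in practice one would attach a proper significance test rather than eyeball the gap.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def blackbox_loss_gap(suspect_model, marked_loader, vanilla_loader):
    """Average loss on unmarked data minus average loss on marked data.

    A clearly positive gap suggests the suspect model was trained on the
    marked data (it fits the marked examples unusually well).
    """
    def avg_loss(loader):
        total, count = 0.0, 0
        for x, y in loader:
            logits = suspect_model(x)
            total += F.cross_entropy(logits, y, reduction="sum").item()
            count += y.numel()
        return total / count

    return avg_loss(vanilla_loader) - avg_loss(marked_loader)
```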

As an alternative black-box test, the reviewer suggests probing the model with synthesized inputs: starting from random noise, derive an image whose features consist almost purely of one class's carrier and feed it to the suspect model. For a model that never saw the mark, the prediction comes back as that carrier's class only with probability roughly one over the number of classes, so a match is strong evidence; repeating the probe with freshly synthesized inputs strengthens it further, although the probes are not independent, so their individual probabilities cannot simply be multiplied.
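
A rough sketch of such a probe follows. It again assumes access to the marker's own feature extractor phi_own for crafting the input; names, step counts and the pixel range are illustrative, and this is the reviewer's proposal, not an experiment from the paper.

```python
import torch

def carrier_probe(phi_own, suspect_model, u, target_class,
                  shape=(1, 3, 224, 224), steps=200, lr=0.05):
    """Synthesize a pure-carrier input and check the suspect model's prediction."""
    x = torch.rand(shape, requires_grad=True)          # start from random noise in [0, 1]
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        feat = phi_own(x)
        feat = feat / feat.norm()
        loss = -(feat * u).sum()                       # maximize alignment with the carrier
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                         # stay in a valid image range
    with torch.no_grad():
        pred = suspect_model(x).argmax(dim=1).item()
    # A match happens with probability only ~1/num_classes for an unmarked model.
    return pred == target_class
```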

In summary, radioactive data is a simple and appealing idea: introduce one fake, imperceptible feature per class, let a small amount of marked data teach any model trained on it to pay attention to that feature, and afterwards detect this sensitivity with a statistical test. The method gives up a little generalization, needs subspace alignment between networks and does not yet address active defenses, but as an establishing paper it opens a clear direction for tracing how datasets are used.

"WEBVTTKind: captionsLanguage: enare you tired of other people training on your data that annoys me every time it happens ah i'm mad about this uh if only there was a way to somehow mark your data and when other people train on it their computer would explode well this paper is a little bit like this not entirely the explosion part i think they're still working on on a follow-up paper but in this case in this paper called radioactive data tracing through training um by alexander sableroll matis dus cordelia schmidt and erve jigu they develop a method that at least you can detect if a given model was trained on your data or not on your data and they call this process radioactive marking or radioactive data for short so the overview you can see it's pretty easy paper actually the concept is pretty easy and it's a nice concept and it's been around in one form or another it touches on adversarial examples it touches on differential privacy but in essence it works like this if you have suspect if you suspect someone else training on your data or if you just have a data set that you want to protect what you do is you mark it you mark it with this mark and they call this a like a radioactive mark but essentially you just distort your images a little bit then um when someone else trains on that data so here a convolutional neural network is trained on this data and not all of the data needs to be marked they can go as little as like one or two percent of the data being marked then from the output of that network or from the net inspecting the network itself you can then test whether or not this network has been trained on this radioactively labeled data so you will see a clear difference to a network that has been trained on only what they call vanilla data so data that has not been marked so i hope that's that's clear what you do what you do is you train sorry you mark your data what the kind of what bob does no what's the attacker's name i don't know but what eve does um is train here a network on data and you don't know whether it's this or this and then you do a test to figure out which one it is okay so we'll dive into the method and look at how well this works pretty pretty simple but pretty cool so their entire method rests on this kind of notion that these classifiers what they do is if you have a neural network like a convolutional neural network you have your image your starting image of your prototypical i don't know cat and you input this into many many layers of a neural network as we are used to but the last layer is a bit special right because the last layer is the classification layer if let's let's just assume this is a classifier so if this is c410 for example there are ten different classes that you could output and so 10 of these bubbles right here that means that this matrix right here is a number of features let's call it d by 10 matrix okay so the network this part right here we would usually call a feature extractor something like this so the bottom part of the network basically does this it's non-linear transformation and so on extracts d features these are latent features and then those features are linearly classified into 10 classes okay the important part here is that that last layer is actually just a linear classifier and we can reduce this actually down to a two class classifier so the five function would just put points here in somehow you know i let's just make them two classes the x's and the o's and so on so if the phi is good then the last layer has a pretty 
easy job linearly classifying it right here you can see here the file is not very good we can't linearly classify this data so by training the neural network what you do is you make phi such that it will place hopefully the one class somehow on one side the other class on the other side and you can pretty easily linearly classify that data okay the exact slope of this of this line right here the exact location of this line and direction of this line that's what's encoded ultimately in this matrix right here so this matrix now not only for two classes but for 10 different classes it it records these hyper planes that separate one class from the other class and these are in d dimensional space so you have d d-dimensional 10 d-dimensional hyperplanes separating the space of features linearly into the classes so what you can do is you can actually think of this d um sorry of these d dimensions here as features right this is a feature extractor so it provides features to a linear classifier now what this method does is when it radioactively marks data points it simply adds a feature okay so how do you think about these features so for example let's say this is actually this animal classification example and if you are if you are asked to classify cats from dogs from horses and so on one feature could be does it have whiskers whiskers one feature could be does it have fur right you can maybe distinguish cats from turtles and so cats and dogs from turtles um does it have how many legs so the number of legs and so on so you have all these features and the last layer simply linearly classifies those features together what this method does this radioactive measure that it adds a new feature per class so um down here i would add a new feature that says like this is the radioactive feature can i draw the radioactive symbol this is the radioactive feature for the class cat okay and then of course i also have one for dog and and so on so it would add or basically would you don't change the dimensionality but in essence you add one feature per class and that's what they mean here by this direction u so in this high dimensional space that is spanned by these uh d-dimensional vectors and you can so this thing here okay sorry i'm switching back and forth this thing here you can sort of if d is equal to 2 you can imagine it as 10 vectors in a space in this feature space okay 10 of these vectors and whenever you get a point that's is that eight whenever you get a point you simply look at so if you get a data point right in here goes through here you come here and you look with which class does it align more the most and that's how you classify it okay so if you think of this then what you what you want to do is you want to add a feature here such that um this is one per class i'm having trouble articulating this and you want to change your data points here you can see your data points and for this class x we make this radioactive feature right here which is the the blue thing we shift the data into the direction of this feature okay so basically we add the feature u which is just a random vector in this high dimensional space we choose one vector per class but then we shift all the data for that class along this feature so what we are doing is we are introducing fake a fake feature that we derive from the label right so we we kind of cheated here we have x and you're supposed to tell y from it and that's your training data but then we cheat we look at y and we modify x with the feature of that particular 
class so what does that do ultimately we have we end up with u1 u2 and so on so one feature per class it trains the classifier to pay attention to these features right so if u1 is the feature for cat then we train this classifier by training it on the data that has been modified in this way we train it a cat should consist of something that has whiskers has fur has four legs and so on and also has this cat feature okay now the um the danger of course here is that the classifier will will stop to pay attention to anything else and only look at the cat feature because we introduced this feature to every single example that was of class cat so the classifier could have a pretty easy way just looking at this feature determining well all of this is cat and then it would not generalize at all so what we can do is first of all we can make the feature very low signal we can make it very small such that there are other features such as these other features are also pretty easy for the network to pay attention to and second of all we can label not all data and that's what they do here they label maybe ten percent maybe two percent of the data with that which forces the network to pay some attention to this feature but also to pay attention to the other features and that ultimately if you trade this off correctly results in a classifier that it does give up some of its generalization capability because of course zero percent of the test data has these features right here we modify the training data uh to add these features so you give up a little bit of generalization capability but but you force the classifier to pay attention to this feature during training and that is something that you can then detect so you can imagine if you train a classifier that has been trained on training data where some of the training data have these features in here and that's one distinct feature per class right then you can look at the final classifier and figure out whether or not um whether or not the classifier has been trained how how do we do that so let's imagine that in this high dimensional space here the training examples they all you know they point in kind of this direction right here okay so all the training examples of one particular class so this is now the dog class all the training examples point here how would you build your classifier well it's pretty easy i would build it such that the dog class points in this direction okay i'm just erased a bunch of other classes right here now i choose a random feature when i build my radioactive thing i choose a random feature like this one right here okay and what i'll do is i'll shift my training data a bit into that direction okay um how do we do this how are we doing this i'll i'll just dash it okay so i'll shift my training data a little bit into this direction so all of these they move over right here and that's where the final classifier will come to lie a lot more towards this new feature and this is something we can now test with a statistical test and that's what this paper kind of works out in the math so usually if you have two if you have one vector in high dimensional space like this one and then you look at the distribution of random vectors so this one maybe this one this one feels pretty random this one's pretty random okay humans are terrible random number generators but these feel pretty random and you look at the cosines between the random vector and the vector you plotted initially they follow if this is truly random they follow a 
distribution they follow this particular distribution that they that they derive here okay so you can see a classic result from statistics shows that this cosine similarity follows incomplete beta distribution with these parameters now they from this they derive a statistical test so if you know what kind of distribution i um a quantity follows you can derive a statistical test to see whether or not what you measure is actually likely to come from that distribution or not so what we would expect if our data has not been modified is that you know we we choose a random direction a random direction u right here um this is u for dog we choose that random direction and if our training date has not been modified we would expect this dog here to have its cosine similarity to be not very high because there's no reason for it right these are just basically two vectors that are random to each other and in high dimensions they should be almost orthogonal so in high dimensions random vectors are almost orthogonal however if the data has been marked during before training that means if the classifier used our marked data set to train it we would expect this cosine similarity right here to be not orthogonal so to be higher than just random and that's exactly what we can test and that's exactly what you saw at the beginning right here so here is the down here you can see the distribution of cosine similarities and um you can see that if you train with without marked data this centers you know around zero however if you train with marked data you have a statistically significant shift between the marking direction the marking feature and between the classifier direction so the all you have to do is mark your data in this way and then look at the final classifier look and these blue vectors right here these are just the entries of this final weight matrix right these are the blue vectors you look at those and you simply determine if the for the given class if the vector for the given class has a high cosine similarity with the marking direction that you chose to mark your data if it does you can be fairly sure that the network has been trained using your data okay so i hope the principle is clear you introduce a fake feature per class and you make the network pay a little bit of attention to that feature because it's you know a good feature in the training data and then at you know after training you can go ahead and see whether or not the network is actually sensitive to that feature that you fake introduce that is actually not a real feature in the data if the network is sensitive to it you can conclude that um you can conclude that your training data was used uh in order to produce it so there's a couple of finesses right here um so as you might have noticed we introduced these fake features in this last layer feature space right here however our pictures are actually input here in front in front of this feature extractor so we need a way to say what we want to do is we want to say i want this data point here to be shifted in this direction but i actually this data point is actually a result from an input data point i'm going to call this i right here going through a non-linear neural network ending up here so the way this is done is by using the same kind of back propagation that we use when we create adversarial examples so what we do is we define this distance or this distance here where we would like to go and where we are as a loss and then back propagate that loss through the neural network and 
then at the end we know how to change the image i in order to adjust that feature so they define a loss right here that they minimize and you can see here is where you want to go in feature space and they have different regularizers such that their perturbation in input space is not too high and also here their perturbation in feature space is actually not too high so they they want they also have the goal that this radioactive marking cannot be detected first of all and also that is it's it's a robust to re-labeling like if you give me data and i go and re-label it and ask my mechanical turk workers to relabel that data again they will give them the same the same label even if you have radioactively marked them right this paper says nothing about defenses right these things are defended against fairly easily i would guess by by some gaussian blur uh i guess would be fairly effective right here though there are also ways around this this gets into the same discussion as adversarial examples the question here is can you detect somehow in the final classifier whether or not this someone has smuggled radioactive data into you into your training process i'm not sure but i'm also sure there are better ways to radioactively mark right here this is kind of an establishing paper um doing the most basic thing right here interestingly they also back propagate through kind of data augmentation procedures as long as they are differentiable and the last kind of difficulty you have is that these neural networks they are they have some symmetries built into them so if you retrain a neural network there is actually no um so if your neural network's classification let's say it's a three-class classification looks like this right this is the last layer and these are the classes it's determined if you retrain it it might as well be that this now looks like this right so um if you marked it with this direction right here and then you try to recover this direction you'll find that it doesn't work because the entire classifier has shifted so what they have to do is they have to do what they call a subspace alignment which you can do by simply um here determining a linear transformation in the last layer this is usually enough and what this does is so their entire procedure is they train themselves a classifier on unmarked data i forgot this before i should have mentioned this they train themselves a classifier on unmarked data they use that classifier to mark the data which you know you need in order to do this back propagation thing you actually need a working classifier and then when they give the data to someone else to train they are going to train their own classifier on the same data right so there is no guarantee that these two classifiers spaces align especially because you have this kind of symmetry and they say right here we can fix that by if you know we have this classifier and at the end they give us this classifier to test um we can simply determining this linear transformation here that maps one to the other so we go over our data set we determine m a linear transformation so basically here you would determine a rotation of this space that would map one to the other and vice versa this is not exact of course because the two classifiers there's no reason why they should even be linearly related but there is a reason coming from kind of neural network knowledge and that is that we we know or we have a strong suspicion that these neural networks of course if they work well and if they reach good 
accuracy and if they reach similar accuracy it's very probable that they have somehow figured out the same features okay even though these networks learn each feature from scratch and that you as i said you have some symmetries in there but ultimately at the end of the neural network is very likely that the network learns the same features as another network even of a different architecture that has been trained on the same data set this is i think this is supported by research in adversarial examples like this paper adversarial examples are features not bugs you know transfer learning transfer of adversarial examples all of this kind of stuff points to the fact that these networks tend to extract the same features and therefore we can assume pretty safely that up to this kind of rotational symmetry our space our feature space will look the same and therefore determining kind of this linear transformation should probably be enough in order to align one subspace with the other and be able to compare your own classifier that you have trained to mark the data to the classifier that someone else has trained on the same data okay this is kind of the one of the difficulties they had to overcome and um yeah they evaluate this uh they evaluate this experimentally you can see right here on the right 20 of the training data has been marked in order for the orange um for the orange data this is these are random directions so blue would be the correlation with random directions and because sorry orange is the correlation with these carrier directions with the directions of the fake features and green is the alignment with actually the features of the classes itself so you can see even if 20 of the data is marked the classifier still aligns mostly with the features of the actual classification problem it aligns a little bit with the features of the um fake features or with the fake features and it does so such that there is a statistically significant difference between random directions and these and you can see even if two percent of the uh data only are marked so only two percent of the training data has this mark and the mark is always imperceptible right the mark is always such that you can't see it by eye even then you can see that there is a difference so the classifier does learn to pay attention to that feature which is something you can detect afterwards this experiment on the left here is just the same basically saying so up here it starts with not a lot of not a lot of data being marked and you can see it mostly aligns with the semantic direction which is the true features as you mark more and more of the data it goes down and down and down but it does not so i think this is 50 is the yellow 50 of the data is marked and still you can see there is a pretty good alignment with the actual features because the network um will start paying more and more attention to your fake features because they're pretty good predictors right but it also has this other training data that it can solve using those features so it still needs to pay attention and of course your marked data also has these these other true features so it is to be expected that even though your data is marked it's still the classifier still aligns more with the true features than with your fake features and they also show in experiments that you do not sacrifice a lot in accuracy so here you can see the delta in accuracy it through their experiments is fairly fairly low and they they do imagenet on resnet18 so these differences in 
accuracies there they are you know you notice but they are fairly small so you know some someone someone also couldn't just go on on a big accuracy drop when training on data like this so someone someone training with data couldn't just notice that it's radioactively marked by just saying like well this doesn't work at all i guess some clustering approaches would work where you look at the features and you just see this one feature is like only present in this very particular group of data that i got from this very shady person selling me 3.5 inch floppy disks around the street corner but other than that yeah it's not really it's not really detectable for someone training on it and lastly they have black box they defend against black box attacks and here is where i'm a bit skeptical they say well if we're we don't have access to the model what we can still do is basically uh this is here what we can still do is we can analyze the loss so we can analyze the loss value of um the radioactively marked data and if the network we're testing is has significantly lower loss on our on the radioactively marked data than on non-marked data then that's an indication that they trained on marked data which you know if you don't have access to the model like what's the probability that you have access to the loss of the model like the usually you need you need the output distribution or something it's a bit shady what i would do actually is um just a little bit more uh sophisticated but what you could do is you could take your direction you right you could back propagate it through your network to derive like a pure adversarial example so not even going from from some image just go from random noise like just derive like a super duper a image that only has that one feature like and then input that into this classifier so this is yours and then input that into the classifier that you are testing okay and if that classifier gives you back the class that you just you know each one of these use is actually of a given class right so you have one feature per class if that gives you back the class of that feature you have a pretty strong indication that someone has been training on your data because so if you look at data in general as we said it has these true features and if it's marked it also has the fake features so what kind of class it's going for you can detect in the output distribution but if you then input like a pure only the fake feature and it still comes out the class that you assigned to the fake feature you know there is a one over number of classes uh probability only that that happens by chance and if you want you can derive a different you can do this again you can drive a different um pure only this feature sample input it again and look what comes out so um it's not it's not a pure test so these are not going to be independent so you probably shouldn't like just multiply but i would think a procedure like this and maybe they'd do this somewhere but they'd simply say we can look at the loss of marked and unmarked data which you know i'm i'm not so sure that that's going to work fairly well okay um as i said there are going to be many many ways to improve this the paper has more experiments ablations transfer learning between architectures and so on i just want to point out i have a so there's a bit of a an issue here where where i think there is a lot of room to grow uh first of all here you simply train the network and then you look at the network at the end right you simply look at these 
10 vectors right here and you determine their inner product with the marking directions and that's you know that's what you what you go by what i would what i would like to see as an iteration of this is where you have a neural network and you you can't just detect by looking at the end what what you'd have to do you'd have to be much more sneaky so in order to avoid detection detecting your detecting strategy so in order to avoid defenses against this um i would i would guess what you want to do is not just you know make the network such that in the end it's fairly obvious if by looking at this last matrix maybe you should only be able to detect this uh at the end by actually feeding data into it like we did with the black box test but if we had a white box test by feeding data into it and then um and then looking at the responses of the network so but someone couldn't not tell it was trained with radioactive data by just looking at the network's weights so maybe one idea would be that you craft inputs in some way that correlates two of the hidden features so let's say we have some hidden layer here and one here and these features are learned by the network right and they appear to be fairly independent so you make sure that they are fairly independent during if you pass regular data and then you craft data specifically you craft data like you did here with the marking that makes the network correlate the two features but has little effect actually on the output distribution of the classes so you can retain your generalization much more right it doesn't change this last layer necessarily that much or not in a completely class dependent fashion what i would simply do is i would correlate two of these internal features i would force the network to learn to correlate them and because then i would expect this to be much more you know secretive and then at test time i can simply introduce my forge data again and look whether or not the internal responses are actually correlated um as i said i could do this across classes to cancel out the effect of this actually being a feature for one given class and therefore changing the network's accuracy too much i think that would be a cool next direction to go into and again this should work because even the intermediate features we have good reason to assume that different networks even different architectures different training runs learn the same kind of intermediate features the question is only in the next network that feature could actually be like you know two layers up or three layers down or and so on so you'd have to learn some kind of more sophisticated alignment there but still i think that would be um kind of an iteration of this which would be cool um you know if if you're doing this inside the channel um yeah all right so that was it uh for me for this paper as i said pretty simple paper pretty cool idea and i'll see you next time byeare you tired of other people training on your data that annoys me every time it happens ah i'm mad about this uh if only there was a way to somehow mark your data and when other people train on it their computer would explode well this paper is a little bit like this not entirely the explosion part i think they're still working on on a follow-up paper but in this case in this paper called radioactive data tracing through training um by alexander sableroll matis dus cordelia schmidt and erve jigu they develop a method that at least you can detect if a given model was trained on your data or not on your data 
and they call this process radioactive marking or radioactive data for short so the overview you can see it's pretty easy paper actually the concept is pretty easy and it's a nice concept and it's been around in one form or another it touches on adversarial examples it touches on differential privacy but in essence it works like this if you have suspect if you suspect someone else training on your data or if you just have a data set that you want to protect what you do is you mark it you mark it with this mark and they call this a like a radioactive mark but essentially you just distort your images a little bit then um when someone else trains on that data so here a convolutional neural network is trained on this data and not all of the data needs to be marked they can go as little as like one or two percent of the data being marked then from the output of that network or from the net inspecting the network itself you can then test whether or not this network has been trained on this radioactively labeled data so you will see a clear difference to a network that has been trained on only what they call vanilla data so data that has not been marked so i hope that's that's clear what you do what you do is you train sorry you mark your data what the kind of what bob does no what's the attacker's name i don't know but what eve does um is train here a network on data and you don't know whether it's this or this and then you do a test to figure out which one it is okay so we'll dive into the method and look at how well this works pretty pretty simple but pretty cool so their entire method rests on this kind of notion that these classifiers what they do is if you have a neural network like a convolutional neural network you have your image your starting image of your prototypical i don't know cat and you input this into many many layers of a neural network as we are used to but the last layer is a bit special right because the last layer is the classification layer if let's let's just assume this is a classifier so if this is c410 for example there are ten different classes that you could output and so 10 of these bubbles right here that means that this matrix right here is a number of features let's call it d by 10 matrix okay so the network this part right here we would usually call a feature extractor something like this so the bottom part of the network basically does this it's non-linear transformation and so on extracts d features these are latent features and then those features are linearly classified into 10 classes okay the important part here is that that last layer is actually just a linear classifier and we can reduce this actually down to a two class classifier so the five function would just put points here in somehow you know i let's just make them two classes the x's and the o's and so on so if the phi is good then the last layer has a pretty easy job linearly classifying it right here you can see here the file is not very good we can't linearly classify this data so by training the neural network what you do is you make phi such that it will place hopefully the one class somehow on one side the other class on the other side and you can pretty easily linearly classify that data okay the exact slope of this of this line right here the exact location of this line and direction of this line that's what's encoded ultimately in this matrix right here so this matrix now not only for two classes but for 10 different classes it it records these hyper planes that separate one class from 
the other class and these are in d dimensional space so you have d d-dimensional 10 d-dimensional hyperplanes separating the space of features linearly into the classes so what you can do is you can actually think of this d um sorry of these d dimensions here as features right this is a feature extractor so it provides features to a linear classifier now what this method does is when it radioactively marks data points it simply adds a feature okay so how do you think about these features so for example let's say this is actually this animal classification example and if you are if you are asked to classify cats from dogs from horses and so on one feature could be does it have whiskers whiskers one feature could be does it have fur right you can maybe distinguish cats from turtles and so cats and dogs from turtles um does it have how many legs so the number of legs and so on so you have all these features and the last layer simply linearly classifies those features together what this method does this radioactive measure that it adds a new feature per class so um down here i would add a new feature that says like this is the radioactive feature can i draw the radioactive symbol this is the radioactive feature for the class cat okay and then of course i also have one for dog and and so on so it would add or basically would you don't change the dimensionality but in essence you add one feature per class and that's what they mean here by this direction u so in this high dimensional space that is spanned by these uh d-dimensional vectors and you can so this thing here okay sorry i'm switching back and forth this thing here you can sort of if d is equal to 2 you can imagine it as 10 vectors in a space in this feature space okay 10 of these vectors and whenever you get a point that's is that eight whenever you get a point you simply look at so if you get a data point right in here goes through here you come here and you look with which class does it align more the most and that's how you classify it okay so if you think of this then what you what you want to do is you want to add a feature here such that um this is one per class i'm having trouble articulating this and you want to change your data points here you can see your data points and for this class x we make this radioactive feature right here which is the the blue thing we shift the data into the direction of this feature okay so basically we add the feature u which is just a random vector in this high dimensional space we choose one vector per class but then we shift all the data for that class along this feature so what we are doing is we are introducing fake a fake feature that we derive from the label right so we we kind of cheated here we have x and you're supposed to tell y from it and that's your training data but then we cheat we look at y and we modify x with the feature of that particular class so what does that do ultimately we have we end up with u1 u2 and so on so one feature per class it trains the classifier to pay attention to these features right so if u1 is the feature for cat then we train this classifier by training it on the data that has been modified in this way we train it a cat should consist of something that has whiskers has fur has four legs and so on and also has this cat feature okay now the um the danger of course here is that the classifier will will stop to pay attention to anything else and only look at the cat feature because we introduced this feature to every single example that was of class cat so the 
classifier could have a pretty easy way just looking at this feature determining well all of this is cat and then it would not generalize at all so what we can do is first of all we can make the feature very low signal we can make it very small such that there are other features such as these other features are also pretty easy for the network to pay attention to and second of all we can label not all data and that's what they do here they label maybe ten percent maybe two percent of the data with that which forces the network to pay some attention to this feature but also to pay attention to the other features and that ultimately if you trade this off correctly results in a classifier that it does give up some of its generalization capability because of course zero percent of the test data has these features right here we modify the training data uh to add these features so you give up a little bit of generalization capability but but you force the classifier to pay attention to this feature during training and that is something that you can then detect so you can imagine if you train a classifier that has been trained on training data where some of the training data have these features in here and that's one distinct feature per class right then you can look at the final classifier and figure out whether or not um whether or not the classifier has been trained how how do we do that so let's imagine that in this high dimensional space here the training examples they all you know they point in kind of this direction right here okay so all the training examples of one particular class so this is now the dog class all the training examples point here how would you build your classifier well it's pretty easy i would build it such that the dog class points in this direction okay i'm just erased a bunch of other classes right here now i choose a random feature when i build my radioactive thing i choose a random feature like this one right here okay and what i'll do is i'll shift my training data a bit into that direction okay um how do we do this how are we doing this i'll i'll just dash it okay so i'll shift my training data a little bit into this direction so all of these they move over right here and that's where the final classifier will come to lie a lot more towards this new feature and this is something we can now test with a statistical test and that's what this paper kind of works out in the math so usually if you have two if you have one vector in high dimensional space like this one and then you look at the distribution of random vectors so this one maybe this one this one feels pretty random this one's pretty random okay humans are terrible random number generators but these feel pretty random and you look at the cosines between the random vector and the vector you plotted initially they follow if this is truly random they follow a distribution they follow this particular distribution that they that they derive here okay so you can see a classic result from statistics shows that this cosine similarity follows incomplete beta distribution with these parameters now they from this they derive a statistical test so if you know what kind of distribution i um a quantity follows you can derive a statistical test to see whether or not what you measure is actually likely to come from that distribution or not so what we would expect if our data has not been modified is that you know we we choose a random direction a random direction u right here um this is u for dog we choose that random 
So what would we expect? We choose a random direction u — say u for the dog class. If our training data has not been modified, we would expect the dog weight vector's cosine similarity with u to be low: there is no reason for it to be anything else, these are just two vectors that are random with respect to each other, and in high dimensions random vectors are almost orthogonal. However, if the data was marked before training — if the classifier was trained on our marked data set — we would expect this cosine similarity to be higher than chance. That is exactly what we can test, and exactly what you saw at the beginning: down here is the distribution of cosine similarities, and if you train without marked data it centers around zero, while if you train with marked data there is a statistically significant shift between the marking direction — the fake feature — and the classifier direction.

So all you have to do is mark your data in this way and then look at the final classifier. These blue vectors are just the rows of the final weight matrix; you look at them and check whether, for a given class, the weight vector has a high cosine similarity with the marking direction you chose for that class. If it does, you can be fairly sure the network was trained on your data.

I hope the principle is clear: you introduce one fake feature per class, you make the network pay a little bit of attention to it because, in the training data, it is a genuinely useful feature, and after training you check whether the network is sensitive to this feature that you artificially introduced and that is not a real feature of the data. If it is, you can conclude that your training data was used to produce the model.

There are a couple of finesses. As you might have noticed, we introduced these fake features in the last-layer feature space, but the actual inputs are images that enter at the front of the feature extractor. So we need a way to say: I want this data point to be shifted in this direction in feature space, even though the data point is really the result of an input image i passing through a non-linear neural network. The way this is done is with the same kind of backpropagation we use to create adversarial examples: you define the distance between where you are and where you would like to be in feature space as a loss, backpropagate that loss through the network, and adjust the image i accordingly. They define exactly such a loss to minimize — here is where you want to go in feature space — together with regularizers so that the perturbation in input space is not too large and the perturbation in feature space is not too large either. They also want the radioactive marking to be undetectable and robust to re-labeling: if you give me data and I ask Mechanical Turk workers to re-label it, they should give the same labels even though the data has been radioactively marked.
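Here is a hedged sketch of that optimization in the spirit of the description above; `feature_extractor`, the Adam optimizer, the loss weights and the clamping threshold are all my own assumptions, and the paper's exact objective and regularizers differ in detail.

```python
# Push the mark back into pixel space, adversarial-example style (illustrative sketch).
import torch

def mark_image(image, u, feature_extractor, steps=100, lr=0.01,
               w_pix=0.1, w_feat=0.1, eps=8 / 255):
    """Nudge `image` so its features move along carrier `u` via a small pixel perturbation."""
    u = u / u.norm()
    with torch.no_grad():
        feats0 = feature_extractor(image)            # original features, for the feature-space regularizer
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        feats = feature_extractor(image + delta)
        align = -(feats * u).sum()                   # move features along the carrier direction
        pix_reg = w_pix * delta.norm()               # keep the pixel-space perturbation small
        feat_reg = w_feat * (feats - feats0).norm()  # keep the feature-space shift small too
        loss = align + pix_reg + feat_reg
        opt.zero_grad()
        loss.backward()
        opt.step()
        delta.data.clamp_(-eps, eps)                 # keep the mark imperceptible
    return (image + delta).detach()
```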
This paper says nothing about defenses, by the way. I would guess these marks can be defended against fairly easily, for instance with some Gaussian blur, though there are also ways around that — it becomes the same cat-and-mouse discussion as with adversarial examples. The question is whether you can somehow detect, in the final classifier, that someone has smuggled radioactive data into your training process. I am not sure, but I am also sure there are better ways to radioactively mark than this one; this is an establishing paper doing the most basic thing. Interestingly, they also backpropagate through data-augmentation procedures, as long as those are differentiable.

The last difficulty is that neural networks have symmetries built into them. If you retrain a network, there is no guarantee the last layer comes out the same: say the classification layer of a three-class problem looks like this; if you retrain, it might just as well look like that. So if you marked with this direction and then try to recover it, it will not work, because the entire classifier has shifted. What they do about this is what they call subspace alignment, which here amounts to determining a linear transformation of the last layer; that is usually enough.

Their full procedure is this — I forgot to mention it earlier: they first train their own classifier on unmarked data, and they use that classifier to mark the data; you need a working feature extractor to do the backpropagation step. Then the data is handed to someone else, who trains their own classifier on it. There is no guarantee that the two feature spaces align, especially because of these symmetries. They fix that as follows: when we are given the other classifier to test, we determine a linear transformation M — essentially a rotation of the space — that maps one feature space onto the other, by going over our data set with both networks.
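A minimal sketch of such an alignment, assuming we can run both networks on the same images and collect their last-layer features; the plain least-squares fit is the simplest choice, not necessarily exactly what the paper does.

```python
# Fit a linear map A so that our features, multiplied by A, approximate the suspect's features.
import numpy as np

def fit_alignment(feats_ours, feats_theirs):
    """Least-squares A with feats_ours @ A ≈ feats_theirs (shapes: n×d_ours, n×d_theirs)."""
    A, *_ = np.linalg.lstsq(feats_ours, feats_theirs, rcond=None)
    return A                                   # shape (d_ours, d_theirs)

# The suspect's logit for class c is feats_theirs @ w_c ≈ feats_ours @ (A @ w_c),
# so A @ w_c is that class's direction expressed in our feature space — the vector
# whose cosine with the carrier u_c we will test.
```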
This is not exact, of course; there is no a priori reason why the two classifiers should even be linearly related. But there is a reason grounded in what we know about neural networks: we strongly suspect that if two networks work well and reach similarly good accuracy on the same data set, they have very probably figured out much the same features, even though each learns its features from scratch and even across different architectures. I think this is supported by research on adversarial examples — papers like "Adversarial Examples Are Not Bugs, They Are Features", the transferability of adversarial examples, transfer learning — all of which points to these networks tending to extract the same features. Therefore we can assume fairly safely that, up to this kind of rotational symmetry, the feature spaces will look the same, and that determining a linear transformation should be enough to align one subspace with the other and to compare the classifier you trained to mark the data with the classifier someone else trained on that data. This was one of the difficulties they had to overcome. A sketch that puts the alignment and the cosine test together follows below, after the experimental results.

They evaluate this experimentally. On the right, 20 percent of the training data has been marked. Blue is the correlation of the classifier with random directions, orange the correlation with the carrier directions — the directions of the fake features — and green the alignment with the semantic features of the classes themselves. Even with 20 percent of the data marked, the classifier still aligns mostly with the features of the actual classification problem; it aligns only a little with the fake features, but enough that there is a statistically significant difference from random directions. And even when only two percent of the training data is marked — and the mark is always imperceptible, you cannot see it by eye — there is still a detectable difference: the classifier does learn to pay attention to that feature, which is what you can detect afterwards.

The experiment on the left says essentially the same thing. At the top, not much of the data is marked, and the classifier aligns mostly with the semantic direction, the true features. As you mark more and more of the data, that alignment goes down, but not by much — I think the yellow curve corresponds to 50 percent of the data marked, and there is still a pretty good alignment with the actual features. The network pays more and more attention to your fake features, because they are good predictors, but it also has the unmarked training data, which it can only solve with the true features, and of course your marked data contains the true features as well. So it is to be expected that even with marked data, the classifier still aligns more with the true features than with your fake ones.

They also show experimentally that you do not sacrifice much accuracy. The accuracy deltas across their experiments are fairly small — they do ImageNet on a ResNet-18 — noticeable, but small. So someone training on this data could not simply conclude it is radioactively marked from a large accuracy drop. I suppose some clustering approaches might work, where you look at the features and notice that one particular feature is only present in the very particular batch of data you got from that very shady person selling 3.5-inch floppy disks around the street corner — but other than that, it is not really detectable for someone training on it.
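As promised above, here is a hedged sketch of how the pieces could fit together for the white-box test: map each suspect weight vector into our feature space with the alignment matrix `A`, compute its cosine with the corresponding carrier, turn that into a p-value, and combine across classes. Fisher's method for combining the per-class p-values is my assumption, not necessarily the paper's choice.

```python
# Hedged end-to-end white-box check on a suspect classifier's last-layer weights.
import numpy as np
from scipy.special import betainc
from scipy.stats import combine_pvalues

def detect_radioactivity(W_suspect, carriers, A, alpha=1e-3):
    """W_suspect: (num_classes, d_theirs) suspect weight rows; carriers: (num_classes, d_ours)."""
    d = carriers.shape[1]
    pvals = []
    for c in range(carriers.shape[0]):
        w = A @ W_suspect[c]                              # class direction mapped into our space
        cos = float(carriers[c] @ w /
                    (np.linalg.norm(carriers[c]) * np.linalg.norm(w)))
        # One-sided p-value under the null that w is independent of the carrier.
        p = 0.5 * betainc((d - 1) / 2.0, 0.5, 1.0 - cos ** 2) if cos > 0 else 1.0
        pvals.append(p)
    _, combined = combine_pvalues(pvals, method="fisher")  # one reasonable way to combine
    return combined < alpha, combined
```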
Lastly, they also consider the black-box setting, and here I am a bit skeptical. If we do not have access to the model, they say, we can still analyze the loss: if the network we are testing has a significantly lower loss on the radioactively marked data than on non-marked data, that is an indication it was trained on the marked data. But if you do not have access to the model, what is the probability that you have access to its loss? Usually you would at least need the output distribution; it seems a bit shaky.

What I would actually do is only a little more sophisticated. Take your carrier direction u and backpropagate it through your own network to derive a pure, adversarial-example-like input — not starting from an image, just from random noise — an image that carries essentially only that one feature. Then feed it into the classifier you are testing. Each carrier u belongs to a particular class, so if the classifier returns exactly the class of that carrier, that is a pretty strong indication it was trained on your data. Ordinary data carries the true features, and marked data additionally carries the fake feature, so which class the model leans toward shows up in its output distribution; but if you feed in an input carrying only the fake feature and the model still outputs the class you assigned to that feature, there is only a one-in-number-of-classes chance of that happening by accident. If you want, you can repeat this — derive another pure single-feature input, feed it in, and see what comes out. These trials are not going to be independent, so you probably should not just multiply the probabilities, but a procedure like this — and maybe they do something like it somewhere — seems stronger to me than comparing losses on marked and unmarked data, which I am not so sure will work that well.
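Here is a sketch of that probe as I would try it — to be clear, this is my suggestion from the paragraph above, not the paper's black-box test; the optimizer, step count and input shape are arbitrary assumptions.

```python
# Synthesize a noise input whose features point almost purely along carrier u_c,
# then check whether the suspect model assigns it the class associated with u_c.
import torch

def carrier_probe(u_c, target_class, feature_extractor, suspect_model,
                  shape=(1, 3, 224, 224), steps=200, lr=0.05):
    u_c = u_c / u_c.norm()
    x = torch.randn(shape, requires_grad=True)         # start from pure noise
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        feats = feature_extractor(x)
        loss = -(feats * u_c).sum()                     # maximize alignment with the carrier
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        pred = suspect_model(x).argmax(dim=1).item()
    return pred == target_class                         # chance level is 1 / num_classes
```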
As I said, there are going to be many, many ways to improve this. The paper has more experiments, ablations, transfer learning between architectures, and so on. I just want to point out one place where I think there is a lot of room to grow. Right now you simply train the network and then look at it at the end: you take these ten weight vectors and compute their inner products with the marking directions, and that is all you go by. What I would like to see as an iteration of this is a scheme where you cannot tell just by looking at the network at the end — where you have to be much sneakier, precisely in order to avoid defenses against this detection strategy. Instead of making the mark fairly obvious in the last weight matrix, maybe it should only be detectable by actually feeding data into the network, as in the black-box test, and looking at the network's responses — so that someone could not tell the model was trained on radioactive data just by inspecting its weights.

One idea would be to craft inputs that correlate two of the hidden features. Say we have one feature in this hidden layer and one in that layer, features the network has learned that appear fairly independent when you pass regular data through it. You verify that they really are fairly independent on regular data, and then you craft data — the same way the marking is done here — that makes the network correlate the two, while having little effect on the output distribution over the classes. That way you retain much more generalization: it does not change the last layer much, or at least not in a completely class-dependent fashion. I would simply force the network to learn to correlate two of these internal features, which I would expect to be much more covert, and at test time I would feed in my forged data again and check whether the internal responses are in fact correlated. As I said, you could do this across classes to cancel out the effect of the mark acting as a feature for one particular class and hurting accuracy too much. I think that would be a cool next direction.

And again, this should work, because even for intermediate features we have good reason to assume that different networks — different architectures, different training runs — learn the same kinds of features. The only catch is that in the other network the corresponding feature might sit two layers up or three layers down, so you would need a more sophisticated alignment. Still, I think that would be a nice iteration of this idea.

All right, that was it from me for this paper. As I said: pretty simple paper, pretty cool idea. I'll see you next time. Bye.