Differential Privacy Theory & Practice with Aaron Roth - #132

The Concept of Differential Privacy in Machine Learning

In the interview, Roth describes work by a former student, Stephen Wu, now a postdoc at Microsoft Research. While at Penn, Wu collaborated with colleagues in the medical school, including professor Casey Greene, to construct differentially private synthetic medical datasets. The project speaks to the importance of differential privacy in medicine, where serious privacy concerns and legal regulations make data sharing difficult.

Medicine is a field poised for significant advances as machine learning is applied to its growing datasets. That same growth, however, raises privacy challenges. Ideally, researchers would share their datasets with other experts so that results can be reproduced and datasets combined; because of privacy concerns and legal regulations, this is often not feasible.

Wu and his colleagues provided a proof of concept that techniques for privately training neural networks can be combined with methods for generating synthetic data using generative adversarial networks (GANs). The resulting synthetic datasets resemble the original data with respect to a large class of machine learning algorithms, so models trained on the synthetic data also perform well when evaluated on the real data. These results demonstrate the potential of differentially private synthetic data in medicine, even though such techniques have so far remained largely in the research stage.
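The private training component of such a pipeline is typically built on noisy, clipped gradient updates (the idea behind DP-SGD). The sketch below is a minimal illustration of that one step for a simple logistic-regression model, not the actual method used in the project; the clipping norm, noise multiplier, and learning rate are placeholder values, and in a real system the noise would have to be calibrated to a target epsilon and delta by a privacy accountant.

```python
import numpy as np

def private_gradient_step(w, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One logistic-regression gradient step with per-example clipping and
    Gaussian noise -- the core ingredient of differentially private training.
    The noise_multiplier must be calibrated (via a privacy accountant) to the
    desired epsilon and delta; that calibration is omitted here."""
    rng = np.random.default_rng() if rng is None else rng
    preds = 1.0 / (1.0 + np.exp(-X @ w))            # sigmoid predictions
    per_example_grads = (preds - y)[:, None] * X    # log-loss gradient, one row per example
    # Clip each example's gradient so no single person can move the model too much.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    # Sum the clipped gradients, add noise scaled to the clipping norm, then average.
    noisy_sum = clipped.sum(axis=0) + rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    return w - lr * noisy_sum / len(X)
```

In the GAN setting, the same clip-then-noise treatment is applied to the updates of whichever network touches the real data, so the generator only ever sees privatized signal.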

The Future of Differential Privacy

Research on differential privacy is ongoing, with many directions and applications. Current methods have practical limitations, and researchers continue to work on efficiency and scalability. One key direction is understanding basic tasks in the local model of differential privacy, which has proven important in practice but is much less well understood than the central model.
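Randomized response, the coin-flip polling protocol Roth walks through later in the interview, is the canonical local-model mechanism. Here is a minimal sketch; the 15% "true" rate and the sample size are made up for illustration.

```python
import numpy as np

def randomized_response(true_answers, rng):
    """Each respondent tells the truth with probability 1/2 and otherwise
    reports a fresh coin flip, giving every individual report plausible
    deniability (this mechanism is ln(3)-differentially private)."""
    tell_truth = rng.random(len(true_answers)) < 0.5
    random_answer = rng.random(len(true_answers)) < 0.5
    return np.where(tell_truth, true_answers, random_answer)

def estimate_fraction(reports):
    """Invert the known noise process: P(report = yes) = 0.5 * p + 0.25."""
    return (reports.mean() - 0.25) / 0.5

rng = np.random.default_rng(0)
true_answers = rng.random(100_000) < 0.15   # assume the true "yes" rate is 15%
reports = randomized_response(true_answers, rng)
print(estimate_fraction(reports))            # concentrates near 0.15
```

Each individual report is protected, yet the population-level fraction can be recovered accurately by inverting the known noise process, which is exactly the local-model trade-off described in the interview.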

Another promising direction is using differential privacy to prevent false discovery and overfitting, even in settings where privacy itself is not the goal: if an analysis pipeline accesses a holdout set only through differentially private computations, adaptively reusing that holdout cannot cause overfitting. This connection offers one way to attack the statistical crisis in science, but practical tools for working data scientists, at realistic dataset sizes, are still some way off.
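One concrete way to realize that guarantee is to let the analyst query the holdout set only through noise-adding mechanisms, so the whole adaptive analysis is differentially private with respect to the holdout. The class below is a minimal sketch of that idea using the Laplace mechanism; the per-query epsilon and the query interface are illustrative assumptions, and real tools (such as the reusable holdout of Dwork et al.) manage the privacy budget more carefully.

```python
import numpy as np

class NoisyHoldout:
    """Answer statistical queries about a holdout set only through the Laplace
    mechanism, so an adaptive analysis that touches the holdout exclusively
    through these answers stays differentially private -- and therefore cannot
    overfit the holdout."""

    def __init__(self, holdout_records, epsilon_per_query=0.1, rng=None):
        self.records = list(holdout_records)
        self.eps = epsilon_per_query
        self.rng = np.random.default_rng() if rng is None else rng
        self.epsilon_spent = 0.0   # budgets compose additively across queries

    def mean_query(self, f):
        """f maps a single record to a value in [0, 1]; the mean then has
        sensitivity 1/n, so Laplace noise with scale 1/(eps * n) suffices."""
        values = np.array([f(record) for record in self.records])
        self.epsilon_spent += self.eps
        noise = self.rng.laplace(0.0, 1.0 / (self.eps * len(self.records)))
        return float(values.mean() + noise)

# e.g. holdout = NoisyHoldout(holdout_records)
#      acc = holdout.mean_query(lambda rec: float(model_predicts_correctly(rec)))
# where holdout_records and model_predicts_correctly are hypothetical stand-ins.
```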

The Current State of Research

Research on differential privacy and machine learning spans both theory and practice. Roth's own work leans toward the theoretical side, but he stresses that practical problems matter too: it has been known since his PhD thesis that useful private synthetic data can be generated for large collections of tasks in principle, yet the known algorithms are inefficient, and practical versions remain an open problem.

One important direction in current research is the local model of differential privacy, in which privacy protections are applied on-device before data is ever collected. The central model, which assumes a trusted curator, has been studied extensively and is better understood; the local model, despite underpinning the largest deployments to date at Google and Apple, is not, and researchers continue to develop techniques for it.

Research Agenda

A related agenda applies differential privacy to the statistical crisis in science. Because differentially private analyses cannot overfit to the particular sample at hand, they offer a principled defense against false discovery when data is reused adaptively. Proof-of-concept results exist, but practical tools that working data scientists can use are still in their early days.

Differential Privacy Resource Center

For more information on the field and recent research, see the Differential Privacy Resource Center that series sponsor Georgian Partners has set up, linked from the episode page, along with Aaron Roth's website and recent publications.

"WEBVTTKind: captionsLanguage: enhello and welcome to another episode of twirl talk the podcast why interview interesting people doing interesting things in machine learning and artificial intelligence I'm your host Sam Cherrington this week on the podcast I'm excited to present a series of interviews exploring the emerging field of differential privacy over the course of the week we'll dig into some of the very exciting research and application work happening right now in the field in this our first episode of the series I'm joined by aaron roth associate professor of computer science and information science at the university of pennsylvania aaron is first and foremost a theoretician and our conversation starts with him helping us understand the context and theory behind differential privacy a research area he was fortunate to begin pursuing at it's very inception we explore the application of differential privacy to machine learning systems including the costs and challenges of doing so aaron discusses as well quite a few examples of differential privacy in action including work being done at Google Apple and the US Census Bureau along with some of the major research directions currently being pursued in the field thanks to Georgian partners for their continued support of the podcast and for sponsoring this series Georgian partners is a venture capital firm that invests in growth stage companies in the US and Canada Post investment Jordan works closely with portfolio companies to accelerate adoption of key technologies including machine learning and differential privacy to help their portfolio companies provide privacy guarantees to their customers Georgian recently launched its first software product epsilon which is a differentially private machine learning solution you'll learn more about epsilon in my interview with Georgians Chang leo later this week but if you find this field interesting I'd encourage you to visit the differential privacy Resource Center they've set up at GPT our SVC / - moai and now on to the show all right everyone I am on the line with Aaron Roth Aaron is associate professor of computer science and information science at the University of Pennsylvania Aaron welcome to this weekend machine learning in AI thanks I'm glad to be here why don't we get started by having you tell us a little bit about your background and how you got involved in the machine learning field sure so I'm a computer scientist and I guess I would describe my background is really coming from theoretical computer science so as someone who you know sits down and tries to understand things by thinking mathematically and improving theorems and the way I came to machine learning in general is well from from a background in in learning theory and in particular the flavor of problems that I've studied both sort of historically and and now have to do with the way that machine learning as a technology interacts with things that you might more normally think of as societal concerns so things like privacy things like fairness and and things that may be more typically an economist would think about like how do machine learning algorithms work in strategic situations you're also very involved in the work happening around the application of differential privacy to machine learning how did you get started down that route well so I I started my PhD in 2006 which is the same year that the first paper on differential privacy was published by Cynthia to work and and kobena sim and Adam Smith and Frank Meharry 
and so this was a very new topic at the time that I was just starting to think about research and you know it struck me as timely and important and at the same time you know when I was just starting to think about it not much was known so it was sort of at the sweet spot for PhD theses you can study an important problem and have a lot of impact without having to do anything too clever well maybe a good place to start in our exploration of differential privacy and machine learning would be to have you define differential privacy and tell us a little bit about the context in which it was created sure so privacy has a relatively long history in computer science but until people started thinking about differential privacy what people meant when they said privacy was some kind of syntactic constraint on what the output of a computation could look like and you know it turns out these kinds of syntactic privacy guarantees don't have a strong meaning in terms of privacy and there was sort of a cat-and-mouse game in which people would attempt to share datasets with some kind of privacy protections and then some clever person would come around and figure out how to get around those privacy protections and this would iterate can you give us an example of those types of syntactic constraints and you know a little bit of how that cat and mouse game evolved sure yeah so maybe the you know simplest thing that you might imagine is you might think to yourself well if I just remove any identifying attributes from a data set so for example if I've got a data set of medical records if I just remove things like name and zip code and maybe a few others that'll anonymize the data and then it'll be safe to release the data set in the clear and unfortunately that turns out not to be the case so there's a bunch of examples of this sort but maybe the first one was a demonstration by Latanya Sweeney at the time she was a PhD student at MIT now she's a professor at Harvard and the state of Massachusetts had released a supposedly anonymized data set of patient medical records so it didn't have people's names attached but what Latanya figured out was that there was another data set that was publicly available that was the voter registration records in Cambridge Massachusetts and you know when you've got two data sets and you know something about an individual for example Latanya knew that the governor of Massachusetts at the time William Weld lived in Cambridge and knew a few other things about him you can basically cross-reference these two data sets and reattach names so that's sort of the simplest example for why attempting to remove identifying attributes doesn't work you know it seems like a good idea in isolation but in the real world there's all of this information out there that you can attempt to cross-reference with existing data sets I think another example along those lines was when Netflix released their anonymized recommendation data set for I think it was the Netflix prize someone or a set of people cross-referenced that to IMDB and found that they were able to de-anonymize a pretty significant portion of those records yeah there was another high-profile example using a more sophisticated technique that was done by Arvind Narayanan who is now a professor at Princeton and Vitaly Shmatikov who's at Cornell Tech that was another example where you know names were removed all that was made available was sort of a big data set where for each person now identified by
a you know supposedly anonymous numeric identifier all you saw about them were which movies they watched what they rated them and approximately when they watched them and as you say by cross-referencing this data set with IMDB they were able to reattach names so differential privacy came about kind of in the wake of the broader realization of the failure of anonymization it sounds like exactly so I think the key insight that the creators of differential privacy had was that you know if you want to speak rigorously about what someone can infer about and and given what they observe about an algorithm you shouldn't be trying to put syntactic constraints on the output of that algorithm but rather you should be putting information theoretic constraints on the algorithmic process itself on the computation yeah and so that's exactly what differential privacy does what differential privacy means and formally it's a constraint on an algorithm and it says small changes in the input to an algorithm for example adding or removing the record of a single individual should have only small changes on the distribution of outputs that the algorithm produces so if I remove your record entirely from a data set that shouldn't cause any observable events anything that the algorithm might do when computing on the data set to become too much more or less likely and this kind of probabilistic information theoretic constraint turns out to have have a really strong semantics about what you know an adversary you know an attacker can infer about your data one of the subtleties in the way you describe that is that different and maybe it's not so subtle but differential privacy isn't an algorithm itself it's a constraint on an algorithm is that am i hearing that correctly that's right yes so differential privacy is a property that an algorithm you know might or might not have any particular algorithm might be differentially private or might not be and a lot of the into the definition of the constraint it's it's relatively simple but a lot of the science that goes into studying differential privacy asks the question you know if I've tied my hands in this way and what kinds of algorithms I can use what tasks can I still perform and as you say differential privacy is a family of it's a constraint it's not an algorithm so to show that you can do something subject to differential privacy it's sufficient to exhibit the differentially private algorithm that does that thing but to prove lower bounds to show that for some problem you cannot solve it subject to differential privacy that's all matter entirely you have to write down a mathematical proof that no algorithm could solve it subject to the constraint hmm so how does that relatively simple sounding constraint lead to the benefits of privacy I guess most it basically it provides a guarantee of plausible deniability so let me maybe to make things less abstract let me give you an example of a very simple intuitive differentially private algorithm so suppose that I want to conduct a poll and I want to find out amongst all of the citizens in Philadelphia how many of them voted for Donald Trump in the last election okay you know the most straightforward way to do this is I would call up some random sample of individuals on the phone and I'd ask them you know did you vote for Donald Trump in the 2016 election and I'd write down their answer on a on a piece of paper and then when I was all done when I'd called you know sufficiently many people I'd tally up their answers I'd find 
that you know 15% of people voted for Donald Trump I'd do some statistics to attach error bars to that and then I'd publish and publish the publish the statistic okay now note that like the thing that I wanted to find out was just this property of the distribution what fraction of people voted for Donald Trump but like incidentally along the way I accumulated this large body of potentially sensitive information what individual people voted for who individual people voted for right okay so think about the following alternative polling procedure which turns out to be differentially private and will allow us to figure out the distributional statistically care about what fraction of people voted for Donald Trump without having to collect you know sensitive information about individuals okay so what so I'm gonna again call up some large collection of people but now instead of telling them to instead of instructing them to tell me whether they voted for Donald Trump I'll tell them the following thing I'll say flip a coin if it comes up heads tell me truthfully whether you voted for Donald Trump or not if it comes up tails though tell me a random answer by which I mean flip the coin again and tell me Trump if it came up heads and tell me not Trump if it came up tails hmm so and importantly you're not gonna tell me how the coin came out okay so so I hear Trump or not Trump but I don't know how your coins were realized so I don't know whether you're telling me the truth or whether you're lying simply because of how the coins were flipped mm-hmm okay so on the one hand you now have a significant amount of plausible deniability okay if all of a sudden the country collapses into a into a police state and my polling records are subpoenaed and you're called in front of the truth commission and it's suggested that you voted in one way or the other you can reasonably say no I didn't the the answer that you're reading was simply the result of a coin flip because I was following this randomized protocol right okay so you have plausible deniability that's the guarantee of privacy on the other hand even though I cannot form strong beliefs about what any individuals data looks like in aggregate I can I can I can figure out to a high degree of accuracy what fraction of people voted for Donald Trump and that's because I understand the process by which noise was injected into these measurements okay so in aggregate the noise behaves in a very structured easy-to-understand way and I can subtract that noise out of the average even though I cannot figure out for any individual who they voted for that strikes me as somewhat counterintuitive and that you know thinking about the coin flip you know almost half of your data could be corrupted or a quarter maybe a quarter exactly so if you think about it right it's a simple calculation if I know that for every individual three quarters of the time they're telling me the truth and a quarter of the time they're telling me the opposite of the truth then say fifteen percent of people in Philadelphia truly voted for Donald you know I can write out on a piece of paper what percentage of people in expectation should report to me under this protocol voted for Donald Trump right and because I've got a bunch of independent samples the actual number of people who end up telling me this will be highly concentrated around its expectation right and therefore you know what I need to do as the pollsters work backwards what I know is the number of people who actually told me they voted 
for Donald Trump but because I can compute this one-to-one mapping between the number of people who really did and the number of people I expect to tell me this I can invert the mapping and figure out what the what the underlying population statistic was to a high degree of accuracy okay it sounds like you know the the method you you're describing could be used both prior to data collection by instructing your respondents to follow this algorithm or if you're an organization collecting data you could collect the actual responses and then apply this algorithm before publishing the data that's right so the interaction that I described to you was what's called the local model of differential privacy and you know in this scenario people's privacy was protected even from the pollster okay right he never wrote down the data and of course if that's what you want then you have to apply these protections when the data is being gathered right but as you say if I'm an organization that's already got the data I can apply privacy protections to the output of my computations to anything I release so then obviously privacy isn't protected from me the organization that has the data I've already got the data and the clear but assuming there are no data breaches and no subpoenas differential privacy can guarantee something about what any outside observer can learn about individuals in the data set by observing the outcomes of computations and you're careful to say guarantees something what exactly does differential privacy guarantee and what what does it not guarantee yeah so one thing that has been alighted in this discussion is that there's a qualitative parameter that comes with differential privacy it's called Epsilon but what differential privacy guarantees formally is that no event becomes much more likely if your data is in the data set compared to if it's not but what is much more likely mean that means more likely by some multiplicative factor that depends on this parameter Epsilon so suppose epsilon is small this is a pretty good guarantee because it says whatever it is that you're worried about from the perspective of privacy you know whatever whatever harm you're worried that the use of your data will cause to befall you that harm is even though I don't know what you're worried about I can promise you that the risk the increased risk of that harm but following you can be bounded as a function of this parameter Epsilon but of course if epsilon is is very big then that's not a very strong guarantee so it doesn't really mean anything if someone tells you that their algorithm is differentially private unless they also tell you what this privacy parameter is as the privacy parameter goes to infinity differential privacy is no constraint at all as it goes to zero you know it becomes a very strong constraint going back to this fundamental constraint it's that you know within the bounds of Epsilon adding or removing an individual piece of data won't change the statistics of your overall distribution is that correct that's right it won't change the behavior of your algorithm okay so netting right removing a single data point won't cause your algorithm to do something that is very different from the point of view of an observer and so how does how do we get from there to again the the notion of privacy and I guess you were setting that up in talking about the plausible deniability example yes so there's a couple of interpretations of differential privacy and I can I can walk through a few of them so one 
is this plausible deniability guarantee right so if if someone accuses you of having some particular data record and the piece of evidence they have at their disposal is the output of a differentially private computation you have plausible deniability in the sense that you can say that the piece of evidence they have in hand is essentially as likely to have been observed again up to this factor relating to the privacy parameter if your data had been different another way of saying this is you know suppose they've got some prior belief about what your data point looks like and then they observe the output of a computation and so they update their belief to some posterior belief using Bayes rule for example then what differential privacy promises is that they would have performed nearly the same update and therefore had nearly the same beliefs had your data have been different if we you know hold the rest of the data set fixed another interpretation is is this model of harm and you can you can view this as a sort of utilitarian definition of privacy yeah like how hard is it for me to convince you to contribute your data to my data set if I'm gonna use it for some statistical analysis well why wouldn't you want to contribute your data there's any number of reasons and I might not know what they are but you know presumably you're worried that as the result of the use of your data something bad is gonna happen to you maybe your health insurance premiums we're gonna rise or maybe you're gonna start getting drunk phone calls during dinner and what differential privacy can promise is no matter what event that you're worried about and I really mean no matter what event so we can talk about the probability that your health insurance premiums rise for the probability that you get spam phone calls this event will be will have almost the same probability up to again this privacy parameter in the following two hypothetical worlds in the one world you don't contribute your data to the computation and in the other world you do and every thing else has held constant between these two worlds that's the that's the difference in differential privacy right so if I look at the two different worlds that are identical except for this one fact whether you contributed your data to my analysis or whether you did not then the events that you're worried about whatever they are become almost no more likely when you contribute your data hmm and does that interpretation it sounds like that assumes some anonymization it doesn't assume anything all that follows sort of directly from the definition of differential privacy that if you like that is the definition of differential privacy I guess I think I maybe I'm thinking of this in in some perverse way but if I include my data and my data includes my phone number how does differential privacy address that oh well your data can include anything you like including your phone number but a differentially private algorithm certainly can't look at your data record and publish your phone number right and so is the idea that we're applying the the coin flip for example to the publishing you know maybe it would randomly spit out phone numbers or something like that yeah I think I'm getting stuck in a rat hole here but so maybe one thing that's useful to keep in mind you know you can try to write down a differential private algorithm for anything you like but it's only for certain kinds of problems for which differentially private algorithms are gonna be able to do anything useful 
and those are our statistical problems where the thing that you want to estimate is some property of the underlying distribution roads it's very good for sort of machine learning if I want to learn a classifier that on average makes correct predictions but if I want to learn what your phone number is you know it's all well and good that I want to learn that but by design you know there is no differentially private algorithm that will give me a non-trivial advantage over essentially random guessing differential privacy isn't compatible with with answering the kinds of questions that have to do with just a single individual and that's by design so that's a great segue to the applications of differential privacy in machine learning can you maybe start by talking about the specific machine learning you know problems or examples that differential privacy is is trying to address in that application and maybe talk through some of the specifics of how that's done sure so there's a couple of things that you might want to do subject to differential privacy when you're doing machine learning so one is just that you might want to solve a single machine learning problem subject to differential privacy so maybe you've got some for example supervised classification tasks you you've got some labeled data you'd like to learn you know a support vector machine or a neural network of some sort that will minimize some loss function maybe my classification error when I apply it to new data okay so that's the just the standard machine learning problem and differential privacy is extremely amenable to this kind of problem essentially because well there's several reasons but maybe the most fundamental reason is that this is inherently a statistical problem in the sense that the thing that I already for reasons of overfitting wanted to avoid when I'm solving a machine learning problem depending too heavily on a single data point right so so machine learning and privacy they're sort of a lined in the sense that they're both trying to learn facts about the distribution without overfitting to the particular data set I have on hand right overfitting is closely related to privacy violations and we can talk more about that the connection turns out to go both ways another thing that you might want to do is more ambitious you might want to construct a synthetic data set by which I mean like rather than solving a single machine learning problem maybe you want to construct a data set that looks like the real data set with respect to some large number of statistics or machine learning algorithms but it's nevertheless differentially private so I can construct this synthetic data set with a private algorithm and then release it to the world and then other people can go and try to apply their machine learning algorithms to this synthetic data set the hope being that insights that they derive you know classifiers they train on the synthetic data set would also work with the real data set mm and then and then finally and this relates back to the connections between differential privacy and overfitting it might be that you don't care about privacy at all in your application but you know you want to for example repeatedly test hypotheses or fit different kinds of machine learning models while reusing the same holdout set over and over again maybe because it's too expensive to get more data now normally this would be a really bad idea sort of the the test the standard test train methodology and machine learning entirely falls apart 
basically if you if you if you reuse the holdout set as part of an iterative procedure by which you're choosing your model but as it turns out when you perform your entire pipeline of statistical analyses subject to the guarantees of differential privacy you can't over fit so you can be accurate in sample you can be guaranteed that you learn an accurate model out-of-sample even if you've repeatedly reused the data and when you say perform your entire pipeline subject to the guarantees of differential privacy does that mean you are enforcing those constraints at every individual step or the you know relative to the inputs and outputs of the entire pipeline it means that you know everything you shouldn't have touched the data in any way using a non differentially private algorithm so differential privacy has a very nice property that it composes if I have two algorithms you know the first one is epsilon 1 differentially private and the second one is epsilon 2 differentially private then if I run the first one and based on the outcome decide what I want to do at the second step and then I run the second algorithm this whole computation is in aggregate epsilon 1 plus epsilon 2 differentially private so you know if at every decision point I'm making my decision about what to do next as a function only of differentially private access to the data then you've got these strong safety guarantees about overfitting so maybe to make it a little bit more concrete I I've heard a few examples of scenarios that pop up in the machine learning world and I'm vaguely recalling them maybe you can provide a bit more detail but one of them was an example of a it's almost like reverse engineering an object detector to determine whether an individual or a face detector to determine whether an individual face was in the training data set okay so you're talking about maybe an attack on a classifier that wasn't trained in a differentially private way and the kind of thing that you might the kind of reason why you might want to have privacy guarantees when you're training a learning algorithm that they don't come for free if I'm exactly exactly yes so I think there's a couple of these kinds of attacks now and I don't know the details of the specific one you're referring to but you might have you know a priori before you started worrying about privacy think that you know okay maybe if I'm you know releasing individual data records like in the Netflix example I have to worry about privacy violations but if I'm if I'm just training a classifier how could how could releasing only the model parameters you know the weights in the neural network possibly violate anyone's privacy exactly yeah and that intuition is wrong and you're describing the kind of attacks that that show that it's wrong but the basic I think premise underlying these attacks is that when you train a model without differential privacy it'll tend to overfit a little bit even if even if this doesn't really affect the model performance but what you find is that you know when you try to classify a face for example that was in the training data set the model will tend to have higher confidence in its classification than when you try to classify an example that was not in the training dataset okay sort of natural that you would expect that because the model got to update itself on the examples in the training set right and by taking advantage of that you can therefore figure out whether a particular person's picture was in the training data set or not by 
examining what is the confidence in the models prediction mm-hmm okay are there other examples that come to mind of you know where the the notion of distributing models or are you know more generally I guess statistical aggregates can fail to be privacy-preserving so maybe the most obvious example is is sort of naive training of a support vector machine so the simplest way to you know the most concise way for me to describe to you the model of a trend support vector machine is by commuting communicating to you the support vectors but the support vectors are just examples from the training data sets so you know the most the most straightforward way to distribute a train support vector machine is to transmit to you some number of examples from the training data set in the clear mhmmm so that's sort of obvious once you realize it but you know is one of the things that you might not have thought of initially when you if you're coming from this position that things that represent just aggregate statistics like trend models shouldn't be dis closest okay okay what I'm hearing is you know I guess granted some in some classes of problem maybe privacy isn't you know the the greatest concern but if differential privacy were free and easy to apply everywhere you know I might do that what are some of the you know the issues or costs of applying differential privacy that come up when trying to apply it in the machine learning context yeah so it definitely doesn't come for free and I think there's costs of two sorts so the first is sort of maybe it's difficult to just acquire the expertise to implement all of these things at the moment you know a lot of the knowledge about differential privacy at the moment is contained and hard to read you know academic papers there's not that many people who are trained to read these things so so if you're just some ran company it can be hard to even get started but maybe the more fundamental thing is that although differential privacy is is compatible with machine learning by which I mean in principle anything that you can any statistical problem that is susceptible to machine learning absent differential privacy's you know can also be solved with machine learning with differential privacy guarantees the cost is that if you want strong differential privacy guarantees you'll need more data and if you want the privacy parameter to be small this thing that governs the strength of the privacy guarantee you might need a lot more data to achieve the same accuracy guarantees so as a result it can be a tough sell to apply privacy technologies in a setting in which you know developers and researchers already have direct access to the data set because the data set is not getting any bigger so if yesterday you could do your statistical analyses on your data set of ten thousand records and today I say now you've got to do it subject to differential privacy the accuracy of your analyses is going to degrade the the place in which I've seen it successfully deployed overcoming sort of this kind of objection in industry has been in settings where because of privacy concerns developers previously didn't have access to the data at all and they're now you know once privacy strong privacy guarantees are built in are able to start collecting it so much it's a much easier sell if the privacy guarantees are going to give you access to new data sets that previously you couldn't touch because of privacy concerns than it is to sort of add-on ex-post when previously you were able to ignore 
privacy or the cost of privacy will tend to come in the form of of less accuracy in terms of your you know classification error for example okay some of the known uses of differential privacy here at places like Google Apple Microsoft on the US Census Bureau are there are you familiar with those examples and what they're doing and can you talk about the ones that you are sure so Google and Apple are both using differential privacy in the local model this this model of the the polling agency trying to figure out how many people voted for Donald Trump in the example that I give ok so both of them are collecting statistics Google in the in your Chrome web browser and Apple on your iPhone in which the privacy protections are added on device and what they're trying to do or are collect simple statistics population wide averages so if you look at the the Apple paper for example they're collecting statistics on you know like what are the most frequently used emojis in different countries or for different web sites what fraction of people like it when videos automatically play as opposed to requiring some human intervention so they're trying to collect population wide statistics that allow them to improve user experience or you know improve things like you know predictive typing and things like that mm-hmm the US Census is doing something more ambitious and you know the US Census collects all the data in the clear right so so we're not prettier they're not trying to protect the privacy of your data from the census they're collecting it instead they're using differential privacy and in the centralized model but they release huge amounts of statistical information so you know you can go on existing census websites and you know figure out question you figure out the answers to questions like you know how many people live in this collection of zip codes and work in this other collection of zip codes okay and they're gonna continue releasing these large amounts of of statistical information about the US population but for the 2020 census they're going to release it subject to differential privacy protections interesting and so they're releasing not individual data records but more these statistical aggregates subject to difference for privacy that's right so in all of these applications what's being released or statistics rather than actual synthetic data sets right as far as I know I don't know the details of what the census plans to do and I'm not sure that's even been pinned down in an academic setting I have a former student Stephen Wu who's now a postdoc at Microsoft Research but when he was here he worked with colleagues in the medical school professor named Casey Greene to construct differentially private synthetic medical data sets so medicine is a field that's got a big problem in that there's a lot of data and it's starting to be susceptible to yielding all sorts of wonderful insights if we apply the latest machine learning technologies to it but medicine is a domain where there are serious privacy concerns and legal regulations and so it's very difficult to share datasets ideally you'd like to share your datasets with other researchers allow them to reproduce the kinds of garments you did on the data combine datasets and so what Stephen and his colleagues did was a I gave sort of a proof of concept that you could use techniques for privately training neural networks and combine those with techniques for generating synthetic data for for training Gans that would let you create synthetic 
datasets that you could release to the public but that would look like the original data set with respect to a very large class of machine learning algorithms so you could train the algorithms on the synthetic data and then you know find that when you evaluated them on the real data they did pretty well ok interesting interesting this is the kind of thing this is sort of the more ambitious kind of technology that I think as far as I know has so far been the domain of only academic research but you know maybe in the in the coming years we'll find you know industrial and meant applications can you maybe share a brief word on the current research areas and and around differential privacy and machine learning so you know there there are many and diverse and there are people focused on more practical problems and more theoretical problems I myself you know just through my natural proclivities tend to focus on sort of the more theoretical problems but I think that it remains even though it's an old problem it remains an important and unsolved problem to figure out sort of practical ways to generate useful synthetic data for large collections of tasks we know we've known for a while you know since my PhD thesis that these kinds of problems are possible in principle there are inefficient information theoretic you know kinds of algorithms to accomplish them but we don't yet have practical algorithms I think that remains very important you know another important direction is is that a lot of the academic literature today it has really focused on this central model of data privacy where there's a trusted database curator who gathers all the data in the clear in part because you can do more stuff in that model so it's attractive to study it but as we've seen differential privacy move from theory to practice you know to date it's to largest scale deployments that Google and Apple are both in the local model and there's a lot of things I think that we understand in the central model of differential privacy that we we still don't understand in the local model and that's too bad because the local models turning out to be very important so I think understanding basic tasks in the local model continues to be very important and I mentioned briefly this sort of research agenda showing that you can use differential privacy to avoid false discovery and overfitting even when you don't care about privacy I think this is one of the most general promising directions to you know attack the the statistical crisis in science but so far we're just in early days you know we have we understand the basic sort of proof-of-concept kinds of results for Y techniques from differential privacy might be useful but we're a pretty far way off from having practical tools that you know working data scientists can use to to prevent overfitting and with practically size datasets okay right we will share a link to your website so folks can take a look at some of your recent work and publications in this area before we close up would you is there anything else that you'd like to share with the audience nope thanks for thanks for listening and you know I'm always happy to hear about interesting new applications of differential privacy so feel free to send me emails when I when I was just getting started and writing my PhD thesis you know all of this was a was a theoretical abstraction and and it's been great fun hearing about and consulting with with companies that are actually putting this into practice so it's been a fun ride and I I 
like to hear what's going on out there fantastic well thanks so much Aaron thank you all right everyone that's our show for today for more information on Aaron or any of the topics covered in this episode head on over to twimlai.com slash talk slash 132 thanks again to our friends at Georgian for sponsoring this series and be sure to visit their differential privacy Resource Center at GPT our SVC slash twill AI for more information on the field and what they're up to and of course thank you so much for listening and catch you next time
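As a closing illustration of the membership-inference risk Roth describes, the toy example below shows the confidence gap that such attacks exploit: a non-private model that has overfit is systematically more confident on its own training points than on fresh ones. The data, model, and regularization setting are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (rng.random(200) < 0.5).astype(int)   # random labels, so any fit is memorization
train_X, train_y, held_out_X = X[:100], y[:100], X[100:]

# Weak regularization (large C) encourages overfitting to the training points.
model = LogisticRegression(C=100.0, max_iter=1000).fit(train_X, train_y)

def confidence(points):
    # Distance of the predicted probability from 0.5, i.e. how "sure" the model is.
    return np.abs(model.predict_proba(points)[:, 1] - 0.5)

# The gap between these two numbers is the signal a membership-inference
# attack would threshold on; differentially private training bounds it.
print("mean confidence on training members:", confidence(train_X).mean())
print("mean confidence on non-members:     ", confidence(held_out_X).mean())
```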
things like fairness and and things that may be more typically an economist would think about like how do machine learning algorithms work in strategic situations you're also very involved in the work happening around the application of differential privacy to machine learning how did you get started down that route well so I I started my PhD in 2006 which is the same year that the first paper on differential privacy was published by Cynthia to work and and kobena sim and Adam Smith and Frank Meharry and so this was a very new topic at the time that I was just starting to think about research and you know it struck me as timely and important and at the same time you know when I was just starting to think about it not much was known so it was sort of at the sweet spot for PhD theses you can study an important problem and and have a lot of impact without having to do anything too clever well maybe a good place to start in our exploration of differential privacy and machine learning would be to have you define differential privacy and tell us a little bit about the context in which it was created sure so privacy has a relatively long history in computer science but until people started thinking about differential privacy what people meant when they said privacy was some kind of syntactic constraints on what the output of a computation could look like and you know the it turns out these kinds of syntactic privacy guarantees don't have a strong meaning in terms of privacy and there was sort of a cat-and-mouse game in which people would attempt to share datasets with some kind of privacy protections and then some clever person would would come around and figure out how to get around those privacy protections and this would iterate can you give us an example of those types of syntactic constraints and you know a little bit of how that cat and mouse game evolved sure yeah so maybe the you know simplest thing that you might imagine this you might think to yourself well if I just remove any identifying attributes from a from a data set so for example if I've got a data set of medical records if I just remove things like name and zip code and maybe a few others that'll anonymize the data and then it'll be safe to release the data set in the clear and unfortunately that turns out not to be the case so there's a there's a bunch of examples of this sort but maybe the first one was a demonstration by Latanya Sweeney at the time she was a PhD student at MIT now she's a professor at Harvard and the state of Massachusetts had released a supposedly anonymized data set of of patient medical records so it didn't have people's names attached but what Latonya figured out was that there was another data set that was publicly available that was the voter registration records and Cambridge Massachusetts and you know when you've got two data sets and you know something about an individual for example Latonya knew that the governor of Massachusetts at the time William weld lived in in Cambridge any of you other things about him you can basically cross-reference these two data sets and reattach names so that's sort of the the simplest example for why attempting to remove identifying attributes doesn't work you know it seems like a good idea in isolation but in the real world there's all of this information out there that you can attempt to cross-reference with existing data sets I think another example along those lines was when Netflix released their anonymized recommendation data set for I think it was the Netflix 
prize some one or a set of people cross-reference that to IMDB and found that they were able to dia naanum eyes a pretty significant portion of those records yeah there was another high-profile example using a more sophisticated technique that was done by Arvind Narayana and who is now a professor at Princeton and V delish Monica who's at Cornell Tech that was another example where you know names were removed all that was made available was sort of a big data set where for each person now identified by a you know supposedly anonymous numeric identifier all you saw about them were which movies they watched what they rated them and approximately when they watched them and as you say by cross-referencing this data set with IMDB they were able to reattach names so differential privacy came about kind of in the wake of the broader realization of the failure of anonymization it sounds like exactly so I think the key insight that the creators of differential privacy had was that you know if you want to speak rigorously about what someone can infer about and and given what they observe about an algorithm you shouldn't be trying to put syntactic constraints on the output of that algorithm but rather you should be putting information theoretic constraints on the algorithmic process itself on the computation yeah and so that's exactly what differential privacy does what differential privacy means and formally it's a constraint on an algorithm and it says small changes in the input to an algorithm for example adding or removing the record of a single individual should have only small changes on the distribution of outputs that the algorithm produces so if I remove your record entirely from a data set that shouldn't cause any observable events anything that the algorithm might do when computing on the data set to become too much more or less likely and this kind of probabilistic information theoretic constraint turns out to have have a really strong semantics about what you know an adversary you know an attacker can infer about your data one of the subtleties in the way you describe that is that different and maybe it's not so subtle but differential privacy isn't an algorithm itself it's a constraint on an algorithm is that am i hearing that correctly that's right yes so differential privacy is a property that an algorithm you know might or might not have any particular algorithm might be differentially private or might not be and a lot of the into the definition of the constraint it's it's relatively simple but a lot of the science that goes into studying differential privacy asks the question you know if I've tied my hands in this way and what kinds of algorithms I can use what tasks can I still perform and as you say differential privacy is a family of it's a constraint it's not an algorithm so to show that you can do something subject to differential privacy it's sufficient to exhibit the differentially private algorithm that does that thing but to prove lower bounds to show that for some problem you cannot solve it subject to differential privacy that's all matter entirely you have to write down a mathematical proof that no algorithm could solve it subject to the constraint hmm so how does that relatively simple sounding constraint lead to the benefits of privacy I guess most it basically it provides a guarantee of plausible deniability so let me maybe to make things less abstract let me give you an example of a very simple intuitive differentially private algorithm so suppose that I want to 
conduct a poll and I want to find out amongst all of the citizens in Philadelphia how many of them voted for Donald Trump in the last election okay you know the most straightforward way to do this is I would call up some random sample of individuals on the phone and I'd ask them you know did you vote for Donald Trump in the 2016 election and I'd write down their answer on a on a piece of paper and then when I was all done when I'd called you know sufficiently many people I'd tally up their answers I'd find that you know 15% of people voted for Donald Trump I'd do some statistics to attach error bars to that and then I'd publish and publish the publish the statistic okay now note that like the thing that I wanted to find out was just this property of the distribution what fraction of people voted for Donald Trump but like incidentally along the way I accumulated this large body of potentially sensitive information what individual people voted for who individual people voted for right okay so think about the following alternative polling procedure which turns out to be differentially private and will allow us to figure out the distributional statistically care about what fraction of people voted for Donald Trump without having to collect you know sensitive information about individuals okay so what so I'm gonna again call up some large collection of people but now instead of telling them to instead of instructing them to tell me whether they voted for Donald Trump I'll tell them the following thing I'll say flip a coin if it comes up heads tell me truthfully whether you voted for Donald Trump or not if it comes up tails though tell me a random answer by which I mean flip the coin again and tell me Trump if it came up heads and tell me not Trump if it came up tails hmm so and importantly you're not gonna tell me how the coin came out okay so so I hear Trump or not Trump but I don't know how your coins were realized so I don't know whether you're telling me the truth or whether you're lying simply because of how the coins were flipped mm-hmm okay so on the one hand you now have a significant amount of plausible deniability okay if all of a sudden the country collapses into a into a police state and my polling records are subpoenaed and you're called in front of the truth commission and it's suggested that you voted in one way or the other you can reasonably say no I didn't the the answer that you're reading was simply the result of a coin flip because I was following this randomized protocol right okay so you have plausible deniability that's the guarantee of privacy on the other hand even though I cannot form strong beliefs about what any individuals data looks like in aggregate I can I can I can figure out to a high degree of accuracy what fraction of people voted for Donald Trump and that's because I understand the process by which noise was injected into these measurements okay so in aggregate the noise behaves in a very structured easy-to-understand way and I can subtract that noise out of the average even though I cannot figure out for any individual who they voted for that strikes me as somewhat counterintuitive and that you know thinking about the coin flip you know almost half of your data could be corrupted or a quarter maybe a quarter exactly so if you think about it right it's a simple calculation if I know that for every individual three quarters of the time they're telling me the truth and a quarter of the time they're telling me the opposite of the truth then say fifteen percent 
So if, say, fifteen percent of people in Philadelphia truly voted for Donald Trump, I can write out on a piece of paper what percentage of people, in expectation, should report to me under this protocol that they voted for Donald Trump. And because I've got a bunch of independent samples, the actual number of people who end up telling me this will be highly concentrated around its expectation. So what I need to do, as the pollster, is work backwards: what I know is the number of people who actually told me they voted for Donald Trump, but because I can compute this one-to-one mapping between the number of people who really did and the number I expect to tell me so, I can invert the mapping and figure out what the underlying population statistic was, to a high degree of accuracy.

It sounds like the method you're describing could be used either prior to data collection, by instructing your respondents to follow this algorithm, or, if you're an organization collecting data, you could collect the actual responses and then apply this algorithm before publishing the data.

That's right. The interaction I described is what's called the local model of differential privacy, and in this scenario people's privacy was protected even from the pollster — he never wrote down the raw data. Of course, if that's what you want, then you have to apply these protections while the data is being gathered. But as you say, if I'm an organization that already has the data, I can apply privacy protections to the outputs of my computations — to anything I release. Then, obviously, privacy isn't protected from me, the organization that has the data: I've already got it in the clear. But assuming there are no data breaches and no subpoenas, differential privacy can guarantee something about what any outside observer can learn about individuals in the data set by observing the outcomes of computations.

You're careful to say it guarantees "something." What exactly does differential privacy guarantee, and what does it not guarantee?

One thing that has been elided in this discussion is that there's a quantitative parameter that comes with differential privacy; it's called epsilon. What differential privacy guarantees, formally, is that no event becomes much more likely if your data is in the data set compared to if it's not. What does "much more likely" mean? It means more likely by a multiplicative factor that depends on this parameter epsilon. If epsilon is small, this is a pretty good guarantee, because it says: whatever it is that you're worried about from the perspective of privacy — whatever harm you're worried the use of your data will cause to befall you — even though I don't know what you're worried about, I can promise you that the increased risk of that harm befalling you can be bounded as a function of this parameter epsilon. Of course, if epsilon is very big, then that's not a very strong guarantee. So it doesn't really mean anything if someone tells you their algorithm is differentially private unless they also tell you what the privacy parameter is: as the privacy parameter goes to infinity, differential privacy is no constraint at all, and as it goes to zero, it becomes a very strong constraint.
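For readers who want the formal statement, the standard textbook definition captures exactly this: a randomized algorithm M is ε-differentially private if, for every pair of data sets D and D' differing in one person's record, and for every set of outcomes S,

\[
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S].
\]

As ε goes to zero the two output distributions become essentially indistinguishable; as ε grows the constraint becomes vacuous, which is the point made above.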
Going back to this fundamental constraint — it's that, within the bounds of epsilon, adding or removing an individual piece of data won't change the statistics of your overall distribution. Is that correct?

That's right — it won't change the behavior of your algorithm. Adding or removing a single data point won't cause your algorithm to do something that looks very different from the point of view of an observer.

And how do we get from there to the notion of privacy? I guess you were setting that up in talking about the plausible deniability example.

Yes, there are a couple of interpretations of differential privacy, and I can walk through a few of them. One is this plausible deniability guarantee: if someone accuses you of having some particular data record, and the piece of evidence at their disposal is the output of a differentially private computation, you have plausible deniability in the sense that you can say the evidence they have in hand was essentially as likely to have been observed — again, up to this factor relating to the privacy parameter — if your data had been different. Another way of saying this: suppose they have some prior belief about what your data point looks like, they observe the output of the computation, and they update to some posterior belief, using Bayes' rule for example. Then what differential privacy promises is that they would have performed nearly the same update, and therefore held nearly the same beliefs, had your data been different, holding the rest of the data set fixed.
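That Bayesian-update interpretation has a compact form, which follows from the definition when the rest of the data set is held fixed: for any prior an observer holds over your record X, any output o of an ε-differentially private mechanism M, and any two candidate values x and x' of your record, the posterior odds can move by at most a factor of e^ε relative to the prior odds:

\[
\frac{\Pr[X = x \mid M = o]}{\Pr[X = x' \mid M = o]}
\;\le\;
e^{\varepsilon}\cdot\frac{\Pr[X = x]}{\Pr[X = x']}.
\]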
Another interpretation is this model of harm — you can view it as a sort of utilitarian definition of privacy. How hard is it for me to convince you to contribute your data to my data set if I'm going to use it for some statistical analysis? Well, why wouldn't you want to contribute your data? There's any number of reasons, and I might not know what they are, but presumably you're worried that, as a result of the use of your data, something bad is going to happen to you — maybe your health insurance premiums are going to rise, or maybe you're going to start getting junk phone calls during dinner. What differential privacy can promise is that no matter what event you're worried about — and I really mean any event, so we can talk about the probability that your health insurance premiums rise, or the probability that you get spam phone calls — that event will have almost the same probability, again up to this privacy parameter, in the following two hypothetical worlds: in one world you don't contribute your data to the computation, in the other world you do, and everything else is held constant between the two. That's the "difference" in differential privacy: if I look at two worlds that are identical except for this one fact — whether or not you contributed your data to my analysis — then the events you're worried about, whatever they are, become almost no more likely when you contribute your data.

Does that interpretation assume some anonymization?

It doesn't assume anything — all of that follows directly from the definition of differential privacy. In a sense, that is the definition of differential privacy.

Maybe I'm thinking of this in some perverse way, but if I include my data, and my data includes my phone number, how does differential privacy address that?

Your data can include anything you like, including your phone number, but a differentially private algorithm certainly can't look at your data record and publish your phone number.

Right. And so is the idea that we're applying the coin flip, for example, to the publishing — maybe it would randomly spit out phone numbers or something like that? I think I'm getting stuck in a rat hole here.

Maybe one thing that's useful to keep in mind: you can try to write down a differentially private algorithm for anything you like, but it's only for certain kinds of problems that differentially private algorithms are going to be able to do anything useful, and those are statistical problems — problems where the thing you want to estimate is some property of the underlying distribution. It's very good for machine learning, say, if I want to learn a classifier that on average makes correct predictions. But if I want to learn what your phone number is — it's all well and good that I want to learn it, but by design there is no differentially private algorithm that will give me a non-trivial advantage over essentially random guessing. Differential privacy isn't compatible with answering the kinds of questions that have to do with just a single individual, and that's by design.

That's a great segue to the applications of differential privacy in machine learning. Can you start by talking about the specific machine learning problems that differential privacy is trying to address, and walk through some of the specifics of how that's done?

Sure. There are a couple of things you might want to do subject to differential privacy when you're doing machine learning. One is simply to solve a single machine learning problem subject to differential privacy. Maybe you've got some supervised classification task: you have labeled data, and you'd like to learn a support vector machine or a neural network of some sort that minimizes some loss function — say, my classification error when I apply it to new data. That's the standard machine learning problem, and differential privacy is extremely amenable to this kind of problem. There are several reasons, but maybe the most fundamental is that this is inherently a statistical problem, in the sense that the thing I already wanted to avoid, for reasons of overfitting, was depending too heavily on any single data point. So machine learning and privacy are aligned, in the sense that both are trying to learn facts about the distribution without overfitting to the particular data set I have on hand. Overfitting is closely related to privacy violations — we can talk more about that; the connection turns out to go both ways.
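This isn't the specific training method discussed in the episode, but one widely used recipe for privately fitting a model with gradient methods — in the spirit of DP-SGD — clips each person's gradient to bound their influence and then adds calibrated Gaussian noise. Below is a minimal NumPy sketch of one step for logistic regression; the hyperparameter names are illustrative, and the privacy accounting (how the noise multiplier, batch size, and number of steps translate into a final ε) is omitted.

```python
import numpy as np

def dp_sgd_step(w, X_batch, y_batch, lr=0.1, clip=1.0, noise_mult=1.1, rng=None):
    """One noisy gradient step for logistic regression: clip each
    per-example gradient to bound any one person's influence, then add
    Gaussian noise calibrated to that bound before averaging."""
    rng = rng or np.random.default_rng()
    preds = 1.0 / (1.0 + np.exp(-X_batch @ w))            # sigmoid predictions
    per_example_grads = (preds - y_batch)[:, None] * X_batch
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip)
    noise = rng.normal(0.0, noise_mult * clip, size=w.shape)
    noisy_mean_grad = (clipped.sum(axis=0) + noise) / len(X_batch)
    return w - lr * noisy_mean_grad
```

In a real system the cumulative privacy loss across all training steps would be tracked with a privacy accountant rather than eyeballed from these parameters.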
Another thing you might want to do is more ambitious: you might want to construct a synthetic data set. By that I mean, rather than solving a single machine learning problem, maybe you want to construct a data set that looks like the real data set with respect to some large number of statistics or machine learning algorithms, but is nevertheless differentially private. I can construct this synthetic data set with a private algorithm and then release it to the world, and other people can go and apply their machine learning algorithms to it, the hope being that the insights they derive — the classifiers they train on the synthetic data set — would also work on the real data set.

And then, finally — and this relates back to the connections between differential privacy and overfitting — it might be that you don't care about privacy at all in your application, but you want to, for example, repeatedly test hypotheses or fit different kinds of machine learning models while reusing the same holdout set over and over again, maybe because it's too expensive to get more data. Normally this would be a really bad idea: the standard test/train methodology in machine learning basically falls apart if you reuse the holdout set as part of an iterative procedure by which you're choosing your model. But as it turns out, if you perform your entire pipeline of statistical analyses subject to the guarantees of differential privacy, you can't overfit: if you're accurate in sample, you're guaranteed to have learned an accurate model out of sample, even if you've repeatedly reused the data.

When you say "perform your entire pipeline subject to the guarantees of differential privacy," does that mean you're enforcing those constraints at every individual step, or relative to the inputs and outputs of the entire pipeline?

It means that you shouldn't have touched the data in any way except through differentially private algorithms. Differential privacy has a very nice property: it composes. If I have two algorithms — the first is epsilon-1 differentially private and the second is epsilon-2 differentially private — then if I run the first one, decide based on its outcome what I want to do at the second step, and then run the second algorithm, the whole computation is, in aggregate, epsilon-1 plus epsilon-2 differentially private. So if at every decision point I'm deciding what to do next as a function only of differentially private access to the data, then you've got these strong safety guarantees against overfitting.
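Stated generally, this basic composition property says that if you run k mechanisms, each ε_i-differentially private, with each one allowed to depend adaptively on the outputs of the previous ones, the overall pipeline satisfies

\[
\varepsilon_{\text{total}} \;\le\; \varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_k .
\]

Tighter "advanced composition" bounds, which grow roughly like \(\sqrt{k}\) rather than k at the cost of a small failure probability δ, are one reason long analysis pipelines remain feasible in practice.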
Maybe to make it a little more concrete: I've heard a few examples of scenarios that pop up in the machine learning world — I'm only vaguely recalling them, so maybe you can provide more detail — but one of them was almost like reverse-engineering an object detector, or a face detector, to determine whether an individual face was in the training data set.

You're talking about an attack on a classifier that wasn't trained in a differentially private way — the kind of reason why you might want privacy guarantees when you're training a learning algorithm, because they don't come for free.

Exactly, yes.

There are a couple of these kinds of attacks now, and I don't know the details of the specific one you're referring to, but a priori, before you started worrying about privacy, you might think: okay, maybe if I'm releasing individual data records, like in the Netflix example, I have to worry about privacy violations — but if I'm just training a classifier, how could releasing only the model parameters, the weights in the neural network, possibly violate anyone's privacy?

Exactly.

That intuition is wrong, and you're describing the kind of attack that shows it's wrong. The basic premise underlying these attacks is that when you train a model without differential privacy, it will tend to overfit a little bit, even if that doesn't really affect the model's performance. What you find is that when you try to classify a face, for example, that was in the training data set, the model will tend to have higher confidence in its classification than when you try to classify an example that was not in the training data set. That's natural — you'd expect it, because the model got to update itself on the examples in the training set. By taking advantage of that, you can figure out whether a particular person's picture was in the training data set or not by examining the model's confidence in its prediction.
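A caricature of that confidence-gap attack, just to show the shape of the idea: given the model's predicted class probabilities, guess "was in the training set" whenever the probability assigned to the example's true label is unusually high. The function and threshold below are illustrative; real membership-inference attacks calibrate the decision rule (for example, against shadow models) rather than using a fixed cutoff.

```python
import numpy as np

def confidence_membership_guess(model_probs: np.ndarray,
                                true_labels: np.ndarray,
                                threshold: float = 0.9) -> np.ndarray:
    """Naive membership-inference heuristic: flag an example as a likely
    training-set member when the model assigns unusually high probability
    to its true label. model_probs has shape (n_examples, n_classes)."""
    confidence_on_true_label = model_probs[np.arange(len(true_labels)), true_labels]
    return confidence_on_true_label > threshold
```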
Are there other examples that come to mind where distributing models — or, more generally, statistical aggregates — can fail to be privacy-preserving?

Maybe the most obvious example is naive training of a support vector machine. The most concise way for me to describe the model of a trained support vector machine is to communicate to you the support vectors — but the support vectors are just examples from the training data set. So the most straightforward way to distribute a trained support vector machine is to transmit some number of examples from the training data set in the clear. That's obvious once you realize it, but it's one of the things you might not have thought of initially if you're coming from the position that things representing aggregate statistics, like trained models, shouldn't be disclosive.

Okay. What I'm hearing is, granted, in some classes of problems maybe privacy isn't the greatest concern, but if differential privacy were free and easy to apply everywhere, I might do it anyway. What are some of the issues or costs that come up when trying to apply differential privacy in the machine learning context?

It definitely doesn't come for free, and I think there are costs of two sorts. The first is that it can be difficult just to acquire the expertise to implement these things: at the moment, a lot of the knowledge about differential privacy is contained in hard-to-read academic papers, and there aren't that many people trained to read them, so if you're just some random company it can be hard to even get started. But maybe the more fundamental thing is that, although differential privacy is compatible with machine learning — by which I mean that, in principle, any statistical problem solvable with machine learning absent differential privacy can also be solved with machine learning subject to differential privacy guarantees — the cost is that if you want strong guarantees, you'll need more data. If you want the privacy parameter to be small — this thing that governs the strength of the privacy guarantee — you might need a lot more data to achieve the same accuracy. As a result, it can be a tough sell to apply privacy technologies in a setting where developers and researchers already have direct access to the data set, because the data set isn't getting any bigger: if yesterday you could do your statistical analyses on your data set of ten thousand records, and today I tell you that you've got to do them subject to differential privacy, the accuracy of your analyses is going to degrade. The place where I've seen it successfully deployed in industry, overcoming this kind of objection, has been in settings where, because of privacy concerns, developers previously didn't have access to the data at all, and once strong privacy guarantees are built in they're able to start collecting it. It's a much easier sell if the privacy guarantees are going to give you access to new data sets that you previously couldn't touch because of privacy concerns than it is to add them on ex post, when previously you were able to ignore privacy. The cost will tend to come in the form of less accuracy — in terms of your classification error, for example.

There are some known uses of differential privacy at places like Google, Apple, Microsoft, and the US Census Bureau. Are you familiar with those examples and what they're doing, and can you talk about the ones you are?

Sure. Google and Apple are both using differential privacy in the local model — the model of the polling agency trying to figure out how many people voted for Donald Trump in the example I gave. Both of them are collecting statistics — Google in your Chrome web browser and Apple on your iPhone — in which the privacy protections are added on device, and what they're trying to do is collect simple statistics, population-wide averages. If you look at the Apple paper, for example, they're collecting statistics on things like which emojis are most frequently used in different countries, or, for different websites, what fraction of people like it when videos automatically play as opposed to requiring some human intervention. So they're trying to collect population-wide statistics that let them improve the user experience, or improve things like predictive typing.

The US Census is doing something more ambitious. The Census collects all the data in the clear — they're not trying to protect the privacy of your data from the Census itself; they're collecting it. Instead, they're using differential privacy in the centralized model. They release huge amounts of statistical information: you can go on existing Census websites and figure out the answers to questions like how many people live in this collection of ZIP codes and work in this other collection of ZIP codes. They're going to continue releasing these large amounts of statistical information about the US population, but for the 2020 Census they're going to release it subject to differential privacy protections.

Interesting. So they're releasing not individual data records but these statistical aggregates, subject to differential privacy?

That's right. In all of these applications, what's being released are statistics rather than actual synthetic data sets — as far as I know; I don't know the details of what the Census plans to do, and I'm not sure that's even been pinned down.
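The Census-style releases mentioned here are, at bottom, counting queries, and the textbook building block for answering a single count privately is the Laplace mechanism: adding or removing one person changes a count by at most 1, so Laplace noise with scale 1/ε suffices for ε-differential privacy. The Census Bureau's actual 2020 machinery is far more elaborate; this sketch, with illustrative names, just shows the basic idea.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng=None) -> float:
    """ε-differentially private answer to a counting query.
    The sensitivity of a count is 1, so the noise scale is 1/ε."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)
```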
In an academic setting, I have a former student, Stephen Wu, who's now a postdoc at Microsoft Research, and when he was here he worked with colleagues in the medical school — a professor named Casey Greene — to construct differentially private synthetic medical data sets. Medicine is a field with a big problem: there's a lot of data, and it's starting to be susceptible to yielding all sorts of wonderful insights if we apply the latest machine learning technologies to it, but it's also a domain with serious privacy concerns and legal regulations, so it's very difficult to share data sets. Ideally you'd like to share your data sets with other researchers, allow them to reproduce the kinds of experiments you did on the data, and combine data sets. What Stephen and his colleagues did was give a proof of concept that you could take techniques for privately training neural networks and combine them with techniques for generating synthetic data — for training GANs — that would let you create synthetic data sets you could release to the public, but that would look like the original data set with respect to a very large class of machine learning algorithms. You could train algorithms on the synthetic data and then find that, when you evaluated them on the real data, they did pretty well.

Interesting.

This is the more ambitious kind of technology that, as far as I know, has so far been the domain of only academic research, but maybe in the coming years we'll see industrial applications.

Can you share a brief word on the current research areas around differential privacy and machine learning?

There are many, and they're diverse — there are people focused on more practical problems and on more theoretical problems. I myself, just through my natural proclivities, tend to focus on the more theoretical side. But I think it remains — even though it's an old problem — an important and unsolved problem to figure out practical ways to generate useful synthetic data for large collections of tasks. We've known for a while, since my PhD thesis, that these kinds of problems are solvable in principle — there are inefficient, information-theoretic algorithms that accomplish them — but we don't yet have practical algorithms, and I think that remains very important.

Another important direction: a lot of the academic literature to date has focused on the central model of data privacy, where there's a trusted database curator who gathers all the data in the clear, in part because you can do more in that model, so it's attractive to study. But as we've seen differential privacy move from theory to practice, its two largest-scale deployments to date — at Google and at Apple — are both in the local model, and there's a lot we understand in the central model that we still don't understand in the local model. That's too bad, because the local model is turning out to be very important, so I think understanding basic tasks in the local model continues to be very important.

And I mentioned briefly this research agenda of showing that you can use differential privacy to avoid false discovery and overfitting even when you don't care about privacy. I think this is one of the most general and promising directions for attacking the statistical crisis in science, but so far we're in early days: we understand the basic proof-of-concept results for why techniques from differential privacy might be useful, but we're a pretty long way off from having practical tools that working data scientists can use to prevent overfitting with practically sized data sets.

Okay. We will share a link to your website so folks can take a look at some of your recent work and publications in this area. Before we close up, is there anything else you'd like to share with the audience?
Nope — thanks for listening, and I'm always happy to hear about interesting new applications of differential privacy, so feel free to send me emails. When I was just getting started and writing my PhD thesis, all of this was a theoretical abstraction, and it's been great fun hearing about, and consulting with, companies that are actually putting it into practice. It's been a fun ride, and I like to hear what's going on out there.

Fantastic. Well, thanks so much, Aaron.

Thank you.

All right, everyone, that's our show for today. For more information on Aaron or any of the topics covered in this episode, head on over to twimlai.com/talk/132. Thanks again to our friends at Georgian for sponsoring this series, and be sure to visit their differential privacy resource center for more information on the field and what they're up to. And of course, thank you so much for listening, and catch you next time.