#27 Data Security, Data Privacy and the GDPR (with Katharine Jarmul)

The Importance of Mindful Data Science: A Conversation with Katharine Jarmul of KI Protect

As I sat down to talk with Katharine Jarmul, co-founder of KI Protect, I couldn't help but feel a sense of excitement and curiosity about the latest developments in data science. Our conversation drew on work by Cynthia Dwork, primary author of a paper connecting differential privacy to ethics, and it highlighted the growing need for mindful data science, in which researchers and practitioners think through the potential consequences of their work before they build.

One of the most pressing issues in data science is security breaches. With sensitive data sitting on laptops, hard drives, and cloud services, there is always a risk that it will be compromised. This is a concern not only for individuals but also for the organizations that handle large amounts of personal data. Katharine explained that KI Protect is addressing these issues by providing tools, such as simple APIs for pseudonymization and anonymization, that protect sensitive information before it ever enters the data science workflow.
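To make the tooling idea concrete, here is a minimal sketch of pseudonymization, one of the techniques Katharine mentions: replacing a direct identifier with a keyed token so the raw value never has to travel through the analytics pipeline. The function name, the key handling, and the HMAC-based approach are illustrative assumptions for this post, not a description of KI Protect's actual API.

```python
import hmac
import hashlib

# Illustrative only: a keyed hash turns an identifier into a stable token.
# The secret key must live outside the analytics system (e.g. in a vault);
# anyone holding the key can re-link tokens to identifiers, which is why
# this is pseudonymization (reversible in principle), not anonymization.
SECRET_KEY = b"replace-with-a-real-secret-from-a-vault"  # assumption: key management handled elsewhere

def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable token for an identifier using an HMAC-SHA256 keyed hash."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

if __name__ == "__main__":
    email = "user@example.com"
    # Same input and key always produce the same token, so joins and counts still work.
    print(pseudonymize(email))
```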

However, we also touched on the idea that we may not need all the personal data we collect to build accurate models. Cynthia Dwork's research on differential privacy shows that it is possible to learn useful aggregate patterns from a dataset while limiting what can be inferred about any single individual in it. This raises interesting questions about how we can privatize our data before the modeling step without sacrificing much accuracy. Katharine emphasized that this is a crucial area of research and development for KI Protect.
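Differential privacy is usually explained through calibrated noise: a query result is perturbed just enough that the presence or absence of any one person cannot be reliably inferred from the output. Below is a minimal sketch of the classic Laplace mechanism for a counting query; the dataset and the epsilon value are made up purely for illustration, and a real deployment would also have to track a privacy budget across repeated queries.

```python
import numpy as np

def dp_count(values, predicate, epsilon: float) -> float:
    """Differentially private count: the true count plus Laplace noise.

    For a counting query the sensitivity is 1 (adding or removing one person
    changes the count by at most 1), so noise drawn from Laplace(scale=1/epsilon)
    gives epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: how many users in a dataset are over 40?
ages = [23, 35, 41, 52, 29, 61, 44]
print(dp_count(ages, lambda a: a > 40, epsilon=0.5))
```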

Moreover, we discussed the importance of ethics in machine learning. As researchers and practitioners, it's essential to consider the implications of our work for individuals and for society as a whole. Do we need to know someone's race or gender to build an accurate model? Can we really justify access to personally identifiable information without consent? These questions require careful consideration and thoughtful discussion.

One of the most promising approaches is ethics by design: data scientists and researchers consider the potential consequences of their work from the outset rather than treating them as an afterthought. Cynthia Dwork's paper connects differential privacy to this idea, and Katharine's point in the episode was that a model which never sees sensitive attributes such as race or gender has the potential to be fairer, and that fairness, like privacy, can be measured rather than assumed.
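One practical way to act on ethics by design is to check how a trained model treats each group instead of relying on a single aggregate accuracy number. The snippet below is a hypothetical sketch of such a per-group check; the column names, the toy data, and the choice of metrics are assumptions made for illustration.

```python
import pandas as pd

def per_group_rates(df: pd.DataFrame, group_col: str, label_col: str, pred_col: str) -> pd.DataFrame:
    """Compare simple outcome metrics per group to surface disparate treatment."""
    def summarize(g: pd.DataFrame) -> pd.Series:
        return pd.Series({
            # How often the model predicts the favorable outcome for this group.
            "positive_rate": (g[pred_col] == 1).mean(),
            # Recall within the group: of the true positives, how many were caught.
            "true_positive_rate": (g.loc[g[label_col] == 1, pred_col] == 1).mean(),
            "n": len(g),
        })
    return df.groupby(group_col)[[label_col, pred_col]].apply(summarize)

# Hypothetical data: group membership, true label, model prediction.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "label": [1, 0, 1, 1, 0, 0],
    "pred":  [1, 0, 0, 1, 1, 0],
})
print(per_group_rates(df, "group", "label", "pred"))
```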

We also talked about the need for better documentation and for systems that track and manage data: where it came from, under what consent it was collected, how it is processed, and when it should expire. As data pipelines grow more complex, robust processes for monitoring and controlling data usage become essential, and under the GDPR they are also a compliance requirement. Katharine noted that this is an area where KI Protect is focused on providing solutions for practitioners.
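The episode doesn't prescribe a specific system, but a minimal version of this kind of tracking can be as simple as storing consent and provenance metadata alongside each record and filtering on it before any processing job runs. The schema below is an illustrative assumption for this post, not a description of KI Protect's product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class ConsentRecord:
    """Illustrative provenance and consent metadata kept alongside a user's data."""
    user_id: str
    source: str                       # where the data came from: signup form, import, partner, ...
    collected_at: datetime
    consented_to_processing: bool     # the opt-in/opt-out flag discussed in the episode
    expires_at: Optional[datetime] = None  # when the data should be deleted or re-consented

def eligible_for_processing(records: List[ConsentRecord],
                            now: Optional[datetime] = None) -> List[ConsentRecord]:
    """Keep only records with active consent that have not expired."""
    now = now or datetime.now(timezone.utc)
    return [
        r for r in records
        if r.consented_to_processing and (r.expires_at is None or r.expires_at > now)
    ]
```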

As we concluded our conversation, I couldn't help but feel a sense of hope for the future of data science. While there are certainly challenges ahead, it's clear that the field is evolving rapidly and that researchers and practitioners alike are starting to take ethics and privacy more seriously. By working together and pushing the boundaries of what's possible, we can create a safer, more responsible, and more equitable data science community.

Final Call to Action

If you're interested in learning more about KI Protect's work on data security and privacy, check out their website at kiprotect.com. They're always looking for feedback and collaboration from the community, so don't hesitate to reach out. And if you're working in the data science space, keep pushing forward on these issues: we need a vocal community of practitioners committed to ethics and privacy by design.

As Katharine emphasized, this is not just about researchers and academics; it's also about practitioners who are building the tools and systems that will be used in the real world. By working together, we can create a future where data science is not only powerful but also responsible and ethical. Thank you to Katharine for joining us for this conversation, and I look forward to seeing the progress that KI Protect makes in the years to come.

"WEBVTTKind: captionsLanguage: enin this episode of data framed a data cap podcast I'll be speaking with Kathryn Sharma a data scientist consultant educator and co-founder of ki protect a company that provides real-time protection for your data infrastructure data science and artificial intelligence Kathryn and I spoke on the 25th of May 2018 about data privacy data security and the GD P R or general data protection regulation which went into effect the week prior what are the biggest challenges currently facing data security and privacy what does the GD P R mean for civilians working data scientists and businesses around the world is data anonymization actually possible or a pipe dream stick around to find out I'm Hugo Bound Anderson a data scientist the data gap and this is data framed welcome to data frame a weekly data count podcast exploring what data science looks like on the ground for working data scientists and what problem is it consult I'm your host Hugo Bound Anderson you can follow me on Twitter as you go down and data camp at data count you can find all our episodes and show notes at data camp comm slash community slash podcast hi there Kathryn and welcome to data frames hey thanks for having me it's such a pleasure to have you on the show so before we dive in to everything we're talking about I just want to let you know I've been receiving far too many emails the past couple of weeks I know do you have any indication why yeah yeah emails that ask you asking for consent this is consensual email at its best it's emails asking you if you want to receive more emails you know and why is this happening what's been happening recently that means I'm getting my inbox is really full yeah so probably if you deal with data and you've already heard of gdpr or the general data protection regulation but it went into effect on May 25th and everybody got a lot of emails just talking about privacy it was really fantastic you know it felt like you know finally consensual data collection we're having lots of conversations but yeah I think most people just kind of deleted them all which is for better or worse maybe what what they all expected when it was sent as a mad rush on the final day for sure and it seems like a lot of its opt-in as well isn't it yeah so the idea behind gdpr is that it's essentially as I said like this consensual driven so consent driven in the sense that you have a right to say okay I'm fine with you using my data in these ways or I'm fine with you collecting my data in these ways or I'm fine with you reaching me out to me in these ways and I think that's a really great step I mean I think consent for everything is really cool Sept and consensual data collection is something that I think we are hopefully starting to realize is really Center of doing ethical data science and so I think gdpr is a nice step towards that it gives a lot of rights to European residents and I wonder how it will also affect data collection for the rest of the world it appears that you know people have different approaches for this as and some people are essentially creating EU only versions of their platform for example like the USA Today EU site and a few of the other papers and publications published an EU only package but what I am hoping is that it also allows for a little bit more consensual data collection of people even outside of the EU right I think this provides a really nice teaser of the conversation we'll be getting into with respect to data security privacy the GDP are whether it's 
enough what we might see outside the EU but before all of that I want to find out a bit about you and maybe you could start and tell us how you got into data science initially yeah so it was definitely by accident as we've kind of I guess I would say a lot of people in my era of computing which essentially was that I was a data journalist I was working at The Washington Post I had had some history and background in some computer science and statistics but I didn't enter it directly after school and and I found myself in data journalism after that I got recruited to work at a few startups doing some sort of larger scale data collection and data engineering essentially back in the initial Hadoop days and from there I went kind of into some ops and security roles automating deployment and kind of leading teams on those types of things and then fell back into Data doing some data wrangling after my book was published with Jacquie castle data wrangling with Python and since then have been focused more on natural language processing and machine learning and lately been thinking a lot about data privacy and data security kind of I guess after ten years in this business you start to like think about the intersections of things you care about and for me that's definitely an important intersection that I think ties in a lot of the passions I have an experience I have in data science as a whole and where is that led you know what are you working on currently yeah so I'm currently building a new startup called ki protects its ki after Kinsley's intelligence which is essentially the German translation of AI and our idea and our goal our solution really is to bring about kind of data science a data science compatible data security layer so the idea is from my experience and from our experience Andres Davis and I we have seen that the security community in the data science community are not necessarily like overlapping in meaningful ways right now and we're trying to think about how we can bring more data security and data privacy concepts to the data science community that makes them really easy to use really I would say like consumer-friendly in a sense of being easily integrate able into systems that you might use to process your data like Apache Kafka and spark and so forth and to make it so that you don't have to have data privacy or data security as a core concept of your data science team you can just do normal data science and you can use our service to help you enforce privacy and security great so it sounds like you're essentially trying to help people keep doing as they're doing and you'll fill in this particular gap for them with respect to data security and data privacy yeah that's the goal is to make it the plug-in for data security or data privacy of course this is a delicate and complex topic so we're exploring kind of what we can guarantee and what integrations make sense for different types of companies so we don't have like a full product spread available yet but this is something that we're actively experimenting with researching and working on we're fairly confident that we can come up with a few different methods that allow you to use simple api's for sudonym ization and anonymization of your data sets awesome so I want to now get into data security and data privacy but before that you mentioned you love the NLP and I just wanted to let everyone know that you've also got a great data camp course on fundamentals of NLP and Python which I had the great pleasure of working on with you 
yeah it was super fun and I loved all the great feedback and so forth from folks and yeah if you're starting to get into natural language processing or you're curious what it's all about I can definitely recommend taking that as well as the follow up courses which allow for some fun experimentation with some of the common and best libraries in natural language processing exactly so let's jump in what are the biggest challenges currently facing data security and data privacy in your mind yeah so I think one thing that I've noticed over time is the core competency of most data scientists is not necessarily focused on security and privacy now we're starting to see you know perhaps with for example the Apple differential privacy team and the Google brain research that has been focused on security and machine learning more overlap but the average person who has studied statistics or machine learning and who's doing this in the field they don't necessarily have a strong background in computer security or in data security or InfoSec as we might call it right and this is not their pro you know I don't see this as a fault of theirs it's nothing lacking right they have a lot of their own specialized training but the unfortunate circumstance of that is that a lot of the way that we manage and handled a it's not necessarily the most secure way and it definitely doesn't always take I the ideas of privacy or even user sensitivity in the sense of do I actually need access to full user data it doesn't really take these into account very often and therefore as data scientists you know we have access to potentially millions of people's personal data their messages their emails or chats their purchase history we have access to all of these things and you know kind of my question is is do we actually need access to all of these to do our job properly and I think this is perhaps a big oversight in terms of how we've built up the data management and data science and bi platforms that we use today you spoke to the lack of focus on knowledge with respect to data security do you think this is related to the lack of focus on kind of building ethical systems in general for data science well yeah I mean one of the conversations I've found myself having recently in lieu of the gdpr is that it's been really painful for people to implement consensual data collection in their data science and you know that why is that it's because the software is not designed with the user in mind right so the software's may be designed with the end-user the internal team in mind but is often not designed with the actual customer or the client in mind if we had a software that was slightly more driven by the clients desires or demands like this kind of touches upon design thinking right then it should be cognizant that when we collect user data that we have marked when they consented that we have marked what is the provenance of the data that we have marked how was the data processed and all of these things and the fact that you know data provenance has been more of an aspect of research than actually an aspect of every type of data collection software that you can imagine this is really problematic right because we have accumulated all this data and you know for some larger corporations sometimes they have purchased data or they have aggregated data from data marketplaces and so forth and this means that they now have all of this data some of which was given directly and consensually and some of it which was just collected by purchasing 
power or by buying another company and so forth and so you know this is a nightmare of course when it comes to GPR and you have to figure out and sort out what data was given by whom and under what circumstances but why why do we have a why might we have this problem in the first place like why can't we just have perhaps state marketplaces where consumers directly sell their data if they're going to do that or also why isn't data provenance and essentially like where this data come from and when does it expire how long is it good for why aren't these like a normal part of how we do data management from the beginning I'm interested in how you feel the average if this is even a well form question how the average data scientist responds to you know this type of legislation being passed and if you can't speak to the average maybe you could give a variety of responses that let you think of paradigms of how the community is responding yeah I mean I think that I guess I would say that I have a feeling people on are inherently good and want to build ethical systems this is like more the viewpoint that I'm coming from and I think that a lot of people are like okay this is painful but I want to be able to do the right thing I want to be able to do epical data science what does this mean how might I have to change the ways that I currently process data so I think it's sparking a lot of conversations that are thinking okay well perhaps in the past we haven't done this very well how might we start again or how might we better do this in the future but then I do think that there are some people that are just like I don't see it as a nuisance and you know there's been this big rash of variety of software and other platform vendors that are simply saying well we're not going to sell to EU residents anymore and I see this as terrifying right because why would I want to use a service that can't guarantee that they're going to ask me if they can use my data right this is I think you know shows that there's essentially I would I would argue that there's a big divide between those let's see privacy as a burden and those that see privacy as maybe something that we can strive for that we need to think about and perhaps change the systems and processes that we use in the mean time and how do you think data scientists generally feel about the idea of sacrificing some model performance for having more ethical models yeah I mean I think that this is difficult so I've spoken on the topic of ethical machine learning a few times now and a few times the reaction was very negative and people were like well I don't really see why this is my problem and I think that unfortunately there is some of that idea like well if black folks are treated differently by cops why why should I have to essentially change the distribution of my dataset to compensate for this right they say well well the data is there and that's what the data says and so I'm just going to build exactly what the data says and I would say that that is a choice in an action in and of itself and if you're making that choice in action you're essentially automating inequalities and you're automating biases societal biases and when you choose to do that you're making a statement and I would say that the statement is that you say that those biases are valid you say that it's valid that people are treated differently based on their skin color from police or that women earn less than men this is something that you're validating if you just say well that's what the 
underlying statistics of myself we'll say so that's what I'm gonna do and so I've definitely had those conversations numerous times and then I've also had conversations with people like how this this is really cool this makes sense like it's so nice to know that there's quite a lot of different performance metrics you can use to analyze the ethics or the treatment of different groups from your model and I think that there's also like new energy specifically around fad ml and everything that's happening in academia around finding real ways to build ethical models that don't necessarily sacrifice much performance at all yeah and I think something you mentioned there is that some people have responded it's not my job to to think about these things one thing that data science doesn't have as a profession yet is standards of practice codes of conduct necessarily and if we think back what's happened historically in other professions you know in ancient Greece the Hippocratic oath was developed to deal with these types of things for people practicing medicine right yeah and I think that you know if you're building some system that maybe controls and some IOT factory device where no humans are affected at all by what you're doing or if you're making some sort of academic model yeah okay maybe your impact is very small right but when we're building these systems that interact with humans and now quite a lot that interact directly with like we would say the consumer right or a person and affects maybe what the person sees what they click on what they think about what price they pay and then of course the massive systems like finance and justice and so forth this is like the impacts we have a growing footprint of the things that data science touches and affects and because of this I think that we need to start thinking about if we don't have a Hippocratic oath like what do we have right I do think there are so many more increasingly more and more such examples emerging I think one of the ones that I've mentioned a few times on on the podcast is judges using the output of blackbox model that tells recidivism rate for for incarcerated people using the output of that model as input for the parole hearing right actually Cathy O'Neal's book weapons of mass destruction which I recommend everyone who wants to think about these these things and actually probably recommend it for more more to people who don't want to think about these things to check out that book yeah there's a new one also called automating inequality which is quite good so I can recommend that one as well yeah we'll link to those in the show notes we'll jump right back into our interview with Catherine after a short segment let's now jump into a segment called rich famous and popular with Greg Wilson who wrangles instructor training here at data Camp hey Greg today what do you have for us today Greg well I'd like to start with a story I was on the streetcar a couple of weeks ago headed downtown here in Toronto and sitting beside me there was a woman wearing a hospital staff badge typing away on her laptop I wasn't really paying attention but I suddenly realized that she was typing up a psych evaluation for one of her patients I could see the person's name their address and a couple of paragraphs of very personal information and when I looked up I realized that a couple of other people were reading over the doctors shoulder as well wow that's awful what did you do I noticed her and said maybe you shouldn't be working on that here she got 
pretty upset slammed the lid of her laptop closed and told me that I shouldn't have been reading what was on her screen now she was right I shouldn't but on the other hand I don't think she should have been doing that work in a public place where passers-by could pick up the most private things of other patients imaginable so what does this have to do with data science well there's been a lot of discussion recently about ethics in data science about the ways that our work can be misused either deliberately or unthinkingly and about what our responsibilities are as data scientists to make sure bad things don't happen most of the discussion has been about the big stuff but I'm starting to think that there are a lot of little things we can and should do to keep ourselves and our data safe and that we ought to start with them can you give me an example sure violet-blue wrote a book a couple of three years ago called the smart girls guide to privacy its aim is to teach teenage girls what they can do to keep their private lives private without going offline entirely simple things like using two-factor authentication for their accounts checking the permissions on their Facebook periodically to make sure the latest updates haven't put things out in the open that shouldn't be and so on the Electronic Frontier Foundation also has some great materials on surveillance self-defense and Patriot have made their security training materials for staff available to the public this sounds like hacking 101 well not really none of this stuff is aimed at programmers it's all the equivalent of don't sneeze on people or use a condom or clean out a scrape with peroxide and I think that if data scientists use a little bit of data hygiene in their own lives they'll be more likely to practice safe data at work cope so why should people stop violet blues book is a quick read and the e FF and Pedro Duty stuff is even shorter I'd really like there to be a data camp course on this you know something like personal data safety for data scientists but our platform is really aimed at people who want to write and run code and you know if we can check or tweak the settings on your phone from data camp comm something's probably gone wrong somewhere I think a webinar or two that walk people through the basics without fear mongering would be easy to set up and if anyone listening is interested I'm Greg get data camp comm and I'd enjoy hearing from you thanks very much Greg if anyone in the audience is interested in this please get in touch we'd love to hear from you thanks Greg and look forward to speaking with you again thanks Hugo time to get straight back into our chat with Catherine I'd love to hear hear your take on on gdpr and we've kind of moved around but I'd love to know exactly what it is and what it means for civilians for users to the start off it yeah so it means that you have a lot more rights than ever before if you're a european resident definitely if you're another person then at least perhaps some of these rights kind of essentially it's like trickle-down economics of rights in that I hope that you you know have taken some time some of the cool things about GDP are that you may or may not know about is you have the right to delete your data so you have the right to request that a company delete all of your data for data science of course we're starting to think about this and be like what does this mean what does it mean for my models what does it mean for my training sets and so forth so this is definitely 
something to start thinking about and discussing with your team how do we care processes that adequately delete or remove a user's data there's another right to know how your data is used in how your data is processed and also to opt out of that processing if you want to so this is again something we need to think about as data scientists how we build our pipelines how we treat data and how we allow people to opt in and out of probably certain tasks and jobs that we run on data sets over time so you can think of this almost as like a nice flag in a database or something that you store in a separate queryable database that allows you to say ok this person has opted in or out of processing and then there's also right one of my favorite ones is actually the right of data portability and this is the ability to port your data from your current service whatever it might be to another service and the idea is that the data has to be transmitted in a machine-readable way so this is also this idea that you have your data perhaps for some app that you use you would like to try to use a new and different app and you want to make a request to port your data to that different app so this again for data science means that you need to create outbound and inbound workflows or streams or something like this that allows people to transmit their data and I think that this the data portability is a real boon also to startups in general because it's this idea that you know it's kind of like phone number portability right it used to be that once you had a phone number and everybody knew it you were stuck with your service provider until you really wanted to take the big jump and tell everybody you have a new phone number I think with with data we've seen these entrenched like leaders of data science and data collection essentially they've been there for you know now decades essentially and they've had the advantage of the data that they sit upon and with data portability this will hopefully start to shake some things up and create some more competition because the idea that I can take my data with me and move it to another provider is pretty powerful I think and also something that I think is a long time coming yeah me too and I think this is definitely a step in the right direction I want to pick your brain in a minute about whether you think this is enough for what next steps would would look like but there's a term that but you and I and the GDP are and everything around it users constantly and the term is your data now when I use Facebook or I use Twitter I use whatever what is it what is my mine what do I own in that in that relationship and in that so yeah yeah this is actually a subject of scholarly debate I would say right now and we're gonna have to wait and see exactly how the regulator's put this into effect now I'm not a lawyer by any means but from some of the law articles I've read around this the intention of the working group that created that article was that it not simply be just the signup form so their working group notes specifically state that this should be any interactions and data that the user provides the company and so we can think of this is like well maybe that's even goes down to your click stream of data maybe I'd even goes down to every post you have viewed probably it will be enforced like that but we need to think about how we when we're collecting all of this extra data where we're collecting like m tracking users what does this mean in terms of the users that have said no 
please I don't want to be a part of this and how can we respect things like do-not-track and how can we you know make very clear and evident what we are using data for and have that have the user want to opt into that okay if you provide me this not that I'm gonna give you more targeted ads but I'm gonna be able to offer you this extra feature or something like this so I think you know makes us start thinking about data not just as something that we can you know use however we wish without really asking about it and ask for every single permission on the phone or track people across devices and all these things like that maybe we should ask first and maybe we should think what data we actually really need and provide a compelling product that means that people want us to use their data and could this kind of force a bunch of companies to change business models in essence but because I suppose the you know there's the old trope if you're not paying for the product you are the product right so you've literally got companies that are trying to take as much as possible because of just the value of or let's say the assumed value of the data yeah this is this like also assumed value right so like one common thing I hear when I go to data conferences and I'm hanging out in the data science tracks or or so forth is I hear people say like oh yeah just just collect all the data and we'll just save it in case we need it and some companies have been doing this for decades they essentially have you know data from the early thousands and so forth on users still and it's you're sitting there wondering like when are you gonna actually use this data on how much of this data do you need now I think like for somebody that does retargeting or something like this this is of course this is the bread and butter but for the average website or the average you know app how much do you think that people would be willing to pay to not be tracked to not be targeted and maybe you should start offering similar to some of the products that were launched last week and targeted free or an advertising free experience and I'm hoping that the consumer consumer models also start to change around this of course I have no idea what this will mean in the market in you know five years from now or 10 years from now particularly because most of the offerings so far that have been targeting free or ad free are primarily targeted at you residents and even sometimes not available to us residents so we've seen and heard what the gdpr will look like and what it will mean for civilians what about on the other side of the equation what does it mean for organizations and for working data scientists yes so it means a lot more documentation and a lot more understanding and sharing of exactly how data is processed how it's where it comes from right so this idea of tracking data provenance and what it is used for and I think this is fantastic because if I have been I don't think alone but kind of feeling like I'm sitting here screaming into the void about like documentation testing like version control like normal software practices for data science and I think that this is finally you know the moment where clearing the technical debt debt that a lot of data science teams have accumulated over time of not versioning their models or not having reproducible systems not having deterministic trainings and so forth that this will hopefully kind of like be a turning point where we can get rid of some of this technical debt we can properly document 
systems that were you can have of course everything under version control and automated testing and all of this is going to benefit you because when you document all of this and you share it you're essentially fulfilling quite a lot of your duties within GD P R which is this ability a for people to opt out of that processing so having a process that allows data to be marked as opt out and then also documenting exactly what processing is used who are downstream consumers of that data and where does the data originate from and under what consent was it given so I think that this covers quite a lot of what GD P our requirements are for data scientists the only thing really that's left out is deletion of old data or anonymization of old data which i think is going to spark hopefully a conversation around how do we expire data or how do we treat data that is old or from previous consent contracts or was purchased and we're not sure exactly how it was collected and under what circumstances and I think that this idea if you're in doubt if you don't know where the data comes from if you've gone through and you've documented all your systems and nobody has any recollection where a particular data set or series of data comes from then you should either delete it or you should go through methods to anonymize it if it's personal data at all and I think that this is essentially a spring cleaning for data science both in terms of our processing and our data sets so I want to come back to this idea of data anonymization first I'd like to know what's been the general response to the gdpr from organizations of course like so i'm mason germany and so of course the opinion here is that you know this from a consumer standpoint and I think from the media standpoint has been very much that this is like a good step from the businesses here I think there's you know that I has been costly both here and I think everywhere to enforce to bring yourself in within compliance before the due date now I must remind everyone that everybody had two years to prepare for this so I mean it was not a surprise that who is going into effect but I think yeah I think for for a lot of folks unfortunately this has been costly I'm hoping that the standards that have now been put in place we're not a rush job and if perhaps have created better processing that actually allows for this type of compliance in the long term right and I think that there's also been a boon within Germany in Europe of startups thinking about these problems and starting to offer things for example like myself andand Reyes with ki protect starting to think about like what is GE PR mean in the long run so in the long term how do we guarantee better security and privacy and make this just a commonplace thing not a compliance thing and am i right in thinking that this doesn't only apply to data from people in the EU but to data from users that is processed in the EU I don't know all of the specifics around this but there's also you know it is I can say a winner of gdpr is European Data Centers and this is because there's a provision within there that talks about moving data outside of the EU so if data originates from the EU and you want to go process it outside of the EU you need to explicitly tell people and they need to opt in saying that it's okay for their data to be processed outside of the EU from what I understand and so there has definitely been like a little bit of a pickup and in the data center action here and of course quite a lot of the large company 
is that process most of their data let's say in AWS and so forth this means that like finally I have some instances in AWS Frankfurt it was always hard to get like the GPUs and and other things available and now we're starting to see some parody which is nothing but yeah I think you know this is something to think about is when we're moving data all around and we're moving it to different locations and in the cloud and so forth these are real computers in real data centers somewhere and this means that we also need to think about yeah what implications that has a within the security of our data but also be within compliance yeah and I think I was reading a number of tech companies that a processing guy don't have offices in Ireland for tax reasons among among other things they might be moving their data processing out in order to perhaps not have to comply with gdpr fit for the time being yeah that that makes sense and yeah of course in Dublin there's like some very large offices for Apple and Amazon as I understand it in Google and and so forth and so that's usually like the the EU based tech hub essentially for the large corporations and yeah I think this is probably changing the dynamics there and perhaps also changing the dynamics for a lot of the data processing that happens in Luxembourg as well and what happens if companies don't comply so it's the fine is so I think the process goes something like this you're contacted you're supposed to have a data protection officer that's essentially kind of the named person to handle any types of requests and compliance issues I think that first you get some sort of warning and they ask you to become compliant right you have some short period to respond to that and if not then you can get a fine of 20 million euros or four percent of global revenues so it's not it's not a small fine that's the very it's meant to hurt and for this reason a lot of people have been wondering like well will they go after small companies that you know small businesses where this might essentially bankrupt them and this is of course we will wait and see how the regulator's plan on enforcing us and this Data Protection Officer or de Pio I think they're also responsible for if there are any data breaches right in informing the people affected and whatever the governing body is within even 72 hours or something like that and personally not just for a press like a press statement yeah yeah so there needs to be information some out to potentially any affected users as well as of course to their regulation authorities for any data breaches and this can also I believe that this also covers data processor breaches so this is where it comes into effect where if let's say you're reselling data or you're moving data to partners and your partner has a breach then this is also your responsibility to essentially like they should inform you right and then you need to inform the end users and this is hopefully you know avoid some instances like Equifax and so forth in terms of you can't just sit on the fact that there's a security breach for two three months and sell your stocks or you want to do and eventually like oh yeah yeah you may be you may be a victim of identity theft or something like that we'll jump right back into our interview with Katherine after a short segment it's now time for a segment called data science pitfalls as we're talking about data security data privacy and the GDP are today I want to jump in and talk about the problems of data security and privacy 
inherent in the burgeoning space of machine learning api's such as AT&T speech google prediction and IBM watson to name a few we need to talk about certain aspects of machine learning and the ways in which it can endanger data security and privacy so let's say that you have a machine learning model that is exposed via a public API whoa hold on there and let's explain those terms recall that a machine learning model can be trained on some training data to then make predictions on some new data for example we could give a machine learning model images of people labeled with their names this would be the training data and would train the model to recognize people from unlabeled images with certain degrees of confidence now we are seeing more and more such models being exposed via public API s what this means is that although I don't have the original training data or all the details of the model I can send an image to the model and get back the prediction this seems fine right well it isn't because there are increasingly more and more sophisticated extraction attacks which with no prior knowledge of a machine learning models parameters or training data are able to duplicate the functionality of or steal the model and steal the training data now this is especially the case if the API exposes confidence scores even in a multi class setting now stealing models and training data can be a huge problem particularly when they're trained on sensitive training data have commercial value or are used in security applications will include some references in the show notes including a paper called stealing machine learning models via prediction API s by a team of computer scientists at Cornell Tech the Swiss Institute EPFL in Lausanne and the University of North Carolina this is sure to be an area of data science and artificial intelligence to keep your eye on after that interlude it's time to jump back into our chat with Cashman Jambo so can you give me the rundown of what data privacy looks like currently just the current landscape of how everyone seems to think about it yeah so currently data privacy as we've been thinking about this @ki protect we've been of course investigating kind of where people are coming at this from a variety of markets and so forth so I think currently I see data privacy either as pay-to-play essentially so it's often an add-on that you can buy for large scale enterprise based systems where you say I okay I have all this other things and I'd like you to implement this privacy layer here this is for a lot of the enterprise databases and so forth something that they've been working on for some time which is great I think that that's fantastic that that's available and other than that it's primarily focused you know on on compliance only solutions so this is this idea of like you have HIPAA or you have financial compliance regulations and so forth and these are focused around okay we are a database or we are data processor that only focuses making sure that your Hospital data or your bank data or something is treated in a compliant way and this is mainly for these like data storage database solutions which again is fantastic but what does it mean if you actually want to use your own database and then you would like to use the data in a compliant way and I think that there's been some interesting startups within the space that are trying to cook perhaps like allow you to use a special connector to your database that does something similar to differential privacy not quite 
differential privacy because this is of course very difficult but similar to differential privacy or that employees k and anima T or that employees something else like this so there's a few companies in the space of essentially trying to be the query layer and then using your data sources below that and providing some sort of guarantees whether it be differential privacy for without necessarily like a long term privacy budget or whether it be ke anonymity or whether it be pseudonym is sudo immunization I can't either so it's a lot of these like you know kind of extra add-ons and other than that I think most of the privacy conversation has been really led by academia and Cynthia dorks research on differential privacy and its implications also within machine learning as well as some of the great research that Nicolas PAP or not and some of the Google brain security researchers have been working on this have been I think amazing contributions but research right perhaps implemented at Google or with Cynthia Rourke's work with Microsoft and so forth but as far as available to you know data scientists at my own startup or something this has really not been available in a in a real way right and you've mentioned or we've discussed a variety of techniques such as anonymization pseudonym ization she says she Dhanam ization there we go differential privacy ke and anonymity and we'll link to a bunch of references in in the show notes that people can check out with respect to their nuts and bolts of these but my real question is can we really anonymize data yeah of course like of much debate right the gold standard is of course differential privacy and the idea of differential privacy is that it's a fairly simple equation when you actually read it and this idea that I would not know that you yourself as an individual were a part of any data set based on the queries or the data that I see from that data set that there would be within a very small epsilon the ability to determine the probability that you are a part of that data set or not and this is of course really elegant theory and I highly recommend reading dwarfs work on but in terms of actually implementing it in the way that we use data science today most of the ways that we guarantee differential privacy is using what is often referred to as a privacy budget and this budget essentially tracks how much information we can think of it almost as like information game theory right how much information about any individual was essentially gained by the other person via the query or via the data that they accessed and once the privacy budget really reaches a certain level then we say okay then there can be no more queries that might reveal more information about this individual and so this is difficult because a in practice we often have changing datasets so the data set that I can guarantee privacy on today and the data set I can guarantee privacy on tomorrow this is like an ever-changing right we're like gathering more data as time goes by we might have more information that we garner and connect about a particular individual and the more that we do this the you know of course like the less that we can guarantee privacy and the second thing is is that to keep the privacy budget let's say like indeterminately this would mean that eventually our data would not be able to be utilized right because we would eventually hit the limits of our privacy budget and unless that privacy budget is reset for some reason then that person or that analyst or that 
data scientist cannot query any information that might be related to that individual right and so what we see in differential privacy that's been implemented for example by the Apple differential privacy team or with some of the works that Google has been doing this is you normally a privacy budget within a limited time period so resetting every day or resetting every few days or something like this when we think about anonymized data within in any organization and in particular from you know civilians who are who are users of these products I think one really important question is how do we know about how our data is being used as users my question for you is how technical and how educated civilians and users on the ground need to be to understand what's happening with that data yeah I mean this is this is interesting and something that Andres and I have been thinking about kind of like doing a series of articles and and so forth that kind of explain how privacy works and how de anonymization really works at a large scale because I think the average data scientists you know they've heard about the Netflix prize that you know about the New York City Taxi data in the sense that with an informed adversary or an adversary with access to potentially some open data this is quite easy to do naanum eyes when we're dealing with large-scale data sets but I mean if I were to ask my mom let's say or my sister hey do you know like if you upload that extra thing to Facebook and then any of your Facebook data leaks do you know that the ability for somebody to D anonymizing you is you know essentially like guaranteed I think you know maybe not I don't think we're there in terms of the public conversation but I do think that breaches like I forget the name but the the running application that recently had I data they released an open dataset Strava I think it was called and their open data set essentially leaked information about so-called private US military bases or secret US military bases across the world the fitness tracker which most the consider syns and then you could see in key locations in Middle Eastern African nations yeah yeah yeah see the military compounds on the map because it would yeah and this is what happens when we aggregate data and this is especially a danger for people that are like releasing public data right or you can even think of it you're selling data or sending it to a partner or something that when we aggregate this data and even if we say we so called anonymized it then data in aggregate can also release secrets so this may not be secrets about an individual anymore but this may be some sort of secret about the group that uses your application right so it saddens me to say this but we're coming to to the end of this this conversation and something I mentioned at the start is that gdpr I think as as you said as well is very necessary and untimely my question for you is is it enough and or what do we need to be doing or what would you like to see in the future to make make further steps in this direction yeah so GP are by no means guarantees anonymization and this I think might be something that we should really push for within the data science and machine learning community is how can we solve this very difficult problem or how can we at least make some inroads to this problem so that you know when there's a security breach or when there's some issue or when you know somebody gets their laptop stole and oops they had a bunch of customer data or other sensitive data on it 
when these things happen maybe we can stop them at the source right maybe we don't necessarily need to always use complete personal data to build a model maybe we can start thinking about how to privatize our data in a way before we start the data science process and again that's this is definitely something we're thinking about working on at ki protect but this is something that I really hope overall as a field we can push forward and it has some interesting implications as well with ethics so there's a great paper again primary author or Cynthia twerk comparing this idea of differential privacy to also the same basis of ethics in a sense that if you do not know my race or if you do not know my gender or my age or something like this you have the potential to build a fairer model right and so I think that these have interesting overlaps and implications for our industry and I really hope that we start to think about them as a wholesale solution not just as a compliance only means I have to do this much so I this is what I'm hopeful for it's something that I look forward to seeing more in research and also chatting more with my peers and so forth yeah I like that because it sounds like in a certain way mindful data science in the sense that you just don't take all the data you have and throw a model at it and see see what comes out right yeah you you think about the implications you know of any other data that you share that you expose both to your team internally into any anyone externally that you think about you know essentially what I want somebody to do this with my data so the golden rule of data science yeah great so do you have a final call to action for all our listeners out there yeah sure hey you can check out all of the work that we're working on and we're looking for feedback still with ki protect so if you want to reach out we're at ki protect comm and then also just if you're working within the space if you're thinking about these problems keep keep at it you're not alone and also feel free to reach out that I think that we need to kind of create a really vocal community within data science that these are important these are essential and that this is not only for researchers although I'm really a big fan of what the research community has been doing but this is also something that practitioners care about and that we want to be able to implement what we're seeing in research and the advances that we're seeing in terms of potentially guaranteeing privacy preserving machine learning we want to see this like within the greater community and within the tools and open-source projects that we love and use Katherine it's been such a pleasure having you on the show thanks so much huger I really appreciate it thanks for joining our conversation with Kathryn about data security data privacy and the GD P R we saw that the GD P R gives consumers more rights and knowledge as to how their data is used such as rights to delete your data to know how your data is used and processed and to port your data to other services we also saw that for organizations and working data scientists although there is significant work to be done it will result in paying off a whole lot of already incurred technical debt in terms of developing better documentation and building better systems for tracking and managing data with a focus on user centered design we also saw once again the growing importance of data scientists thinking about the implications of their work and building ethical models and the 
importance of both privacy and consent by design do we really need to incorporate a user zip code into a model do we really need their last thousand purchases and why do I need access to their personally identifiable information does anyone really outside of perhaps customer service make sure to check out our next episode a conversation with Jonathan Nola's about organizing data science teams and the do's and don'ts of managing them whether dive into best practices for hiring data scientists Jonathan is a data science leader in the Seattle area with over a decade of experience he is currently running a consulting firm helping fortune 500 companies with data science machine learning and artificial intelligence we also tackle questions such as what is a more important skill for a data scientist the ability to use the most sophisticated deep learning models or being able to make good PowerPoint slides the answer may surprise you but then again it may not there's only one way to find out I'm your host Hugo Bound Anderson you can follow me on Twitter as you go down and data camp at data camp you can find all our episodes and show notes at data camp comm slash community slash podcastin this episode of data framed a data cap podcast I'll be speaking with Kathryn Sharma a data scientist consultant educator and co-founder of ki protect a company that provides real-time protection for your data infrastructure data science and artificial intelligence Kathryn and I spoke on the 25th of May 2018 about data privacy data security and the GD P R or general data protection regulation which went into effect the week prior what are the biggest challenges currently facing data security and privacy what does the GD P R mean for civilians working data scientists and businesses around the world is data anonymization actually possible or a pipe dream stick around to find out I'm Hugo Bound Anderson a data scientist the data gap and this is data framed welcome to data frame a weekly data count podcast exploring what data science looks like on the ground for working data scientists and what problem is it consult I'm your host Hugo Bound Anderson you can follow me on Twitter as you go down and data camp at data count you can find all our episodes and show notes at data camp comm slash community slash podcast hi there Kathryn and welcome to data frames hey thanks for having me it's such a pleasure to have you on the show so before we dive in to everything we're talking about I just want to let you know I've been receiving far too many emails the past couple of weeks I know do you have any indication why yeah yeah emails that ask you asking for consent this is consensual email at its best it's emails asking you if you want to receive more emails you know and why is this happening what's been happening recently that means I'm getting my inbox is really full yeah so probably if you deal with data and you've already heard of gdpr or the general data protection regulation but it went into effect on May 25th and everybody got a lot of emails just talking about privacy it was really fantastic you know it felt like you know finally consensual data collection we're having lots of conversations but yeah I think most people just kind of deleted them all which is for better or worse maybe what what they all expected when it was sent as a mad rush on the final day for sure and it seems like a lot of its opt-in as well isn't it yeah so the idea behind gdpr is that it's essentially as I said like this consensual driven so consent 
Katharine: Yeah, the idea behind GDPR is that it's essentially consent-driven, in the sense that you have a right to say: okay, I'm fine with you using my data in these ways, or collecting my data in these ways, or reaching out to me in these ways. I think that's a really great step. Consent for everything is really cool, and consensual data collection is something we're hopefully starting to realize sits right at the center of doing ethical data science, so GDPR is a nice step towards that. It gives a lot of rights to European residents, and I wonder how it will also affect data collection for the rest of the world. People have different approaches: some are essentially creating EU-only versions of their platform, for example the USA Today EU site, and a few other papers and publications have published EU-only packages. What I'm hoping is that it also allows for a bit more consensual data collection for people outside the EU.

Hugo: I think this provides a really nice teaser of the conversation we'll be getting into with respect to data security, privacy, the GDPR, whether it's enough, and what we might see outside the EU. But before all of that, I want to find out a bit about you. Maybe you could start by telling us how you got into data science initially.

Katharine: It was definitely by accident, as for a lot of people in my era of computing. I was a data journalist working at The Washington Post. I had some history and background in computer science and statistics, but I didn't enter the field directly after school and found myself in data journalism. After that I got recruited to work at a few startups doing larger-scale data collection and data engineering, back in the initial Hadoop days. From there I went into some ops and security roles, automating deployment and leading teams on those types of things, and then fell back into data, doing some data wrangling after my book, Data Wrangling with Python, was published with Jacqueline Kazil. Since then I've been focused more on natural language processing and machine learning, and lately I've been thinking a lot about data privacy and data security. I guess after ten years in this business you start to think about the intersections of the things you care about, and for me that's definitely an important intersection that ties together a lot of the passions and experience I have in data science as a whole.

Hugo: And where has that led you? What are you working on currently?

Katharine: I'm currently building a new startup called KI Protect. It's KI after künstliche Intelligenz, which is essentially the German translation of AI. Our idea, our goal, our solution really, is to bring about a data-science-compatible data security layer. From my experience, and from our experience, Andreas Dewes and I have seen that the security community and the data science community are not overlapping in meaningful ways right now, and we're trying to think about how we can bring more data security and data privacy concepts to the data science community in a way that makes them really easy to use, consumer-friendly in the sense of being easily integrable into the systems you might already use to process your data, like Apache Kafka and Spark, so that you don't have to have data privacy or data security as a core concept of your data science team: you can just do normal data science and use our service to help you enforce privacy and security.
Hugo: Great. So it sounds like you're essentially trying to help people keep doing what they're doing, and you'll fill in this particular gap for them with respect to data security and data privacy.

Katharine: Yeah, the goal is to be the plug-in for data security or data privacy. Of course this is a delicate and complex topic, so we're exploring what we can guarantee and what integrations make sense for different types of companies. We don't have a full product spread available yet, but this is something we're actively experimenting with, researching, and working on, and we're fairly confident we can come up with a few different methods that let you use simple APIs for pseudonymization and anonymization of your data sets.
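To make the idea of a pseudonymization API a little more concrete, here is a minimal sketch in Python of keyed tokenization of a direct identifier. The key, field names, and values are all hypothetical, and this is only an illustration of the general technique, not a description of how KI Protect's service works.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-key-from-your-secret-store"  # hypothetical key

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a stable, keyed token.

    HMAC-SHA256 produces the same token for the same input, so joins and
    group-bys still work, but without the key the original value cannot be
    recovered from the token. Note this is pseudonymization, not
    anonymization: whoever holds the key (or the raw data) can re-identify.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# hypothetical record with one personally identifiable field
record = {"email": "jane.doe@example.com", "purchase_total": 42.0}
safe_record = {**record, "email": pseudonymize(record["email"])}
print(safe_record)
```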
Hugo: Awesome. I want to get into data security and data privacy now, but before that: you mentioned you love NLP, and I just want to let everyone know that you've also got a great DataCamp course on the fundamentals of NLP in Python, which I had the great pleasure of working on with you.

Katharine: Yeah, it was super fun, and I loved all the great feedback from folks. If you're starting to get into natural language processing, or you're curious what it's all about, I can definitely recommend taking that, as well as the follow-up courses, which allow for some fun experimentation with some of the most common and best libraries in natural language processing.

Hugo: Exactly. So let's jump in: what are the biggest challenges currently facing data security and data privacy, in your mind?

Katharine: One thing I've noticed over time is that the core competency of most data scientists is not focused on security and privacy. We're starting to see more overlap, for example with the Apple differential privacy team and the Google Brain research on security and machine learning, but the average person who has studied statistics or machine learning and is doing this in the field doesn't necessarily have a strong background in computer security, data security, or InfoSec, as we might call it. I don't see this as a fault of theirs; it's nothing lacking, they have a lot of their own specialized training. But the unfortunate consequence is that a lot of the ways we manage and handle data are not the most secure, and they definitely don't always take into account ideas of privacy, or even user sensitivity in the sense of: do I actually need access to full user data? As data scientists we have access to potentially millions of people's personal data, their messages, their emails or chats, their purchase history, and my question is: do we actually need access to all of that to do our job properly? I think this is a big oversight in how we've built up the data management, data science, and BI platforms we use today.

Hugo: You spoke to the lack of focus and knowledge with respect to data security. Do you think this is related to the lack of focus on building ethical systems in general for data science?

Katharine: Yeah. One of the conversations I've found myself having recently, in light of the GDPR, is that it's been really painful for people to implement consensual data collection in their data science. Why is that? It's because the software is not designed with the user in mind. The software may be designed with the end user, the internal team, in mind, but it is often not designed with the actual customer or client in mind. If we had software that was slightly more driven by the client's desires or demands, which touches on design thinking, then it would be built to mark when users consented, what the provenance of the data is, how the data was processed, and so on. The fact that data provenance has been more an aspect of research than of every data collection system you can imagine is really problematic, because we have accumulated all this data, and for some larger corporations, sometimes they have purchased data or aggregated data from data marketplaces. That means they now hold all of this data, some of which was given directly and consensually, and some of which was acquired through purchasing power or by buying another company. This is, of course, a nightmare when it comes to GDPR, because you have to figure out and sort what data was given by whom and under what circumstances. But why might we have this problem in the first place? Why can't we have, say, data marketplaces where consumers directly sell their data, if they're going to do that? And why aren't data provenance, where the data came from, when it expires, how long it's good for, a normal part of how we do data management from the beginning?

Hugo: I'm interested in how the average data scientist, if that's even a well-formed question, responds to this type of legislation being passed. And if you can't speak to the average, maybe you could give a variety of responses, paradigms of how the community is responding.

Katharine: I have a feeling that people are inherently good and want to build ethical systems; that's the viewpoint I'm coming from. I think a lot of people are saying, okay, this is painful, but I want to be able to do the right thing, I want to be able to do ethical data science. What does this mean? How might I have to change the way I currently process data? So it's sparking a lot of conversations along the lines of: perhaps in the past we haven't done this very well; how might we start again, or do better in the future? But I do think there are some people who just see it as a nuisance, and there's been this big rash of software and platform vendors simply saying, well, we're not going to sell to EU residents anymore. I find that terrifying, because why would I want to use a service that can't guarantee it will ask me before it uses my data? I would argue there's a big divide between those who see privacy as a burden and those who see privacy as something we can strive for, something we need to think about, perhaps changing the systems and processes we use in the meantime.
Hugo: And how do you think data scientists generally feel about the idea of sacrificing some model performance for having more ethical models?

Katharine: I think this is difficult. I've spoken on the topic of ethical machine learning a few times now, and a few times the reaction was very negative; people said, well, I don't really see why this is my problem. Unfortunately there is some of that attitude: well, if Black folks are treated differently by cops, why should I have to change the distribution of my dataset to compensate for that? They say the data is there, that's what the data says, so I'm just going to build exactly what the data says. I would say that that is a choice and an action in and of itself, and if you make that choice, you're essentially automating inequality, you're automating societal biases. When you choose to do that, you're making a statement, and the statement is that those biases are valid: that it's valid that people are treated differently by police based on their skin color, or that women earn less than men. That's what you're validating if you just say, well, that's what the underlying statistics of my dataset say, so that's what I'm going to do. I've had those conversations numerous times, and then I've also had conversations with people who say, this is really cool, this makes sense, it's so nice to know there are quite a lot of different performance metrics you can use to analyze the ethics of your model and its treatment of different groups. And I think there's new energy, specifically around FAT ML and everything happening in academia, around finding real ways to build ethical models that don't necessarily sacrifice much performance at all.

Hugo: Something you mentioned there is that some people have responded, it's not my job to think about these things. One thing data science doesn't have as a profession yet is standards of practice or codes of conduct. If we think about what's happened historically in other professions: in ancient Greece, the Hippocratic oath was developed to deal with exactly these kinds of questions for people practicing medicine.

Katharine: Yeah. If you're building a system that controls some IoT factory device where no humans are affected at all by what you're doing, or some purely academic model, okay, maybe your impact is very small. But we're building systems that interact with humans, and now quite a lot that interact directly with the consumer, with a person, and affect what that person sees, what they click on, what they think about, what price they pay, and then of course the massive systems like finance and justice. The footprint of the things data science touches and affects is growing, and because of this I think we need to ask: if we don't have a Hippocratic oath, what do we have?

Hugo: There are increasingly many such examples emerging. One I've mentioned a few times on the podcast is judges using the output of a black-box model that estimates recidivism risk for incarcerated people as input for parole hearings. Cathy O'Neil's book Weapons of Math Destruction covers this; I recommend it to everyone who wants to think about these things, and probably even more to people who don't want to think about them.
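As a rough illustration of the group-wise performance metrics Katharine mentioned a moment ago, the sketch below compares selection rate and true positive rate across groups for a binary classifier; large gaps point at demographic-parity or equal-opportunity problems. The column names and data are hypothetical, and this is only one of many ways such metrics are computed in practice.

```python
import pandas as pd

# hypothetical scored data set: `group` is the protected attribute,
# `y` the true outcome, `y_hat` the model's decision
scores = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "y":     [1,   0,   1,   1,   0,   1],
    "y_hat": [1,   0,   1,   0,   0,   1],
})

# selection rate per group: how often the model says "yes" to each group
selection_rate = scores.groupby("group")["y_hat"].mean()

# true positive rate per group: of the people who really were positive,
# how many the model caught in each group (equal opportunity)
positives = scores[scores["y"] == 1]
true_positive_rate = positives.groupby("group")["y_hat"].mean()

report = pd.DataFrame({"selection_rate": selection_rate,
                       "true_positive_rate": true_positive_rate})
print(report)
# large gaps between groups in either column are a signal to dig further
```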
Katharine: Yeah, and there's a new one called Automating Inequality which is quite good, so I can recommend that one as well.

Hugo: We'll link to both of those in the show notes. We'll jump right back into our interview with Katharine after a short segment.

Let's now jump into a segment called Rich, Famous and Popular, with Greg Wilson, who wrangles instructor training here at DataCamp. Hey Greg, what do you have for us today?

Greg: Well, I'd like to start with a story. I was on the streetcar a couple of weeks ago, headed downtown here in Toronto, and sitting beside me was a woman wearing a hospital staff badge, typing away on her laptop. I wasn't really paying attention, but I suddenly realized she was typing up a psych evaluation for one of her patients. I could see the person's name, their address, and a couple of paragraphs of very personal information, and when I looked up I realized a couple of other people were reading over the doctor's shoulder as well.

Hugo: Wow, that's awful. What did you do?

Greg: I said, maybe you shouldn't be working on that here. She got pretty upset, slammed the lid of her laptop closed, and told me I shouldn't have been reading what was on her screen. She was right, I shouldn't have. But on the other hand, I don't think she should have been doing that work in a public place where passers-by could pick up the most private things about her patients imaginable.

Hugo: So what does this have to do with data science?

Greg: There's been a lot of discussion recently about ethics in data science, about the ways our work can be misused, either deliberately or unthinkingly, and about what our responsibilities are as data scientists to make sure bad things don't happen. Most of the discussion has been about the big stuff, but I'm starting to think there are a lot of little things we can and should do to keep ourselves and our data safe, and that we ought to start with them.

Hugo: Can you give me an example?

Greg: Sure. Violet Blue wrote a book a couple of years ago called The Smart Girl's Guide to Privacy. Its aim is to teach teenage girls what they can do to keep their private lives private without going offline entirely: simple things like using two-factor authentication on their accounts, periodically checking the permissions on their Facebook to make sure the latest updates haven't put things out in the open that shouldn't be, and so on. The Electronic Frontier Foundation also has some great materials on surveillance self-defense, and PagerDuty have made their staff security training materials available to the public.

Hugo: This sounds like hacking 101.

Greg: Well, not really. None of this stuff is aimed at programmers. It's all the equivalent of don't sneeze on people, or use a condom, or clean out a scrape with peroxide. And I think that if data scientists practice a little bit of data hygiene in their own lives, they'll be more likely to practice safe data at work.

Hugo: So where should people start?

Greg: Violet Blue's book is a quick read, and the EFF and PagerDuty material is even shorter. I'd really like there to be a DataCamp course on this, something like personal data safety for data scientists, but our platform is really aimed at people who want to write and run code, and if we can check or tweak the settings on your phone from datacamp.com, something's probably gone wrong somewhere. I think a webinar or two that walks people through the basics without fear-mongering would be easy to set up, and if anyone listening is interested, I'm greg at datacamp.com and I'd enjoy hearing from you.
Hugo: Thanks very much, Greg. If anyone in the audience is interested in this, please get in touch, we'd love to hear from you. Thanks, Greg, and I look forward to speaking with you again.

Greg: Thanks, Hugo.

Hugo: Time to get straight back into our chat with Katharine. I'd love to hear your take on GDPR. We've kind of moved around it, but I'd love to know exactly what it is and what it means for civilians, for users, to start off.

Katharine: It means you have a lot more rights than ever before if you're a European resident, and if you're not, then at least some of these rights may trickle down to you as well. Some of the cool things about GDPR that you may or may not know about: you have the right to delete your data, so you have the right to request that a company delete all of your data. For data science, of course, we're starting to think about what this means: what does it mean for my models, for my training sets, and so forth? This is definitely something to start discussing with your team: how do we create processes that adequately delete or remove a user's data? There's another right to know how your data is used and how it is processed, and also to opt out of that processing if you want to. Again, as data scientists, we need to think about how we build our pipelines, how we treat data, and how we allow people to opt in and out of certain tasks and jobs we run on data sets over time. You can think of this almost as a flag in a database, or something you store in a separate queryable table, that lets you say: okay, this person has opted in or out of processing. And then one of my favorites is the right to data portability. This is the ability to port your data from your current service, whatever it might be, to another service, and the idea is that the data has to be transmitted in a machine-readable way. So perhaps for some app you use, you'd like to try a new and different app, and you want to request that your data be ported to that different app. For data science, this means you need to create outbound and inbound workflows or streams that allow people to transmit their data. I think data portability is a real boon to startups in general, because it's a bit like phone number portability: it used to be that once you had a phone number and everybody knew it, you were stuck with your service provider until you really wanted to take the big jump and tell everybody you had a new number. With data, we've seen entrenched leaders of data science and data collection; they've been there for decades now, and they've had the advantage of the data they sit upon. With data portability, this will hopefully start to shake things up and create more competition, because the idea that I can take my data with me and move it to another provider is pretty powerful, and also something that I think is a long time coming.

Hugo: Yeah, me too, and I think this is definitely a step in the right direction.
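A minimal sketch of the "flag in a separate, queryable table" idea Katharine describes for the right to opt out of processing: before any training job touches behavioural data, the pipeline filters down to users whose consent flag is set. The table names, column names, and data here are hypothetical.

```python
import pandas as pd

# hypothetical tables: `users` holds a per-purpose consent flag,
# `events` holds the raw behavioural data we want to train on
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "consented_to_modeling": [True, False, True],
})
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "page": ["home", "pricing", "home", "docs"],
})

def training_view(events: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    """Return only the events of users who have opted in to model training.

    Keeping the flag in a separate, queryable table means an opt-out (or a
    deletion request) takes effect the next time any pipeline reads the data.
    """
    allowed = users.loc[users["consented_to_modeling"], "user_id"]
    return events[events["user_id"].isin(allowed)]

print(training_view(events, users))  # user 2's events are excluded
```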
Hugo: I want to pick your brain in a minute about whether you think this is enough and what next steps would look like, but first: there's a term that you and I and the GDPR use constantly, and the term is "your data". Now, when I use Facebook, or Twitter, or whatever, what is mine? What do I own in that relationship?

Katharine: This is actually a subject of scholarly debate right now, and we're going to have to wait and see exactly how the regulators put it into effect. I'm not a lawyer by any means, but from some of the law articles I've read, the intention of the working group that created that article was that it not simply be the sign-up form; their working group notes specifically state that it should be any interactions and data that the user provides the company. So maybe that even goes down to your clickstream, maybe it even goes down to every post you have viewed. Probably it won't be enforced quite like that, but when we're collecting all this extra data, when we're tracking users, we need to think about what it means for the users who have said: no please, I don't want to be a part of this. How can we respect things like Do Not Track, and how can we make very clear and evident what we are using data for, so that the user wants to opt into it? Okay, if you provide me this, it's not that I'm going to give you more targeted ads, it's that I'm going to be able to offer you this extra feature. It makes us start thinking about data not as something we can use however we wish without asking, requesting every single permission on the phone, tracking people across devices, and so on. Maybe we should ask first, and maybe we should think about what data we actually need and provide a compelling product, so that people want us to use their data.

Hugo: And could this force a bunch of companies to change business models, in essence? Because there's the old trope that if you're not paying for the product, you are the product. You've literally got companies trying to take as much as possible because of the value, or let's say the assumed value, of the data.

Katharine: Right, this assumed value. One common thing I hear at data conferences, hanging out in the data science tracks, is people saying, oh yeah, just collect all the data and we'll save it in case we need it. Some companies have been doing this for decades; they essentially still have data from the early 2000s on users, and you're sitting there wondering: when are you actually going to use this data, and how much of it do you need? For somebody doing retargeting, this is of course their bread and butter, but for the average website or the average app: how much do you think people would be willing to pay to not be tracked, to not be targeted? Maybe you should start offering, similar to some of the products launched last week, a targeting-free or advertising-free experience. I'm hoping the consumer models also start to change around this. Of course, I have no idea what this will mean for the market five or ten years from now, particularly because most of the offerings so far that are targeting-free or ad-free are aimed primarily at EU residents, and sometimes not even available to US residents.
Hugo: So we've seen and heard what the GDPR will look like and what it will mean for civilians. What about the other side of the equation: what does it mean for organizations and for working data scientists?

Katharine: It means a lot more documentation, and a lot more understanding and sharing of exactly how data is processed, where it comes from, this idea of tracking data provenance, and what the data is used for. I think this is fantastic, because I have been feeling, and I don't think I'm alone, like I'm screaming into the void about documentation, testing, version control, normal software practices for data science. I think this is finally the moment where we clear the technical debt that a lot of data science teams have accumulated over time by not versioning their models, not having reproducible systems, not having deterministic training, and so forth. This will hopefully be a turning point where we can get rid of some of that technical debt, properly document systems, keep everything under version control with automated testing, and all of this will benefit you, because when you document all of this and share it, you're essentially fulfilling quite a lot of your duties under GDPR, one of which is the ability for people to opt out of processing. Having a process that allows data to be marked as opted out, and then documenting exactly what processing is applied, who the downstream consumers of that data are, where the data originates from, and under what consent it was given, covers quite a lot of what the GDPR requires of data scientists. The only thing really left out is deletion or anonymization of old data, which I think will spark a conversation around how we expire data, and how we treat data that is old, or from previous consent contracts, or that was purchased so we're not sure exactly how it was collected and under what circumstances. If you're in doubt, if you've gone through and documented all your systems and nobody has any recollection where a particular data set comes from, then you should either delete it or go through methods to anonymize it, if it's personal data at all. This is essentially a spring cleaning for data science, both in terms of our processing and our data sets.
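One lightweight way to start the kind of documentation and provenance tracking Katharine describes is a small catalog of dataset records, noting source, consent basis, retention, and downstream consumers. This is a sketch of the idea, not a prescribed GDPR artifact; all field names and values are illustrative.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetRecord:
    """Minimal provenance entry for one dataset, of the kind GDPR documentation asks for."""
    name: str
    source: str                  # where the data originates
    consent_basis: str           # e.g. "signup form v3", "purchased", "unknown"
    collected_on: date
    contains_personal_data: bool
    retention_days: int          # after this, delete or anonymize
    downstream_consumers: list[str] = field(default_factory=list)

catalog = [
    DatasetRecord(
        name="checkout_events",
        source="web app",
        consent_basis="signup form v3",
        collected_on=date(2018, 5, 25),
        contains_personal_data=True,
        retention_days=365,
        downstream_consumers=["churn_model", "weekly_report"],
    ),
    DatasetRecord(
        name="legacy_leads",
        source="acquired company",
        consent_basis="unknown",
        collected_on=date(2011, 3, 1),
        contains_personal_data=True,
        retention_days=0,
        downstream_consumers=[],
    ),
]

# a personal-data set whose consent basis is unknown is a candidate
# for the deletion or anonymization "spring cleaning" described above
for entry in catalog:
    if entry.contains_personal_data and entry.consent_basis == "unknown":
        print(f"review {entry.name}: provenance unclear")
```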
Hugo: I want to come back to this idea of data anonymization, but first I'd like to know: what's been the general response to the GDPR from organizations?

Katharine: I'm based in Germany, and the opinion here, from a consumer standpoint and I think from the media standpoint, has been very much that this is a good step. From the businesses, I think it has been costly, here and everywhere, to bring yourself into compliance before the due date. Now, I must remind everyone that everybody had two years to prepare for this, so it was not a surprise that it was going into effect, but for a lot of folks it has unfortunately been costly. I'm hoping the standards that have now been put in place were not a rush job, and have perhaps created better processing that actually allows for this kind of compliance in the long term. I also think there's been a boom, within Germany and Europe, of startups thinking about these problems, for example myself and Andreas with KI Protect, starting to think about what GDPR means in the long run: how do we guarantee better security and privacy and make this a commonplace thing, not a compliance thing?

Hugo: And am I right in thinking that this doesn't only apply to data from people in the EU, but also to data that is processed in the EU?

Katharine: I don't know all of the specifics, but I can say that one winner of GDPR is European data centers, because there's a provision about moving data outside of the EU: if data originates in the EU and you want to process it outside the EU, from what I understand you need to explicitly tell people, and they need to opt in to their data being processed outside the EU. So there has definitely been a bit of a pickup in data center activity here, and of course quite a lot of the large companies process most of their data in AWS and the like, which means that I finally have some instances in AWS Frankfurt; it was always hard to get GPUs and other resources there, and now we're starting to see some parity, which is nice. But this is something to think about: when we're moving data around to different locations and into the cloud, these are real computers in real data centers somewhere, and that has implications both for the security of our data and for compliance.

Hugo: I was reading that a number of tech companies that have offices in Ireland for tax reasons, among other things, might be moving their data processing out in order to perhaps not have to comply with GDPR, at least for the time being.

Katharine: That makes sense. In Dublin there are some very large offices for Apple and Amazon, as I understand it, and Google and so forth; that's usually the EU-based tech hub for the large corporations. I think this is probably changing the dynamics there, and perhaps also for a lot of the data processing that happens in Luxembourg.

Hugo: And what happens if companies don't comply?

Katharine: I think the process goes something like this: you're supposed to have a data protection officer, essentially the named person who handles any requests and compliance issues. First you get some sort of warning and they ask you to become compliant, and you have a short period to respond. If you don't, then you can get a fine of 20 million euros or four percent of global revenue, whichever is higher, so it's not a small fine; it's meant to hurt. For this reason a lot of people have been wondering whether regulators will go after small companies, small businesses where this might essentially bankrupt them, and we will have to wait and see how they plan on enforcing it.

Hugo: And this Data Protection Officer, the DPO, I think they're also responsible, if there are any data breaches, for informing the people affected and whatever the governing body is, within 72 hours or so, and personally, not just via a press statement?
Katharine: Yeah, there needs to be information sent out to any potentially affected users, as well as to the regulatory authorities, for any data breaches. And I believe this also covers data processor breaches, so if, say, you're reselling data or moving data to partners and your partner has a breach, then this is also your responsibility: they should inform you, and then you need to inform the end users. This will hopefully help us avoid instances like Equifax, where a company sits on the fact that there's been a security breach for two or three months, sells stock, and only eventually says, oh yeah, you may be a victim of identity theft.

Hugo: We'll jump right back into our interview with Katharine after a short segment.

It's now time for a segment called Data Science Pitfalls. As we're talking about data security, data privacy, and the GDPR today, I want to jump in and talk about the problems of data security and privacy inherent in the burgeoning space of machine learning APIs, such as AT&T Speech, Google Prediction, and IBM Watson, to name a few. We need to talk about certain aspects of machine learning and the ways in which it can endanger data security and privacy. So let's say you have a machine learning model that is exposed via a public API. Whoa, hold on there, let's explain those terms. Recall that a machine learning model can be trained on some training data to then make predictions on new data. For example, we could give a machine learning model images of people labeled with their names; this would be the training data, and it would train the model to recognize people from unlabeled images with certain degrees of confidence. Now, we are seeing more and more such models exposed via public APIs. What this means is that although I don't have the original training data or the details of the model, I can send an image to the model and get back a prediction. This seems fine, right? Well, it isn't, because there are increasingly sophisticated extraction attacks which, with no prior knowledge of a machine learning model's parameters or training data, are able to duplicate the functionality of, or steal, the model, and even recover training data. This is especially the case if the API exposes confidence scores, even in a multi-class setting. Stealing models and training data can be a huge problem, particularly when they are trained on sensitive data, have commercial value, or are used in security applications. We'll include some references in the show notes, including a paper called Stealing Machine Learning Models via Prediction APIs by a team of computer scientists at Cornell Tech, the Swiss institute EPFL in Lausanne, and the University of North Carolina. This is sure to be an area of data science and artificial intelligence to keep your eye on.
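To show why exposed prediction endpoints matter, here is a simplified sketch of a model extraction attack: the "attacker" never sees the victim model or its training data, only the answers an API returns, and uses those answers to fit a surrogate that mimics the victim. The attacks in the paper mentioned above are considerably more sophisticated (for example, solving for model parameters directly from confidence scores); this sketch only conveys the general idea, with made-up data and a local stand-in for the remote API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# stand-in for the remote prediction service: a model the attacker cannot see
rng = np.random.default_rng(0)
X_secret = rng.normal(size=(500, 4))
y_secret = (X_secret[:, 0] + X_secret[:, 1] > 0).astype(int)
victim = LogisticRegression().fit(X_secret, y_secret)

def prediction_api(x: np.ndarray) -> np.ndarray:
    """What a public endpoint typically returns: class probabilities."""
    return victim.predict_proba(x)

# the attacker only sends queries and records the answers...
queries = rng.normal(size=(1000, 4))
confidences = prediction_api(queries)
labels = confidences.argmax(axis=1)

# ...then trains a local surrogate that mimics the victim's behaviour
surrogate = DecisionTreeClassifier(max_depth=5).fit(queries, labels)
agreement = (surrogate.predict(X_secret) == victim.predict(X_secret)).mean()
print(f"surrogate agrees with the victim on {agreement:.0%} of points")
```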
After that interlude, it's time to jump back into our chat with Katharine Jarmul.

Hugo: Can you give me the rundown of what data privacy looks like currently, the current landscape of how everyone seems to think about it?

Katharine: As we've been thinking about this at KI Protect, we've been investigating where people are coming at this from across a variety of markets. Currently I see data privacy either as pay-to-play, so it's often an add-on you can buy for large-scale enterprise systems, where you say, okay, I have all these other things and I'd like you to implement this privacy layer here; this is something the enterprise database vendors have been working on for some time, which is great. Other than that, it's primarily compliance-only solutions: you have HIPAA, or financial compliance regulations, and these are databases or data processors focused on making sure your hospital data or your bank data is treated in a compliant way. That's mainly data storage and database solutions, which again is fantastic, but what does it mean if you actually want to use your own database and then use the data in a compliant way? There have been some interesting startups in the space trying to let you use a special connector to your database that does something similar to differential privacy, not quite differential privacy, because that is of course very difficult, or that employs k-anonymity, or something else like that. So there are a few companies essentially trying to be the query layer on top of your data sources, providing some sort of guarantee, whether it be differential privacy without necessarily a long-term privacy budget, or k-anonymity, or pseudonymization (I can't say it either). So it's a lot of these extra add-ons, and other than that, I think most of the privacy conversation has really been led by academia: Cynthia Dwork's research on differential privacy and its implications for machine learning, as well as some of the great research that Nicolas Papernot and some of the Google Brain security researchers have been working on. These have been amazing contributions, but it's research, perhaps implemented at Google, or in Dwork's work with Microsoft; as far as being available to data scientists at my own startup, say, it has really not been available in a real way.
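For readers who haven't met k-anonymity, here is a minimal sketch of what the guarantee measures: the size of the smallest group of records that share any one combination of quasi-identifiers in a release. The choice of quasi-identifier columns and the data below are hypothetical; real releases also need techniques such as generalization or suppression to raise k, which this sketch does not show.

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return the k of a release: the size of the smallest group of rows
    sharing the same combination of quasi-identifier values.
    A release is k-anonymous if every such group has at least k members."""
    return int(df.groupby(quasi_identifiers).size().min())

# hypothetical release with zip code, age band, and gender as quasi-identifiers
release = pd.DataFrame({
    "zip": ["10115", "10115", "10117", "10117", "10117"],
    "age_band": ["30-39", "30-39", "20-29", "20-29", "20-29"],
    "gender": ["f", "f", "m", "m", "m"],
    "purchase_total": [120.0, 80.5, 42.0, 13.7, 99.9],
})
print(k_anonymity(release, ["zip", "age_band", "gender"]))  # -> 2
```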
Hugo: You've mentioned, or we've discussed, a variety of techniques, such as anonymization, pseudonymization, there we go, differential privacy, and k-anonymity, and we'll link to a bunch of references on the nuts and bolts of these in the show notes. But my real question is: can we really anonymize data?

Katharine: This is, of course, the subject of much debate. The gold standard is differential privacy, and the idea of differential privacy, a fairly simple equation when you actually read it, is that I would not know whether you, as an individual, were part of a data set based on the queries or the data that I see from it: within a very small epsilon, the results look essentially the same whether or not you are part of that data set. This is really elegant theory, and I highly recommend reading Dwork's work on it. But in terms of actually implementing it in the way we use data science today, most of the ways we guarantee differential privacy use what is often referred to as a privacy budget. This budget essentially tracks, almost like information game theory, how much information about any individual was gained by the other party via the queries or the data they accessed, and once the privacy budget reaches a certain level, we say: there can be no more queries that might reveal more information about this individual. This is difficult for two reasons. First, in practice we often have changing data sets: the data set I can guarantee privacy on today and the one I can guarantee privacy on tomorrow are ever-changing; we gather more data as time goes by, we connect more information about a particular individual, and the more we do this, the less we can guarantee privacy. Second, keeping the privacy budget indefinitely would mean that eventually our data could no longer be used at all, because we would hit the budget's limits, and unless the budget is reset, that analyst or data scientist cannot query any more information related to that individual. So the differential privacy that's been implemented in practice, for example by the Apple differential privacy team or in some of the work Google has been doing, usually uses a privacy budget within a limited time period, resetting every day or every few days.
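Here is a toy sketch of the privacy budget idea Katharine describes, using the Laplace mechanism for a single counting query; the Laplace mechanism isn't named in the conversation, but it is the textbook way to make a count epsilon-differentially private. Real deployments, and the composition theorems behind them, are considerably more involved; this only makes the moving parts visible, and all the data is made up.

```python
import numpy as np

class PrivateCounter:
    """Answers count queries with the Laplace mechanism while tracking a simple epsilon budget."""

    def __init__(self, total_budget: float):
        self.total_budget = total_budget
        self.spent = 0.0

    def noisy_count(self, values, predicate, epsilon: float) -> float:
        """Return a differentially private count of values matching `predicate`.

        A counting query has sensitivity 1 (adding or removing one person's
        record changes the count by at most 1), so Laplace noise with scale
        1/epsilon gives epsilon-differential privacy for this single query.
        """
        if self.spent + epsilon > self.total_budget:
            raise RuntimeError("privacy budget exhausted; no more queries allowed")
        self.spent += epsilon
        true_count = sum(1 for v in values if predicate(v))
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
        return true_count + noise

# usage with a hypothetical column of user ages
ages = [23, 37, 41, 29, 52, 33]
counter = PrivateCounter(total_budget=1.0)
print(counter.noisy_count(ages, lambda a: a > 30, epsilon=0.5))
print(counter.noisy_count(ages, lambda a: a > 40, epsilon=0.5))
# a third query at epsilon=0.5 would raise: the budget is spent,
# which is the behaviour the daily resets at Apple and Google work around
```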
Hugo: When we think about anonymized data within any organization, and in particular from the point of view of the civilians who use these products, one really important question is: how do we know how our data is being used? My question for you is, how technical and how educated do civilians, users on the ground, need to be to understand what's happening with their data?

Katharine: This is interesting, and something Andreas and I have been thinking about, maybe doing a series of articles that explain how privacy works and how de-anonymization really works at scale. I think the average data scientist has heard about the Netflix Prize, or about the New York City taxi data, in the sense that, with an informed adversary, or an adversary with access to some open data, it is quite easy to de-anonymize large-scale data sets. But if I were to ask my mom, say, or my sister: hey, do you know that if you upload that extra thing to Facebook, and any of your Facebook data leaks, the ability for somebody to de-anonymize you is essentially guaranteed? Maybe not; I don't think we're there yet in terms of the public conversation. But I do think breaches make this visible, like the running application that recently released an open data set, Strava I think it was called, whose data essentially leaked information about so-called private, or secret, US military bases across the world, because the aggregated fitness-tracker data showed where its users ran.

Hugo: Yeah, you could see the military compounds on the map, in key locations in Middle Eastern and African nations.

Katharine: Yeah, and this is what happens when we aggregate data. It's especially a danger for people releasing public data, or even selling data or sending it to a partner: when we aggregate data, even if we say we've so-called anonymized it, data in aggregate can also release secrets. They may not be secrets about an individual anymore, but they may be secrets about the group that uses your application.

Hugo: It saddens me to say this, but we're coming to the end of this conversation. Something I mentioned at the start is that GDPR, as you've said as well, is very necessary and timely. My question for you is: is it enough? What do we need to be doing, or what would you like to see in the future, to make further steps in this direction?

Katharine: GDPR by no means guarantees anonymization, and that, I think, is something we should really push for within the data science and machine learning community: how can we solve this very difficult problem, or at least make some inroads, so that when there's a security breach, or somebody gets their laptop stolen and, oops, they had a bunch of customer data or other sensitive data on it, maybe we can stop these things at the source? Maybe we don't always need to use complete personal data to build a model; maybe we can start thinking about how to privatize our data before we even start the data science process. Again, this is definitely something we're thinking about and working on at KI Protect, but it's something I really hope we can push forward as a field. It also has interesting implications for ethics: there's a great paper, again with Cynthia Dwork as primary author, comparing this idea of differential privacy to the same basis as ethics, in the sense that if you do not know my race, or my gender, or my age, you have the potential to build a fairer model. I think these ideas have interesting overlaps and implications for our industry, and I really hope we start to think about them as a wholesale solution, not just as compliance, "I have to do this much." That's what I'm hopeful for, and something I look forward to seeing more of in research and chatting more about with my peers.

Hugo: I like that, because it sounds in a certain way like mindful data science: you don't just take all the data you have, throw a model at it, and see what comes out. You think about the implications of any data you share or expose, both to your team internally and to anyone externally, and you ask, essentially, would I want somebody to do this with my data? The golden rule of data science.

Katharine: Yeah.
Hugo: Great. So, do you have a final call to action for all our listeners out there?

Katharine: Sure. You can check out all of the work we're doing, and we're still looking for feedback with KI Protect, so if you want to reach out, we're at kiprotect.com. And if you're working within this space, if you're thinking about these problems: keep at it, you're not alone, and feel free to reach out. I think we need to create a really vocal community within data science saying that these things are important, that they're essential, and that this is not only for researchers, although I'm really a big fan of what the research community has been doing. This is also something practitioners care about, and we want to be able to implement what we're seeing in research, the advances in potentially guaranteeing privacy-preserving machine learning; we want to see this within the greater community and within the tools and open-source projects that we love and use.

Hugo: Katharine, it's been such a pleasure having you on the show.

Katharine: Thanks so much, Hugo, I really appreciate it.

Hugo: Thanks for joining our conversation with Katharine about data security, data privacy, and the GDPR. We saw that the GDPR gives consumers more rights and knowledge as to how their data is used, such as the rights to delete your data, to know how your data is used and processed, and to port your data to other services. We also saw that, for organizations and working data scientists, although there is significant work to be done, it will result in paying off a whole lot of already incurred technical debt, in terms of developing better documentation and building better systems for tracking and managing data, with a focus on user-centered design. And we saw, once again, the growing importance of data scientists thinking about the implications of their work and building ethical models, and the importance of both privacy and consent by design. Do we really need to incorporate a user's zip code into a model? Do we really need their last thousand purchases? And why do I need access to their personally identifiable information? Does anyone really, outside of perhaps customer service?

Make sure to check out our next episode, a conversation with Jonathan Nolis about organizing data science teams and the do's and don'ts of managing them, where we dive into best practices for hiring data scientists. Jonathan is a data science leader in the Seattle area with over a decade of experience; he currently runs a consulting firm helping Fortune 500 companies with data science, machine learning, and artificial intelligence. We also tackle questions such as: what is the more important skill for a data scientist, the ability to use the most sophisticated deep learning models, or being able to make good PowerPoint slides? The answer may surprise you, but then again it may not; there's only one way to find out. I'm your host Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and DataCamp at @DataCamp. You can find all our episodes and show notes at datacamp.com/community/podcast.