#4 How Data Science and Machine Learning are Shaping Digital Advertising (with Claudia Perlich)

The Evolving Role of Data Science in Online Advertising: A Conversation with Claudia Perlich

Claudia Perlich, a data scientist with a background in artificial neural networks, shifted her focus from more complex models to simpler, more interpretable ones such as logistic regression. She values transparency and elegance in her models, and logistic regression has been her go-to tool for the last 10-15 years.

Perlich's change of heart was driven by her desire to make a difference in the world. Initially she was drawn to the excitement of fancy models, but she came to realize that the goal was not to find the most complex or sophisticated model. Instead, it was about communicating effectively with people who might not be familiar with technical jargon. She notes that sometimes the worst possible model - the nearest neighbor - can actually have an advantage over more complex ones because it doesn't learn anything new. However, this model is often difficult to understand and may not perform well.

The key to success lies in establishing trust and getting stakeholders involved. Perlich suggests that this can be achieved by using models like nearest neighbors, which let others see what is being done and how the model makes its predictions. This transparency is crucial for building trust, especially when working with clients who may not have a technical background.

Perlich's approach has been successful in her work with advertising companies, where she uses look-alike models to identify consumers who are similar to existing customers. These models are intuitive and relatable, which makes them more accessible to stakeholders who might otherwise hesitate to adopt complex algorithms.
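A minimal sketch of the look-alike idea, assuming each user is represented as a set of visited sites and that similarity is measured with Jaccard overlap. Both choices, along with the function names and the toy data, are illustrative assumptions and not a description of any production system discussed in the episode.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two sets of visited sites."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def lookalike_scores(
    customers: Dict[str, Set[str]],
    prospects: Dict[str, Set[str]],
) -> List[Tuple[str, float, str]]:
    """Score each prospect by its single most similar existing customer.

    Returns (prospect_id, similarity, nearest_customer_id), best first.
    Assumes `customers` is non-empty.
    """
    scored = []
    for pid, p_sites in prospects.items():
        nearest_id, best = max(
            ((cid, jaccard(p_sites, c_sites)) for cid, c_sites in customers.items()),
            key=lambda pair: pair[1],
        )
        scored.append((pid, best, nearest_id))
    return sorted(scored, key=lambda row: row[1], reverse=True)


# Toy, made-up browsing histories (hypothetical data for illustration only).
customers = {
    "c1": {"cars.example", "suv-reviews.example", "news.example"},
    "c2": {"cars.example", "loans.example"},
}
prospects = {
    "p1": {"cars.example", "suv-reviews.example"},
    "p2": {"recipes.example", "news.example"},
}

for pid, score, nearest in lookalike_scores(customers, prospects):
    print(f"{pid}: similarity {score:.2f}, most like customer {nearest}")
```

Every score comes paired with the existing customer it was derived from, so a stakeholder can inspect the match directly. That traceability is the kind of transparency the summary attributes to nearest-neighbor approaches.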

In contrast, more sophisticated models may struggle to explain their results or to make clear what is happening under the hood. Perlich notes that this lack of transparency can lead to a failure to communicate with stakeholders, which can ultimately undermine the success of the model.

Despite her shift towards simpler models, Perlich remains committed to using advanced techniques and technologies. She encourages aspiring data scientists to keep their curiosity and skepticism sharp, but also to avoid taking themselves too seriously. By doing so, they'll be better equipped to identify potential pitfalls in their work and build more effective models that actually make a difference.

The conversation with Claudia Perlich highlights the importance of transparency and communication in data science, particularly when working with stakeholders who may not be familiar with technical jargon. Her approach offers valuable insights into the role of data scientists as communicators and problem-solvers, and provides a cautionary tale about the dangers of over-reliance on complex models without adequate explanation.

Trust is also crucial in building effective relationships with clients and stakeholders. Perlich's experience working with clients who were initially skeptical of her approach has shown that trust can be established through transparency and communication. By involving stakeholders in the process and making them feel part of something, it's possible to build a sense of ownership and accountability.

The final takeaway from our conversation with Claudia Perlich is that simplicity and elegance are not mutually exclusive with advanced techniques and technologies. In fact, they can often complement each other beautifully. By striking a balance between complexity and communication, data scientists can create models that truly make a difference in the world.

This approach requires ongoing dialogue and a willingness to adapt to changing circumstances. As Perlich notes, sometimes it's necessary to start with a model that may not be ideal - the nearest neighbor, for example - but one that gets the job done. By being cautious and responsible, data scientists can build models that are reliable, effective, and most importantly, explainable.

Ultimately, the role of data science in online advertising is about more than just building complex models or using advanced techniques. It's about making a difference, communicating effectively with stakeholders, and building trust through transparency. By adopting this approach, aspiring data scientists can build a brighter future for themselves and those around them.

In this episode of DataFramed, a DataCamp podcast, I'll be speaking with Claudia Perlich, chief scientist at Dstillery, a role in which she designs, develops, analyzes and optimizes the machine learning algorithms that drive digital advertising. We'll be discussing the role of data science in the online advertising world, the predictability of humans, how her team builds real-time bidding algorithms and detects bots online, along with the ethical implications of all of these evolving concepts. I'm Hugo Bowne-Anderson, a data scientist at DataCamp, and this is DataFramed.

Welcome to DataFramed, a weekly DataCamp podcast exploring what data science looks like on the ground for working data scientists and what problems data science can solve. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and you can also follow DataCamp at @DataCamp. You can find all of our episodes and show notes at datacamp.com/community/podcast.

Hi Claudia, and welcome to DataFramed.

Well, thank you so much for having me.

It's such a pleasure to have you on the show, and I'm really excited today to be talking about how data science and machine learning are shaping and reshaping digital advertising. But before we get there, I'd like to find out a bit about you.

Okay.

What are you known for in the data science community?

So I have a couple of different hobbies, even within data science. I think what most people may know me for, and that's almost where my, quote, fame started: I used to participate in a lot of data mining competitions. You may recall the Netflix Prize, where you could make a million dollars if you were substantially better at recommending movies than their existing algorithm. Much earlier than that, there had been, in the geek world of machine learning, competitions where people try to build the most accurate model on a data set that was provided by the organizer, and I've been participating in those for quite
a while, and then I won three in a row between 2007 and 2009: one on breast cancer prediction, another on traditional telecommunication, and also one on the Netflix data set, although we didn't get a million dollars on that one. So that's a little bit of my initial claim to fame, which really helped me be perceived as a kind of hardcore part of the machine learning community.

Great. And so I suppose one of the most famous platforms where this can happen now is Kaggle.

Yes, so Kaggle is basically the next generation, when it became much more mainstream. As machine learning and big data picked up, they provided a very nice interface where now not just organizers of a conference but, in general, nonprofit organizations and companies all have a very easy way of interfacing with a huge community of thousands of people who have fun building these models.

So this speaks to some of how you got into data science, but I want to probe a bit more into what there is in your background that led you to data science. What type of skills did you develop, what jobs did you have, or what did you study that led you to data science?

Right. So I grew up in East Germany, and other than knowing that I was good at math, I didn't really have any convictions and no clear idea what I wanted to do with my life. So my dad took me aside and said, look, they will need computers everywhere, so why don't you study computer science when you feel good in math, and you should be fine with that. And that was really how I picked my first choice for my undergrad being computer science. His words were even more prophetic if you think about what happened to data science, now being really extremely hot in all kinds of different application areas. I migrated into data science in '95 as an exchange student at CU Boulder, when I took my first class on artificial intelligence and artificial neural networks, and just loved the fact that you could learn so much about the world and all of these different fields by
looking at it through the lens of data, aside from these kind of very measurable challenges of building the best possible models. So that just appealed to me from the first time I basically dipped my toe into data.

And I think this is a nice lens through which to view the emergence of data science, from your background, right? Because you were good at math, you studied computer science, and then you moved into model building and predictive analytics, and these are, I suppose, three of the things which people associate most with whatever data science is these days.

The challenge with data science, and I would argue even today with artificial intelligence, is that it really depends on the background of the person you're talking to. So I think overall you are right with your characterization. Generally there is a sense that you also want statistics as well as some domain knowledge in the mix here. I personally have found that the fact that for my PhD I moved from computer science into a business school here at NYU really shaped my focus away from the purely algorithmic towards being much more interested in what kind of problems you can solve with these tools and algorithms. I think this is really where the birth of data science in its kind of broad application originates: it's in using data in combination with algorithms to solve some very specific problems. And so in that way it's actually domain-centric and question-centric. It should definitely start with a good question. You can't just jump into a data set and dig around there and hope to find gold, as it is sometimes portrayed. You do need to first understand what the things are that you want to do. Is there any decision you can make? Because if there's nothing you can do, then you don't need to waste your time on looking around in data. So I really like to start with the constraints of the problem, then being led down the path towards what data is most appropriate and what algorithm can help me solve it.
This is great, because I really wanted to talk about your work in digital advertising and your work as chief scientist at Dstillery. So perhaps you can speak to the types of questions that you use data science to solve in that work.

So before we get into this, the story of my life really progresses from getting a PhD in information systems in a business school, and deciding that rather than taking an academic career path I really wanted to focus on these applications, which brought me to IBM Watson, where I stayed for six years. And then I was lured into the world of digital advertising, really by almost the promise of a golden land. It's like a big sandbox to play in. Digital advertising has an incredible data footprint where you can experiment and really push these algorithms to the limit. So it's a huge experimentation field where you can understand how well you can predict human behavior, which is typically somewhat limited but a lot better than random, and when these methods, in this conversation between correlation versus causation, can help you to understand causality and when they don't. So for me the excitement really was in not just the sheer amount of data but also the ability to try these things, put these algorithms to the test and see how they perform in the real world.

Great. So can you give me the elevator pitch on digital advertising?

I don't think I can give you an elevator pitch of digital advertising. I can tell you what we do and what I've really been trying to do for the last eight years. So you may be familiar with the rise of the programmatic advertising world, and what that means is that ads are now being sold in real-time auctions. So every time you interact with your digital device, whether this is reading news stories on the web or using an app, you will be exposed to many different ads, but these ads really were bought in real time, as the page loaded, in one of these auctions. And so what Dstillery specifically initially started out doing was the
promise of being able to pick extremely selectively the right person and moment by using predictive modeling that informed the automated bidding when such a good opportunity showed up, and then adjusting bid prices for that. So this is the core promise of programmatic advertising, where you see everybody all day long and you can choose with very high precision when to interact with a customer.

Now around this core problem there are a lot of kind of interesting, quirky, fun other things that can keep a data scientist excited. So we have, for instance, problems around fraud. The moment you have an open market where people can buy and sell, the same way you need reputation on eBay, all of a sudden even in that environment you run into scenarios where this is not really a real person who is sitting there, who may or may not see the ad, but in fact it's a bot that was written for the sheer purpose of selling ads. So that was a very interesting discovery back in 2012, where our models were really, really good at predicting certain outcomes, and the thing is, usually people are not that predictable. And then we tried to understand what was going on. So the performance was too good to be true, because bots really are deterministic; they're easy to predict. And that was one example of these side problems that we're having.

On the other end of the spectrum, we're already spending all of this effort of machine learning and AI on huge data footprints just for bidding, but then the clients come back and say, well, this is really amazing, you have this great performance, what did you do? And other than shrugging and saying, well, we built a predictive model in, I don't know, five hundred thousand dimensions, that wasn't the right answer, because they really wanted to understand what we may have found out about their potential future customers. And so increasingly now we are looking into translating back what this artificial intelligence kind of found in these vast different behavioral patterns to
be able not just to choose the right ad but to answer more strategic questions about why do my customers actually buy my brand, what's their perspective on the value proposition of the product. And some of this you'll find encoded in these models that are very good at predicting, and now increasingly we work with augmented reality to just give this information back to brands to help them understand what the customers are really doing.

Yeah, absolutely. So there we have a question posed in non-data-science terms. I mean, we have a job to do, to build algorithms that will predict whether people will click or not in order to make real-time bids in these real-time auctions, but then another non-data-science question emerges, which is what are our customers doing, and why? So it's really a translation in both directions, which I think is incredibly interesting, and it really speaks to this idea that data science doesn't exist in a vacuum, as you stated initially, that it responds to real-world questions, and we also need to, as data scientists, translate our results to non-data-science people, be they customers or managers or people we're consulting for.

One of the most interesting quirks to this is, as you said, we're predicting people clicking. We actually learned the hard way that it's a really bad idea to use powerful machine learning to learn when people click on ads. The reason is, people occasionally click on ads because they're interested in the product, but much more often it's an accident, when you're trying to either close it or just change the window. But the fact that it's an accident doesn't mean that it's random, and these algorithms are good enough to find out how to predict the accidents. And it turns out it's much easier to predict accidents, because they're typically contextual, in that people don't pay attention. So for instance, we see very, very high click-through rates on the flashlight app, because people fumble in the dark and there's a
very good chance that you accidentally hit the ad. Other scenarios might be just people with very kind of bad fine motor skills or eyesight problems, who just tend to be more prone to clicking, which definitely doesn't mean that they're interested in the product. So there is this interaction between really smart technology and optimization metrics that were kind of okay for a very long time, but now we have to educate back, saying it is not a good idea to combine these two things, because the technology is actually too powerful. We need to think a lot harder about what we're optimizing for, because we may end up doing the exact wrong thing.

And I suppose what's telling is that in all these cases the things that are easiest to predict are the things you don't want to predict, whether it be bots clicking or people using the flashlight app. I think I saw you give a talk once in which you had another very telling example, of wanting to advertise airport-related stuff or travel-related stuff to people at airports. I'm gonna get this wrong, so I'd love for you to tell me that example.

If I remember correctly, it's really a progression of what I just talked about. So if you really listen to me saying we have bots that go and visit websites, and then we have accidental clicks, so what should we optimize towards? And one thing I tried to do is say, well, nobody accidentally goes to a physical location. What we could try to do is predict whether people will go to a store or an interesting location. And so I did two different experiments. One was predicting who will go to a car dealership, and that one actually was very successful. It was really interesting to see the market research, the different brands that prospective buyers look at before they then choose and go to that Mercedes car dealership. And since it worked so well, and there is this other group that everybody always wants to reach, which is the frequent traveler, presumably because they have a lot of money, probably a bad conscience, and need
to bring gifts home, so it's always a great audience to reach. And where would you expect to find them if not in airports? But it turns out that the people who are much easier to find in airports are all the people who work there, from the baggage handlers to the people at the check-in, and they spend their whole day on their digital devices, and because they're there every day they have very typical patterns of behavior. So again, by not thinking very clearly about what else could explain what you're looking for, we found mostly employees of JFK rather than the elusive frequent traveler.

Now it's time for a segment called Tales from the Open Source. Today we're going to hear from Alan Nichol, co-founder and CTO at Rasa, who develop open source tools for conversational software. Hi Alan.

Hey Hugo, thanks for having me on the show.

It's such a pleasure. What are you gonna tell us about today?

I'd like to give a bit of background on a pair of open-source tools for building conversational software, so that means chatbots, voice assistants and anything else where you interact with the computer through natural language. Rasa NLU is the first part, and it says, okay, I've received a message, what is this person trying to say? And then Rasa Core is the second part, and it takes that information, looks at the history of the conversation and decides what to do next. They're relatively new, but they're already used by thousands of developers.

That sounds really cool. How did this originate?

Back in 2015 we were building some bots on Slack and we realized that the developer tools just weren't there to build really great conversational experiences. And then in 2016 Facebook announced that they were going to open up Messenger as a platform, which meant that tens of thousands of developers would be building chatbots using keyword matching and a thousand if statements, and we knew from experience that that just doesn't scale. So Rasa NLU and Core are built entirely on machine learning, and everything that
your system understands and is able to do is learned from real conversations. We're trying to get people to move away from thinking of conversations as flowcharts, because it just doesn't work for anything beyond the hello world.

And what's been the response from the developer community at large?

We were totally blown away when we first released Rasa NLU, and I think it was just the right time in the right place. It was a few months after the Messenger platform opened up, and everyone still seriously developing bots wanted to be independent of the language understanding APIs that the big tech companies were offering. We saw a huge amount of traffic right away, and we've seen over 60,000 downloads in the first year. We learned so much from our users and contributors. Open source is really unique in that way: people get stuck, and then they care enough to really write a good issue, and lots of them then actually go back and contribute to the project, which is amazing. Rasa also builds on a bunch of amazing open-source libraries like spaCy, scikit-learn and Keras.

So what's next for Rasa?

Our take is that Rasa NLU and Core are meaningful steps in the right direction, but making natural language interfaces to computers is an infinitely hard problem and we've definitely not solved it. So we do a lot of applied research and sometimes we publish papers, but mostly we care about turning that research into code people can actually use. There are so many ways that natural language understanding and dialogue need to improve, and I'm really excited about the things that we've got on the roadmap for Rasa.

Thanks, Alan, for that dive into Rasa and open source conversational software. For those interested, check out rasa.ai; that's spelled R-A-S-A, in case you can't understand our silly accents. Thanks once again, Alan, real pleasure to chat.

Let's now jump back into our interview with Claudia Perlich. You've hinted at this, but part of your job is to predict human behavior, and you've hinted that humans
aren't so predictable, but how predictable are we?

Well, what I was really fascinated by, and this has nothing to do with advertising: you may follow these competitions that AIs are now engaging in, starting with the chess games back in the 90s; more recently we had Jeopardy, which Watson won, and Go. But none of them really has that much to do with predicting human behavior, and more so with strategy. What is fascinating is that apparently now we have algorithms that can predict when people try really, really hard to be not predictable. So apparently we finally have beaten the world's best poker players using algorithms that analyze faces, and apparently we give away a lot more than we think. But that's kind of a side fun story. [inaudible] not to the machine: it may work with other people, but the machine can still see right through it.

But in our daily activity I think a lot about us is very predictable, at least in my case. I mean, you can very quickly figure out what my daily habits are. The other thing that is very predictable is all kinds of consideration activities: things that require gathering information and take a couple of, I don't know, days, weeks, months to come to a conclusion. So for instance, buying a new refrigerator: for most people that's a serious thing, and you spend some time thinking about it, and you will have plenty of digital traces of that activity that help marketers identify, oh yeah, these people are in the market for that product. When you look at the more kind of spur-of-the-moment activities, in those cases you might be able to identify that this person is even prone to buy, on short notice, an apple while walking over the Union Square market, because the person likes apples. But can you predict that at this particular moment the person will feel like buying an apple and see one that they want? Probably not. So there is a huge range. And you have probably heard about one of the studies done on
Facebook data that was really concerning from a privacy perspective. This is not about predicting future behavior, but what researchers showed is that there are a lot of parts of our personality and behaviors that we may not want to make public but that can be very easily inferred. So for instance sexual orientation, political perspectives: all of these things are really easy to infer for a machine learning algorithm just given your day-to-day activity.

And so that isn't something necessarily that will help Dstillery do their job though, right? Or is it?

This is typically not what we're interested in, in the sense that we are looking to optimize very specific metrics given by the marketer, such as the number of new customers signing up for a service on their website, and for that I'm not really interested in any of these concerns about who you really are and things like sexual orientation. And it's also perfectly anonymized, in the sense that I don't know any personally identifiable information about you, and even your browsing and digital activity can basically be hashed and obscured so it doesn't really mean anything.

Now this being said, I think in general when you now turn it around and go back to the client and they want to know, but what did you find? I think at that point the boundary of where I start infringing on the audience's privacy when I infer certain correlations with character traits becomes really interesting. And one of the examples, for instance, even from our frequent traveler: we did find flight attendant sites, but we also found gay dating sites. Now is this something that I should be seeing, something I should have such easy access to? And often when I tell this story, people in the audience feel somewhat ill at ease about the fact that this is even something that can be that easily revealed. And this is something relatively new in the technological landscape, so presumably even legislation hasn't caught up with these types of
challenges yet. From a societal level, we are really struggling right now finding new ways of possibly self-constraining how we interact with these technologies and where the line is.

What type of data do you have access to that helps you predict human behavior at Dstillery?

The data sources in digital advertising are really coming from many, many different places, and different players have very different access rights, so you do see specialization. Obviously Facebook knows everything you do on Facebook and will provide versions of that data to their advertisers. We have access, through these real-time auctions, basically to every auction that happens, and we're talking about 100 billion events every day. So 100 billion times we are being told that this particular device is right now looking at this particular content, and so you have this constant stream that ultimately then gets assembled into an activity history of very granular nature, like the URL of the news, for instance, that you read, alongside location information if the bid request came from your mobile device. So if you're just standing on the corner, you're bored and you're playing, I don't know, Candy Crush 15, and there's an ad running, your phone just told me that you're standing there, unless you were very diligent about switching off the GPS. And so one of the primary sources is actually the environment itself through which ads are being sold, and in addition to that you also have many data vendors who are providing additional information that they have collected, of similar granular form.

Interesting. So do you have access to, for example, how much time people spend on websites, even like cursor activity, if they have other apps open, any of this type of stuff?

So the details of your web activity typically remain behind the scenes. What you do with your cursor really requires integration in your browser, which is far beyond anything that is available kind of broadly in the advertising
environment. Now sometimes the ad itself could have a technology that, for instance, tracks how long it is in view and whether or not you went with the cursor over it. Something similar happens in the mobile space when you use your digital devices. What's called the SDK is basically, almost like the operating system, the fundamental software that is underlying most of the apps that people develop, and they themselves might collect data about what you do with the app, but also about other apps that are installed. There's kind of an ongoing attempt from, for instance, Apple to enforce that apps stick to the rules and only look at their own data and only share their own data, but there's a lot that comes directly from these deeply integrated parts of the software stack that is providing you apps.

We've talked about the masses of data that businesses such as Dstillery have with respect to how a lot of people are interacting with online environments, with the online world, which is taking up more and more of our daily lives. What are ways in which these masses of data can be used for social research?

So there have been really great pieces of social research recently, and I think especially around the rise of fake news and how people interact with information and propagate information, which lead ultimately to these kinds of informational bubbles that are being enhanced by the AI itself that's trying to predict what you may want to read about. And so there are a number of researchers doing incredibly important work, because it goes beyond just understanding the social nature of our modern generations; it really comes to the fundamental questions of how do we now progress with democracy moving forward if we no longer have even a remote hope for objective information when things are shared algorithmically. And so there is some of the work Gilad Lotan has done; I also really recommend the work that's done at the Data & Society group here out of New York City; and then you
also have various of these pieces coming from Google and Microsoft. So you see a much increased need for this understanding, and now we also have much more access to what people actually do and to understanding how these processes work.

So consumers have a relatively complex relationship with advertising. What value can data science in the online advertising space add to the consumer's experience?

My overarching sense is that by helping valuable content to monetize, we are part of the ecosystem that allows publishers to be ultimately somewhat independent. With the decline of subscriptions, many publishers, and even blogs where everybody can express themselves, have to rely on advertising as a primary source of income. I'm not necessarily convinced that we are truly providing much-needed information. I think having a possibly less disruptive experience, and having advertising be part of the fabric, in that it fits both my interest and the kind of topic of the site where it's being displayed, is an acceptable compromise. Whereas what I'm seeing with a lot of concern is that advertising has increasingly focused more on viewability as a metric, meaning for instance that advertisers only want to pay for viewable ads. This initially makes sense from the perspective of the advertiser, but it then puts the publisher in a really difficult position, because as a result of it you have these absolutely terrible experiences as the user, where the ad is kind of following you around on the page and there's no way to get rid of it. And then you have of course high click-through rates, because every time you try to close it something happens. And eventually I think that really only fosters the installation of ad blockers, which now becomes a bigger concern to publishers that are trying to provide independent and free content to readers, if large groups of readers install ad blockers.

That's a great point, because I think there is a certain balancing act that we're talking about, whereby as a publisher you want to get
your stuff out there as much as possible, but you don't necessarily want to spam people, at least not to the point of annoying them enough for them to take that type of action.

Yes.

So we're kind of going down the path of discussing a few ethical implications of your work and of data science in general, and I'd like to go down this path a bit further. You hinted at the idea of biases in data and algorithmic bias, and I was wondering if you could speak some more to the challenges involved in these areas today.

So with the vast deployment of automated systems, there have been an increased number of concerns on the ethical side about the implications that these algorithms may have. The simplest or earliest one I think you could refer to is the information bubble, where the algorithm isn't necessarily biased, it's just really good at figuring out what you'd like to hear. As a result, when it comes to more important things like political information, if you only hear the side of the story that you like to hear, not only does it reinforce your position, but it gives you the delusion of being absolutely right and certain about it, and you no longer have to question yourself or seek dialogue with other opinions. So I think this is one of the early concerns, and it has nothing to do with the technology even being biased, but with the way our brain processes information in interaction with a kind of pre-selection that appeases us very much.

And this is even known now in popular culture as the echo chamber, right?

Exactly, this is another term that we have for it now. The next generation of concerns that were brought forth were with respect to uses in areas such as predictive policing, or even something as simple as recommendations on various job sites, where the concern is that, for better or worse, our society has certain biases. Our behavior is not up to the overall standard that we want it to be, and as a result, if you now train models on behavioral data where for
instance you have never hired a woman for this position, you therefore have no data of a woman ever being successful, and therefore none of the candidates that the model finds will be female. So the concern is that we could somewhat accidentally propagate or potentially even increase biases that existed in the data used to build a model, which now behaves exactly as we used to and not necessarily as we would like it to.

So this is an example, as you state, of algorithms encoding already existing societal or human biases, which is something we need to be very cognizant of moving forward. I know something else you're interested in, though, is the ability of algorithms to create their own biases, which may not even exist in the data, so I'd love to hear your thoughts on that.

So in my experience, and we touched on this earlier when we talked about bots and clicks, and even the people working in the airports, one of the things I understood is that ultimately when you build a predictive model, it's doing exactly this: it's going to find the easiest thing to explain, wherever it finds the most signal or the most information. That becomes a problem when different groups of your population have more or less signal, more or less information. The example that I like to bring forth for people to consider: if, for instance, a group of people has consistently lower usage of technology, and as such I have fewer data points about a person in that group, I would be much less likely to target that person either with advertising or with a job offer, simply because the model can never quite be sure that this is the right choice to make, and there are other, easier things to predict. So if you look for instance at jobs, if for some reason it is easier to predict success for one gender than the other, although both are equally likely to succeed, what happens if you simply use your algorithm to rank candidates? You can easily see very strong majorities of the same gender in the top ten candidates
presented, although originally 50% of the people who succeeded in that pool were actually male, so there was a balanced male/female representation in the data. That's the concern that I'm having with a lot of the conversations today around making sure that your training set is unbiased. It is not enough to ensure that your training set is what I call first-order unbiased, meaning you have the exact representation that you want. You still have to take responsibility for taking action on the predictions, because the predictions can be biased again, and then you as the user or as the platform have to make a choice to present an equalized outcome, to pick the top n candidates for both genders, for instance.

And you're aware of this and clearly trying to do these types of things in your work, but do you think enough people are aware of this, and if not, is educating them part of our job?

So I had an interesting experience. I gave a keynote at Predictive Analytics World this fall in New York City, and I spoke exactly about this. After my presentation the general chair walked up and asked the audience, "How many of you knew that this happened?" I think intuitively most data scientists are kind of aware of it, but in this audience, I would say out of maybe 200 people we had 10 or 15 hands going up. The rest of them may have had an inkling, but had possibly not fully thought it all the way through to the implication of what that means, and the even bigger challenge of what to do about it now. So I still like to give that same talk, although I have been giving it for at least one and a half years now, because I do find it very important that as a community we understand the implications of our work, and that it's not enough to delegate it to legal restrictions or things like de-biasing data sets. We still need to take responsibility for the usage of this technology.

Now it's time for a segment called Statistical Pitfalls. I'm here with Michael Betancourt, applied statistician and one of the
core developers of the open source statistical modeling platform Stan. Great to have you on the show, Mike.

Thanks, Hugo, it's great to be here.

So you're here today to tell us about a common statistical trap that we all fall into, that of the tyranny of the mean, right?

Very much so. In fact, this is one of my favorite pitfalls. So many mistakes in statistics are made when the mean of a population is confused with a typical individual in that population, in particular when someone tries to use an average individual to characterize a typical individual.

And by average individual you mean an individual whose features are given by the average or mean characteristics of everyone in the population?

Right. So for example, the proxy individual who has the average height of everyone in the population, the average weight, the average distance from arm to shoulder, or whatever feature we're taking into account.

So why is this such a bad idea?

Well, we already know that the mean doesn't tell us anything about the variation in the population, but we're not trying to characterize the entire population here, right? We're just trying to represent a typical individual in the population. Unfortunately, if we consider more than a few features, then almost no individuals look anything like the mean.

Really?

Yes, it's an extremely counterintuitive phenomenon, so let's consider what happens as we look at more and more features. We'll start by considering a single feature, say the height of the individuals in the population. Now, if we went out and sampled random individuals in the population, then most of the samples would have heights near the mean height of the population.

So the mean is pretty typical in one dimension, then?

Right, but that quickly falls apart as I start looking at more and more features. For example, what happens if we consider both the height and the weight of individuals in our population? How many individuals have both average height and average weight? Well, there have to be fewer than before, because now
each individual has two ways in which they can deviate away from the mean. As we consider more features, we get more and more ways in which the individuals can deviate away from the mean, and the probability that an individual doesn't deviate in at least one of those ways quickly falls to zero. In other words, the neighborhood around the mean quickly depopulates as we consider more and more features. Even if we're considering as few as five features, almost no individual in the population looks anything like the mean. That average individual is completely atypical of the population.

So why is this relevant to practicing data scientists and statisticians?

Well, this behavior is immediately important when we're trying to design products or processes that ought to be suitable for most of the population. For example, let's consider designing a bike helmet. If the dimensions of the helmet are based on tight tolerances around the average head, the resulting product will be uncomfortable or unsafe for nearly everyone in the population. If we want a helmet that will be effective for a substantial percentage of the population, then we can't design for a single individual. The helmets have to be adjustable, or at least big enough for the biggest heads.

Interesting.

But this is just one example. Mathematically this behavior is a manifestation of a phenomenon called concentration of measure, and it arises any time you use a probability distribution to characterize a high-dimensional space. This could be, for example, a distribution that characterizes the features in a population, as we talked about before; it could be a distribution that characterizes the variability of data in a measurement process; or it could even be the distribution of parameters in Bayesian inference. Any time we have a probability distribution on a space of more than a few dimensions, the neighborhoods of high probability will be very far away from the mean. We have to be constantly vigilant of the tyranny of the mean.

Mike, thanks for that
introduction to the subtle but ever-present statistical pitfall known as the tyranny of the mean.

My pleasure, Hugo. Be careful in those high-dimensional spaces, everyone.

Time to get straight back into our chat with Claudia.

And so in terms of responsibility, what is the role of data scientists in thinking about data ethics, particularly in a world where we're reaching a point where advertisers may know us better than ourselves?

I don't think we need to worry about that specifically.

Okay. So I just think, in a case like this, there's the anecdotal example. We discussed cars before: if someone has displayed interest in sports cars, maybe you advertise flashy cars to them. But what if they display an interest in sports cars and your algorithm knows that they may be in debt, or that they also have a history of alcohol abuse, these types of things? What type of ethical considerations need to be in place to help in this type of situation?

So first off, I think it is important to have an honest and open conversation about it. What I have perceived is that you basically have two different groups: you have people who do data science for a living, and very rightly concerned citizens, often with insufficient depth of understanding of what is even controllable or can be known about these algorithms. In some sense I am the best police for data science, because I'm the one closest to building these models and observing these things. A lot of the examples I talk about, whether it's [unclear] or clicks and bots, happen behind the scenes, and it's really my curiosity and diligence that finds those things. I would like us to have a more open discourse about what we expect from this technology and what [unclear] we want to put into place at the right level. What I want to talk about here is not exactly the direction that you are going with some of these abuse cases, but more so when we are looking at failures of machine
learning and AI: when there is an accident involving a self-driving car, when we have mislabeled pictures showing up that could possibly be offensive. What is the right expectation? My sense is that society feels that this technology has to be perfect, and I think this is where the disconnect in the conversation is. When you are doing this for a living, you understand that ultimately these systems can be a lot better and can do a lot of good, for instance diagnosing rare diseases that the doctor you happen to go to in some rural area has never encountered before. But will that system be perfect? Almost surely not. So the answer to how we as a society engage with that, in my opinion, has to be one of realistic expectations and a sense of collaboration between machine and human, with a shared responsibility for the action that we ultimately choose to take based on the recommendations that we get. What if I do observe specific cases that hint, for instance, that some people I observe in the advertising environment are suicidal? Is there something I should do? Do I need to point out to the brand that this might be a group of constituents that they somehow have responsibility for? I'm not sure, but I feel I would at least want to be able to speak up without being pushed into the corner of violating privacy, because privacy doesn't make these things go away.

Exactly. So you're speaking to openness and transparency, which I think is incredibly important. I also think it's heartening, and it helps, that there are people such as yourself who on one side work as data scientists in such businesses but are also communicators and explainers, and who take that duty upon themselves to go out and speak about these types of issues in public fora, which is very welcome and necessary, I think.

Thank you, I appreciate that.

We've discussed a lot about the modern data science landscape. What does the future of data science look like to you?

So on
the one hand, I think the appreciation for all the upside potential that data has will continue, so I don't think this is a fluke. I'm really excited that, even though big data as a hype term is coming to its end, there is an increased sensitivity among us as a society, and also among institutions and firms, that decisions should be more data-based or data-driven. I think that's very important. I think it's also very important, as these systems exist, that we as a society become more data literate, because recommender systems are not going to go away, and we need to understand that we are living in these kinds of filter bubbles or echo chambers that were mentioned before.

What does it mean for data science itself? First off, I'm not worried about us automating ourselves away. I mean, we are automating ourselves all the time, but I think the demand for human skill and supervision of data science systems will only rise. Technology really cannot make up for good human intuition and the crucial role it can play in exactly these concerns we have about some of the things that go wrong, what we can trust the machinery with, and how we should interact with it. The tooling is incredibly elegant today if you compare it to twenty years ago. I think we will see more of that, with tooling being broadly available through either the cloud providers or many other open access tools.

I do believe that the current excitement about deep learning will come to a realization that it's not the answer to everything. Deep learning is very good for very specific types of problems, and they are really around areas that have a lot of signal. We're talking about vision, where you have very clear rules of the physical world that can be exploited. We're talking about language: we will still get better at translation and automatic conversion of audio to text. And obviously we've seen this in reinforcement learning, which is where all of these games, Go and so on,
come from. But there will be a lot of space for good old solid statistics, just on bigger data and with simple models. So I'm quite optimistic for the field, with the understanding that these different tools will find their different places.

And speaking of solid statistics, what's one of your favorite techniques or methodologies for data science? Not necessarily a favorite, just something you enjoy implementing or doing.

I'm very old-fashioned in the sense that I don't trust myself looking at graphs. Graphs are great if I want to tell stories. So if I want to tell the story about people fumbling in the dark, it's very nice to illuminate these things with information about click rates, but I really like to look at data almost running over my screen. But this is probably just me being really weird. That's okay, that wasn't what you were asking. I have somewhat ironically taken almost the opposite development to the field. I started out doing artificial neural networks in '95, and then I downgraded, if you want, to decision trees in 2004 for my dissertation. Today I really value the simplicity, elegance, and transparency that you can get from linear models like logistic regression, or even just simple indexing that you would probably refer to as a form of naive Bayes, because it's so much easier to look under the hood and understand what might be going on in there. That really has become my go-to tool over the last, I would say, 10 to 15 years. In fact, I won all of my data mining competitions using some form of a logistic model.

Firstly, I love the idea of you just watching data stream across multiple screens. Secondly, I think your passion for interpretable models, for decision trees and linear models where you can actually communicate what certain things in these models mean, also speaks to what we were discussing before, your role as a communicator. So you can take what one of these models outputs and speak to a data science manager or
someone in HR, or whoever it is, or someone in the advertising space who isn't technical, about the results of these models, right?

So this is exactly, I think, why I gravitate towards it. Initially I was geeking out over the fancy stuff, and I've come to realize that if you want to impact the world, it doesn't matter what you find exciting. What matters is what you can get other people excited about. So depending on the sophistication level, it's often a really great idea to have the worst possible model, and that's the nearest neighbor. Nearest neighbor is awful. It almost never has really good performance compared to some of the more sophisticated models, because it doesn't really learn anything. You just kind of find something that's similar, but it's very difficult to know what similar means. But here's one huge advantage: that's exactly how people think. That's the reason why in advertising we talk about look-alike models. Now, what we build are not look-alike models, but for people to understand, "Yeah, we find other consumers who look like your consumers," that makes sense. They can relate, they can embrace the technology and start giving it at least a try. And then after a couple of iterations you can swap out that awful nearest neighbor and give them a really good predictive model, and they will be very happy moving forward.

And I suppose it's about establishing trust as well, in the sense that you may have a model that performs better, but if nobody has any idea what it's doing, I don't know why they should have faith in it or trust it.

So trust is definitely very, very important here. And the other part is simply to get them involved, because that's what you can do with nearest neighbors. You can say, "Yeah, here are the five most similar other cases," and then the person can say, "Nah, that one doesn't count because that was completely different." So you say, "Okay, let's delete it and work with the other four." So it has this nice communication where they feel that they have become part
of something. At least that was the case in one of the projects at IBM. Trust was a component, but it was also that they felt taken seriously and part of the process. And we learned when our models actually had no data and we would have to build something entirely different, so those were cases where the customer just knew that this was not appropriate.

Fantastic. So my final question is: do you have a final call to action for our listeners, who are aspiring and working data scientists alike?

My sense is, number one, just keep your curiosity and your skepticism. Have fun with what you do, but don't take yourself too seriously, and definitely not your model. Have some appreciation for finding out why something went wrong; that's much more fun and interesting than finding out that something went right. So as a philosophy moving forward, be cautious with the things that you build, and I think that plays into being responsible when you hand them over and being clear about where you think the limitations are. But first and foremost, just keep your excitement for it, because that will keep you sharp and able to identify these things.

Claudia, thank you so much for coming on the show. This has been such a great pleasure chatting with you.

Thanks for joining our conversation with Claudia Perlich about the evolving role of data science in the online advertising world. We discovered a lot about the predictability of humans, but also that our algorithms will often pick out the targets that are easiest to describe, such as online bots. We also saw the importance of an ongoing and increasing dialogue between data scientists and the population at large, in a world that is becoming increasingly defined by the data we all produce. Make sure to check out our next episode, a conversation with Ben Skrainka, a data scientist at Convoy, a company dedicated to revolutionizing the North American trucking industry with data science.

In this episode of DataFramed, a DataCamp podcast, I'll be speaking with
Claudia Perlich, Chief Scientist at Dstillery, a role in which she designs, develops, analyzes, and optimizes the machine learning algorithms that drive digital advertising. We'll be discussing the role of data science in the online advertising world, the predictability of humans, how her team builds real-time bidding algorithms and detects bots online, along with the ethical implications of all of these evolving concepts. I'm Hugo Bowne-Anderson, a data scientist at DataCamp, and this is DataFramed.

Welcome to DataFramed, a weekly DataCamp podcast exploring what data science looks like on the ground for working data scientists and what problems data science can solve. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne, and you can also follow DataCamp at @DataCamp. You can find all of our episodes and show notes at datacamp.com/community/podcast.

Hi Claudia, and welcome to DataFramed.

Well, thank you so much for having me.

It's such a pleasure to have you on the show, and I'm really excited today to be talking about how data science and machine learning are shaping and reshaping digital advertising. But before we get there, I'd like to find out a bit about you.

Okay.

What are you known for in the data science community?

So I have a couple of different hobbies, even within data science. I think what most people may know me for, and that's almost where my, quote, fame started, is that I used to participate in a lot of data mining competitions. You may recall the Netflix Prize, where you could make a million dollars if you were substantially better at recommending movies than their existing algorithm. Much earlier than that, there had been, in the geek world of machine learning, competitions where people try to build the most accurate model on a data set provided by the organizer. I had been participating in those for quite a while, and then I won three in a row between 2007 and 2009: one on breast cancer prediction, the other
one was on telecommunications, and also one on the Netflix data set, although we didn't get a million dollars on that one. So that's a little bit of my initial claim to fame, and it really helped with being perceived as a kind of hardcore part of the machine learning community.

Great. And so I suppose one of the most famous platforms where this can happen now is Kaggle.

Yes, so Kaggle is basically the next generation, when it became much more mainstream. As machine learning and big data picked up, they provided a very nice interface where now not just the organizers of this conference, but in general nonprofit organizations and companies, all have a very easy way of interfacing with a huge community of thousands of people who have fun building these models.

So this speaks to some of how you got into data science, but I want to probe a bit more into what there is in your background that led you to data science. What type of skills did you develop, what jobs did you have, or what did you study that led you to data science?

Right. So I grew up in East Germany, and other than knowing that I was good at math, I didn't really have any convictions and no clear idea what I wanted to do with my life. So my dad took me aside and said, "Look, they will need computers everywhere, so why don't you study computer science? You're good at math and you should be fine with that." That was really how I picked computer science as my first choice for my undergrad, and his words were even more prophetic if you think about what has happened to data science, now being really extremely hot in all kinds of different application areas. I migrated into data science in '95 as an exchange student at CU Boulder, when I took my first class on artificial intelligence and artificial neural networks, and I just loved the fact that you could learn so much about the world and all of these different fields by looking at it through the lens of data, aside from these very measurable challenges of building the
best possible models. So that just appealed to me from the first time I basically dipped my toe into data.

And I think this is actually a nice lens through which to view the emergence of data science from your background, right? Because you were good at math, you studied computer science, and then you moved into model building and predictive analytics, and these are, I suppose, three of the things which people associate most with whatever data science is these days.

The challenge with data science, and I would argue even today with artificial intelligence, is that it really depends on the background of the person you're talking to. So I think overall you are right with your characterization. Generally there is a sense that you also want statistics as well as some domain knowledge in the mix here. I personally have found that the fact that for my PhD I moved from computer science into a business school here at NYU really shaped my focus away from the purely algorithmic, towards being much more interested in what kind of problems you can solve with these tools and algorithms. I think this is really where the birth of data science in its broad application originates: it's in using data in combination with algorithms to solve some very specific problems.

And so in that way it's actually domain-centric and question-centric?

It should definitely start with a good question. You can't just jump into a data set, dig around there, and hope to find gold, as it is sometimes portrayed. You do need to first really understand what the things are that you want to do. Is there any decision you can make? Because if there's nothing you can do, then you don't need to waste your time looking around in data. So I really like to start with the constraints of the problem, then being led down the path towards what data is most appropriate and what algorithm can help me solve it.

This is great, because I really wanted to talk about your work in digital advertising and your work as chief
scientist at Dstillery. So perhaps you can speak to the types of questions that you use data science to solve in that work.

So before we get into this: the story of my life really progresses from getting a PhD in information systems in a business school, to deciding that rather than taking an academic career path I really wanted to focus on these applications, which brought me to IBM Watson, where I stayed for six years. Then I was lured into the world of digital advertising, really by almost the promise of a golden land. It's like a big sandbox to play in. Digital advertising has an incredible data footprint where you can experiment and really push these algorithms to the limit. So it's a huge experimentation field where you can understand how well you can predict human behavior, which is typically somewhat limited but a lot better than random, and, in this conversation about correlation versus causation, when these methods can help you understand causality and when they don't. So for me the excitement really was in not just the sheer amount of data, but also the ability to try these things, put these algorithms to the test, and see how they perform in the real world.

Great. So can you give me the elevator pitch on digital advertising?

I don't think I can give you an elevator pitch on digital advertising. I can tell you what we do and what I've really been trying to do for the last eight years. You may be familiar with the rise of the programmatic advertising world. What that means is that ads are now being sold in real-time auctions. Every time you interact with your digital device, whether this is reading news stories on the web or using an app, you will be exposed to many different ads, but these ads really were bought in real time, as the page loaded, in one of these auctions. And so what Dstillery specifically started out doing initially was the promise of being able to pick extremely selectively the right person and moment by using predictive modeling
that informed the automated bidding when such a good opportunity showed up, and then adjusting bid prices for that. So this is the core promise of programmatic advertising, where you see everybody all day long and you can choose with very high precision when to interact with a customer.

Now, around this core problem there are a lot of interesting, quirky, fun other things that can keep a data scientist excited. We have, for instance, problems around fraud. The moment you have an open market where people can buy and sell, the same way you need reputation on eBay, all of a sudden even in that environment you run into scenarios where it is not really a real person who is sitting there and may or may not see the ad, but in fact a bot that was written for the sheer purpose of selling ads. That was a very interesting discovery back in 2012, where our models were really, really good at predicting certain outcomes. The thing is, usually people are not that predictable, and so we tried to understand what was going on. The performance was too good to be true, because bots really are deterministic; they're easy to predict. That was one example of these side problems that we were having.

On the other end of the spectrum, we're already spending all of this effort of machine learning and AI on huge data footprints just for bidding, but then the clients come back and say, "Well, this is really amazing, you have this great performance. What did you do?" And other than shrugging and saying, "Well, we built a predictive model in, I don't know, five hundred thousand dimensions," that wasn't the right answer, because they really wanted to understand what we may have found out about their potential future customers. And so increasingly now we are looking into translating back what this artificial intelligence found in these vast, different behavioral patterns, not just to let us choose the right ad but to answer more strategic questions about why do my customers actually
buy my brand? What's their perspective on the value proposition of the product? Some of this you'll find encoded in these models that are very good at predicting, and now increasingly we work with augmented reality to just give this information back to brands to help them understand what the customers are really doing.

Yeah, absolutely. So there we have a question posed in non-data-science terms. I mean, we have a job to do, to build algorithms that will predict whether people will click or not in order to make real-time bids in these real-time auctions, but then another non-data-science question emerges, which is: what are our customers doing there? So it's really a translation in both directions, which I think is incredibly interesting, and it really speaks to this idea that data science doesn't exist in a vacuum. As you stated initially, it responds to real-world questions, and we as data scientists also need to translate our results to non-data-science people, be they customers or managers or the people we're consulting for.

One of the most interesting quirks to this is, as you said, predicting people clicking. We actually learned the hard way that it's a really bad idea to use powerful machine learning to learn when people click on ads. The reason is that people occasionally click on ads because they're interested in the product, but much more often it's an accident, when you're trying to either close the ad or just change the window. But the fact that it's an accident doesn't mean that it's random, and these algorithms are good enough to find out how to predict the accidents. It turns out it's much easier to predict accidents, because they're typically contextual, or people simply aren't paying attention. So for instance we see very, very high click-through rates on the flashlight app, because there people fumble in the dark and there's a very good chance that you accidentally hit the ad. Other scenarios might be just that people with very kind of bad fine
motor skills or eyesight problems, who just tend to be more prone to clicking, which definitely doesn't mean that they're interested in the product. So the interaction between really smart technology and optimization metrics that were kind of OK for a very long time is something we now have to educate people about, saying this is not a good idea to combine these two things, because the technology is actually too powerful. We need to think a lot harder about what we're optimizing for, because we may end up doing the exact wrong thing.

And I suppose what's telling is that in all these cases the things that are easiest to predict are the things you don't want to predict, whether it be bots clicking or people using the flashlight app. I think I saw you give a talk once in which you had another very telling example, of wanting to advertise airport-related or travel-related stuff to people at airports. I'm going to get this wrong, so I'd love for you to tell me that example.

If I remember correctly, it's really a progression of what I just talked about. So if you recall me saying we have bots that go and visit websites, and then we have accidental clicks, what should we optimize towards? One thing I tried to do is say, well, nobody accidentally goes to a physical location. What we could try to do is predict whether people will go to a store or an interesting location. So I did two different experiments. One was predicting who will go to a car dealership, and that one actually was very successful. It was really interesting to see the market research, the different brands that prospective buyers look at before they then choose and go to that Mercedes car dealership. And since it worked so well, there is this other group that everybody always wants to reach, which is the frequent traveler, presumably because they have a lot of money and probably a bad conscience, so they need to bring gifts home. So it's always a great audience to reach, and where would you expect to find them if not in
airports? But it turns out that the people who are much easier to find in airports are all the people who work there, from the baggage handlers to the people at check-in. They spend their whole day on their digital devices, and because they're there every day, they have very typical patterns of behavior. So again, by not thinking very clearly about what else could explain what you're looking for, we found mostly employees of JFK rather than the elusive frequent traveler.

Now it's time for a segment called Tales from the Open Source. Today we're going to hear from Alan Nichol, co-founder and CTO at Rasa, who develop open source tools for conversational software. Hi, Alan.

Hey there, Hugo, thanks for having me on the show.

It's such a pleasure. What are you going to tell us about today?

I'd like to give a bit of background on a pair of open source tools for building conversational software, so that means chatbots, voice assistants, and anything else where you interact with the computer through natural language. Rasa NLU is the first part, and it says: okay, I've received a message, what is this person trying to say? And then Rasa Core is the second part, and it takes that information, looks at the history of the conversation, and decides what to do next. They're relatively new, but they're already used by thousands of developers.

That sounds really cool. How did this originate?

Back in 2015 we were building some bots on Slack, and we realized that the developer tools just weren't there to build really great conversational experiences. Then in 2016 Facebook announced that they were going to open up Messenger as a platform, which meant that tens of thousands of developers would be building chatbots using keyword matching and a thousand if statements, and we knew from experience that that just doesn't scale. So Rasa NLU and Core are built entirely on machine learning, and everything that your system understands and is able to do is learned from real conversations. We're trying to get people to move away
from thinking of conversations as flowcharts, because it just doesn't work for anything beyond the hello world.

And what's been the response from the developer community at large?

We were totally blown away when we first released Rasa NLU, and I think it was just the right time and the right place. It was a few months after the Messenger platform opened up, and everyone still seriously developing bots wanted to be independent of the language understanding APIs that the big tech companies were offering. We saw a huge amount of traffic right away, and we've seen over 60,000 downloads in the first year. We've learned so much from our users and contributors. Open source is really unique in that way: people get stuck, and then they care enough to really write a good issue, and lots of them then actually go back and contribute to the project, which is amazing. Rasa also builds on a bunch of amazing open source libraries like spaCy, scikit-learn, and Keras.

So what's next for Rasa?

Our take is that Rasa NLU and Core are meaningful steps in the right direction, but making natural language interfaces to computers is an infinitely hard problem, and we've definitely not solved it. So we do a lot of applied research, and sometimes we publish papers, but mostly we care about turning that research into code people can actually use. There are so many ways that natural language understanding and dialogue need to improve, and I'm really excited about the things that we've got on the roadmap for Rasa.

Thanks, Alan, for that dive into Rasa and open source conversational software. For those interested, check out rasa.ai, that's spelled R-A-S-A, in case you can't understand our silly accents. Thanks once again, Alan, real pleasure to chat.

Let's now jump back into our interview with Claudia Perlich.

You've hinted at this, but part of your job is to predict human behavior, and you've hinted that humans aren't so predictable. But how predictable are we?

Well, what I was really fascinated by, and this has nothing
to do with advertising: you may follow these competitions that AIs are now engaging in, starting with the chess games back in the 90s, and more recently we had Jeopardy with Watson, and then Go. But none of them really has that much to do with predicting human behavior, and more so with strategy. What is fascinating is that apparently now we have algorithms that can predict people when they try really, really hard to be not predictable. So apparently we finally have beaten the world's best poker players using algorithms that analyze faces, and apparently we give away a lot more than we think. But that's kind of a fun side story: a poker face may work with other people, but the machine can still see right through it.

But in our daily activity, I think a lot about us is very predictable, at least in my case. I mean, you can very quickly figure out what my daily habits are. The other thing that is very predictable is all kinds of consideration activities, things that require gathering information and take a couple of, I don't know, days, weeks, months to come to a conclusion. So for instance, buying a new refrigerator: for most people that's a serious thing, and you spend some time thinking about it, and you will have plenty of digital traces of that activity that help marketers identify, oh yeah, these people are in the market for that product. When you look at the more spur-of-the-moment activities, in those cases you might be able to identify that this person is prone to buying an apple on short notice while walking across the Union Square market, because the person likes apples. But can you predict that at this particular moment the person will feel like buying an apple and see one that they want? Probably not. So there is a huge range.

And you have probably heard about one of the studies done on Facebook data that was really concerning from a privacy perspective. This is not about predicting future behavior,
but what researchers showed is that there are a lot of parts of our personality and behaviors that we may not want to make public but that can be very easily inferred. So for instance, sexual orientation, political perspectives, all of these things are really easy to infer with a machine learning algorithm just given your day-to-day activity.

And so that isn't something necessarily that will help Dstillery do their job though, right? Or is it?

This is typically not what we're interested in, in the sense that we are looking to optimize very specific metrics set by the marketer, such as the number of new customers signing up for a service on their website, and for that I'm not really interested in any of these concerns about who you really are and things like sexual orientation. It's also perfectly anonymized, in the sense that I don't know any personally identifiable information about you, and even your browsing and digital activity can basically be hashed and obscured so it doesn't really mean anything. Now, this being said, I think in general, when you turn it around and go back to the client and they want to know, "But what did you find?", at that point the boundary of where I start infringing on the audience's privacy when I infer certain correlations with character traits becomes really interesting. One of the examples, for instance, even from our frequent traveler: we did find flight attendant sites, but we also found gay dating sites. Now, is this something that I should be seeing? Is it something I should have such easy access to? And often when I tell this story, people in the audience feel somewhat ill at ease about the fact that this is even something that can be that easily revealed. This is something relatively new in the technological landscape, so presumably even legislation hasn't caught up with these types of challenges yet at a societal level. So we are really struggling right now, finding new ways of possibly self
constraining how we interact with these technologies and where the line is.

What type of data do you have access to that helps you predict human behavior at Dstillery?

The data sources in digital advertising are really coming from many, many different places, and different players have very different access rights, so you see specialization. Obviously Facebook knows everything you do on Facebook and will provide versions of that data to their advertisers. We have access, through these real-time auctions, basically to every auction that happens, and we're talking about 100 billion events every day. So 100 billion times we are being told that this particular device is right now looking at this particular content. You have this constant stream that ultimately then gets assembled into an activity history of a very granular nature, like the URL of the news, for instance, that you read, alongside location information if the bid request came from your mobile device. So if you're just standing on the corner, you're bored, and you're playing, I don't know, Candy Crush 15, and there's an ad running, your phone just told me that you're standing there, unless you were very diligent about switching off the GPS. So one of the primary sources is actually the environment itself through which ads are being sold. In addition to that, you also have many data vendors who are providing additional information that they have collected, of a similarly granular form.

Interesting. So do you have access to, for example, how much time people spend on websites, even cursor activity, whether they have other apps open, any of this type of stuff?

So the details of your web activity typically remain behind the scenes. What you do with your cursor really requires integration in your browser, which is far beyond anything that is broadly available in the advertising environment. Now, sometimes the ad itself could have a technology that, for instance, tracks how long it is in view and
whether or not you went with the cursor over it. Something similar happens in the mobile space when you use your digital devices. What's called the SDK is basically almost the operating system; it's the fundamental software that is underlying most of the apps that people develop, and they themselves might collect data about what you do with the app, but also about other apps that are installed there. There's an ongoing attempt from, for instance, Apple to ensure that apps stick to the rules and only look at their own data and only share their own data, but there's a lot that comes directly from these deeply integrated parts of the software stack that is providing you apps.

We've talked about the masses of data that businesses such as Dstillery have with respect to how a lot of people are interacting with online environments, with the online world, which is taking up more and more of our daily lives. What are ways in which these masses of data can be used for social research?

So there have been really great pieces of social research recently, and I think especially around the rise of fake news and how people interact with information and propagate information, which lead ultimately to these informational bubbles that are being enhanced by the AI itself, which is trying to predict what you may want to read about. There are a number of researchers doing incredibly important work, because it goes beyond just understanding the social nature of our modern generations. It really comes to the fundamental question of how we now progress with democracy moving forward if we no longer have even a remote hope for objective information when things are shared algorithmically. So there is some of the work Gilad Lotan has done. I also really recommend the work that's done at the Data & Society group here out of New York City, and then you also have various of these pieces coming from Google and Microsoft. So you see a much increased need for this
understanding, and now we also have much more access to what people actually do and to understanding how these processes work.

So consumers have a relatively complex relationship with advertising. What value can data science in the online advertising space add to the consumer's experience?

My overarching sense is that by helping valuable content to monetize, we are part of the ecosystem that allows publishers to be ultimately somewhat independent. With the decline of subscriptions, many publishers, and even blogs where everybody can express themselves, have to rely on advertising as a primary source of income. I'm not necessarily convinced that we are truly providing much-needed information. I think having a possibly less disruptive experience, and having advertising be part of the fabric, so that it fits both my interests and the topic of the site where it's being displayed, is an acceptable compromise. What I'm seeing with a lot of concern is that advertising has increasingly focused on viewability as a metric, meaning for instance that advertisers only want to pay for viewable ads. This initially makes sense from the perspective of the advertiser, but it then puts the publisher in a really difficult position, because as a result you have these absolutely terrible experiences as the user, where the ad is kind of following you around on the page and there's no way to get rid of it. And then you have, of course, high click-through rates, because every time you try to close it something happens. Eventually I think that really only fosters the installation of ad blockers, which now becomes a bigger concern to publishers that are trying to provide independent and free content to readers, if large groups of readers install ad blockers.

That's a great point, because I think there is a certain balancing act that we're talking about, whereby as a publisher you want to get your stuff out there as much as possible, but you don't necessarily want to spam people, at least to the point
of annoying them enough for them to take that type of action.

Yes.

So we're kind of going down the path of discussing a few ethical implications of your work and of data science in general, and I'd like to go down this path a bit further. You hinted at the idea of biases in data and algorithmic bias, and I was wondering if you could speak some more about the challenges involved in these areas today.

So with the vast deployment of automated systems, there have been an increased number of concerns on the ethical side about the implications that these algorithms may have. The simplest or earliest one I think you could refer to is the information bubble, where the algorithm isn't necessarily biased, it's just really good at figuring out what you'd like to hear. As a result, when it comes to more important things like political information, if you only hear the side of the story that you like to hear, not only does it reinforce your position, but it gives you the delusion of being absolutely right and certain about it, and you no longer have to question yourself or seek dialogue with other opinions. So I think this is one of the early concerns, which has nothing to do with the technology even being biased, but with the way our brain processes information in the interaction with a kind of preselection that appeases us very much.

And this is even known now in popular culture as the echo chamber, right?

Exactly, this is another term that we have for it now. The next generation of concerns that were brought forth were with respect to uses in areas such as predictive policing, or even something as simple as recommendations on various job sites, where the concern is that, for better or worse, our society has certain biases. Our behavior is not up to the overall standard that we want it to be, and as a result, if you now train models on behavioral data where, for instance, you have never hired a woman for this position, you therefore have no data of a woman ever being
successful, and therefore none of the candidates that we'll find will be female. So the concern is that we could somewhat accidentally propagate or potentially even increase biases that existed in the data used to build a model, which now behaves exactly as we used to and not necessarily as we would want it to.

So this is an example, as you state, of algorithms encoding already existing societal or human biases, which is something we need to be very cognizant of moving forward. I know something else you're interested in, though, is the ability of algorithms to create their own biases, which may not even exist in the data, so I'd love to hear your thoughts on that.

So in my experience, and we touched on this earlier when we talked about bots and clicks and even the people working in the airports, one of the things I understood is that ultimately, when you build a predictive model, it's doing exactly this: it's going to find the easiest thing to explain, wherever it finds the most signal or the most information. That becomes a problem when different groups of your population have more or less signal, more or less information. The example that I like to bring forth for people to consider: if, for instance, a group of people has a consistently lower usage of technology, and as such I have fewer data points about those people, I would be much less likely to target them, either with advertising or with a job offer, simply because the model can never quite be sure that this is the right choice to make, and there are other, easier things to predict. So if you look for instance at jobs, if for some reason it is easier to predict success for one gender than the other, although both are equally likely to succeed, what happens if you simply use your algorithm to rank candidates? You can easily see very strong majorities of the same gender in the top ten candidates presented, although originally 50% of the people who succeeded in that pool were actually male. So there
was a balanced male/female representation in the data. That's the concern that I'm having, where a lot of the conversations today are around making sure that your training set is unbiased. It is not enough to ensure that your training set is what I call first-order unbiased, meaning you have the exact representation that you want. You still have to take responsibility for taking action on the predictions, because the predictions can be biased again, and then you, as the user or as the platform, have to make a choice to present an equalized outcome, to pick the top N candidates for both genders, for instance.

And you're aware of this and clearly trying to do these types of things in your work, but do you think enough people are aware of this? And if not, is educating them part of our job?

So I had an interesting experience. I went and gave a keynote at Predictive Analytics World this fall in New York City, and I spoke exactly about this. After my presentation the general chair walked up and asked the audience, "How many of you knew that this happened?" I think intuitively most data scientists are kind of aware of it, but in this audience I would say, out of maybe 200 people, we had 10 or 15 hands going up, and the rest of them may have had an inkling but possibly had not fully thought it all the way through to the implication of what that means, and, the even bigger challenge, what to do about it now. So I still like to give that same talk, although I have been giving it for at least one and a half years now, because I do find it very important that as a community we understand the implications of our work, and that it's not enough to delegate it to legal restrictions or things like de-biasing datasets. We still need to take responsibility for the usage of this technology.

Now it's time for a segment called Statistical Pitfalls. I'm here with Michael Betancourt, applied statistician and one of the core developers of the open source statistical modeling platform Stan. Great to have you on the show,
Mike.

Thanks, Hugo, it's great to be here.

So you're here today to tell us about a common statistical trap that we all fall into, that of the tyranny of the mean, right?

Very much so. In fact, this is one of my favorite pitfalls. So many mistakes in statistics are made when the mean of a population is confused with a typical individual in that population, in particular when someone tries to use an average individual to characterize a typical individual.

And by average individual you mean an individual whose features are given by the average or mean characteristics of everyone in the population?

Right, so for example, the proxy individual who has the average height of everyone in the population, the average weight, the average distance from arm to shoulder, or whatever feature we're taking into account.

So why is this such a bad idea?

Well, we already know that the mean doesn't tell us anything about the variation in the population, but we're not trying to characterize the entire population here, right? We're just trying to represent a typical individual in the population. Unfortunately, if we consider more than a few features, then almost no individuals look anything like the mean.

Really?

Yes, it's an extremely counterintuitive phenomenon. So let's consider what happens as we look at more and more features. We'll start by considering a single feature, say the height of the individuals in the population. Now, if we went out and sampled random individuals in the population, then most of the samples would have heights near the mean height of the population.

So the mean is pretty typical in one dimension, then?

Right, but that quickly falls apart as I start looking at more and more features. For example, what happens if we consider both the height and the weight of individuals in our population? How many individuals have both average height and average weight? Well, there have to be fewer than before, because now each individual has two ways in which they can deviate away from the mean. And as we consider more
features, we get more and more ways in which the individuals can deviate away from the mean, and the probability that an individual doesn't vary in at least one of those ways quickly falls to zero. In other words, the neighborhood around the mean quickly depopulates as we consider more and more features. Even if we're considering as few as five features, almost no individual in the population looks anything like the mean. That average individual is completely atypical of the population.

So why is this relevant to practicing data scientists and statisticians?

Well, this behavior is immediately important when we're trying to design products or processes that ought to be suitable for most of the population. For example, let's consider designing a bike helmet. If the dimensions of the helmet are based on tight tolerances around the average head, the resulting product will be uncomfortable or unsafe for nearly everyone in the population. If we want a helmet that'll be effective for a substantial percentage of the population, then we can't design for a single individual. The helmets have to be adjustable, or at least big enough for the biggest heads.

Interesting.

But this is just one example. Mathematically, this behavior is a manifestation of a phenomenon called concentration of measure, and this arises any time you use a probability distribution to characterize a high-dimensional space. This could be, for example, a distribution that characterizes the features in the population, like we talked about before. It could be a distribution that characterizes the variability of data in a measurement process, or it can even be the distribution of parameters in Bayesian inference. Any time we have a probability distribution on a space with more than a few dimensions, the neighborhoods of high probability will be very far away from the mean. We have to be constantly vigilant of the tyranny of the mean.

Mike, thanks for that introduction to the subtle but ever-present statistical pitfall known as the tyranny of the mean.

My
pleasure, Hugo. Be careful in those high-dimensional spaces, everyone.

Time to get straight back into our chat with Claudia.

And so in terms of responsibility, what is the role of data scientists in thinking about data ethics, particularly in a world where we're reaching a point where, I mean, advertisers may know us better than ourselves?

I don't think we need to worry about that specifically.

Okay. So I just think, I mean, there's the anecdotal example: we discussed cars before. If someone has displayed interest in sports cars, maybe you advertise flashy cars to them. But what if they display an interest in sports cars and your algorithm knows that they may be in debt, and they also have a history of alcohol abuse, these types of things? What type of ethical considerations need to be in place to help in this type of situation?

So first off, I think it is important to have an honest and open conversation about it. What I have perceived is that you basically have two different groups: you have people who do data science for a living, and very rightly concerned citizens, often with insufficient depth of understanding of what is even controllable or can be known about these algorithms. In some sense I am the best police for data science, because I'm the one closest to building these models and observing these things. A lot of the examples I talk about, whether this is the case of JFK or even clicks and bots, a lot of this happens behind the scenes, and it's really my curiosity and diligence that finds those things. I would like us to have a more open discourse about what we expect from this technology and what we want to put into place at the right level. What I want to talk about here is not exactly the direction that you are going with some of these abuse cases, but more so when we are looking at failures of machine learning and AI: when there is an accident with a self-driving car, when we have mislabeled pictures
showing up that could possibly be offensive, what is the right expectation? My sense is that society feels that this technology has to be perfect, and I think this is where the disconnect in the conversation is, because when you are doing this for a living, you do understand that ultimately these systems can be a lot better and can do a lot of good, for instance diagnosing rare diseases that the doctor you happen to go to in some rural area has never encountered before. But will that system be perfect? Almost surely not. So the answer to how we as a society engage with that, in my opinion, has to be one of realistic expectations and a sense of collaboration between machine and human, with a shared responsibility for the action that we ultimately choose to take based on the recommendations that we get. What if I do observe specific cases that hint, for instance, that some people I observe in the advertising environment are suicidal? Is there something I should do? Do I need to point out to the brand that this might be a group of constituents that they somehow have responsibility for? I'm not sure, but I feel I would at least want to be able to speak up without being pushed into the corner of violating privacy, because privacy doesn't make these things go away.

Exactly. So you're speaking to, as you said, openness and transparency, which I think is incredibly important. I also think it's heartening, and helps, that there are people such as yourself who are on one side working as data scientists in such businesses but are also communicators and explainers, and take that duty upon themselves to go out and speak about these types of issues in public fora, which is very welcome and necessary, I think.

Thank you, I appreciate that.

We've discussed a lot about the modern data science landscape. What does the future of data science look like to you?

So on the one hand, I think the appreciation for really all the upside potential that data has
will continue, so I don't think this is a fluke. I'm really excited about the fact that, even though big data as a hype term is coming to its end, there is an increased sensitivity that we as a society, but also institutions and firms, have that they should be more data-based or data-driven in their decisions. I think that's very important. I think it's also very important, as these systems exist, that we as a society become more data literate, because recommender systems are not going to go away, and we need to understand that we are living in these filter bubbles or echo chambers that were mentioned before.

What does it mean for data science itself? First off, I'm not worried about us automating ourselves. I mean, we are automating ourselves all the time, but I think the demand for human skill and supervision of data science systems will only rise, and technology really cannot make up for good human intuition and the crucial role it can play in exactly these concerns we have about some of the things that go wrong, about what we can trust the machinery with, and about how we should interact with it. The tooling is incredibly elegant today if you compare it to twenty years ago. I think we will see more of that, with tooling really being broadly available through either the cloud providers or many other open access tools. I do believe that the current excitement about deep learning will come to a realization that it's not the answer to every problem. Deep learning is very good for very specific types of problems, and they are really in areas that have a lot of signal. So we're talking about vision, where you have very clear rules of the physical world that can be exploited. We're talking about language: we will get still better at translation and automatic conversion of audio to text. And obviously we've seen this in reinforcement learning, which is where all of these games, Go and so on, come from. But there will be a lot of space for good old solid statistics, just on bigger data
and simple models. So I'm quite optimistic for the field, with the understanding that these different tools will find their different places.

And speaking of solid statistics, what's one of your favorite techniques or methodologies for data science? Not necessarily a favorite, just something you enjoy implementing or doing.

I'm very old-fashioned in the sense that I don't trust myself looking at graphs. Graphs are great if I want to tell stories. So if I want to tell the story about people fumbling in the dark, it's very nice to illuminate these things with information about click rates. But I really like to look at data almost running over my screen. But this is probably just me being really weird.

That's okay.

That wasn't what you were asking. I have somewhat ironically taken almost the opposite development to the field. I started out doing artificial neural networks in '95, and then I downgraded, if you want, to decision trees in 2004 for my dissertation, and today I really value the simplicity and elegance and also transparency that you can get from linear models like logistic regression, or even just simple indexing, which you would probably refer to as a form of naive Bayes, because it's so much easier to look under the hood and understand what might be going on in there. That really has become my go-to tool over the last, I would say, 10 to 15 years. In fact, I won all of my data mining competitions using some form of a logistic model.

Firstly, I love the idea of you just watching data stream across multiple screens. Secondly, I think your passion for interpretable models, for decision trees, for linear models, where you can actually communicate what certain things mean in these models, also speaks to what we were discussing before, your role as a communicator. So you can take the output of one of these models and speak to a data science manager or someone in HR or whatever it is, or someone in the advertising space who isn't technical, about the results
of these models.

Right, so this is exactly, I think, why I gravitate towards it, because initially I was geeking out over the fancy stuff, and I've come to realize that if you want to impact the world, it doesn't matter what you find exciting. What matters is what you can get other people excited about. And so, depending on the sophistication level, it's often a really great idea to have the worst possible model, and that's the nearest neighbor. Nearest neighbor is awful. It almost never has really good performance compared to some of the more sophisticated models, because it doesn't really learn anything. You just kind of find something that's similar, but it's very difficult to know what "similar" means. But here's one huge advantage: that's exactly how people think. That's the reason why in advertising we talk about look-alike models. Now, what we build is not look-alike models, but people understand: yeah, we find other consumers who look like your consumers. That makes sense, and they can relate, and they can embrace the technology and start giving it at least a try. And then after a couple of iterations you can swap out that awful nearest neighbor and give them a really good predictive model, and they will be very happy moving forward.

And I suppose it's about establishing trust as well, in the sense that you may have a model that performs better, but if nobody has any idea what it's doing, I don't know why they should have faith in it or trust it.

So trust is definitely very, very important here. And the other part is simply to get them involved, because that's what you can do with nearest neighbors. You can say, "Here are the five most similar other cases," and then the person can say, "Nah, that one doesn't count, because that was completely different." So you say, "Okay, let's delete it and just work with the four." So it has this nice communication where they feel that they have become part of something. And at least that was the case in one of the projects at IBM: trust was a component, but it
was also that they felt taken seriously and part of the process. And we learned when our models actually had no data and we would have to build something entirely different, so those were cases where the customer just knew that this was not appropriate.

Fantastic. So my final question is: do you have a final call to action for our listeners, who are aspiring and working data scientists alike?

My sense is, number one, just keep your curiosity and your skepticism. I mean, have fun with what you do, but don't take yourself too seriously, and definitely not your model. So have some appreciation when you find out why something went wrong; that's much more fun and interesting than finding out that something went right. So, as a philosophy moving forward: be cautious with the things that you build, and I think that plays into being responsible when you hand them over and being clear about where you think the limitations are. But first and foremost, just keep your excitement for it, because that will keep you sharp in being able to identify these things.

Claudia, thank you so much for coming on the show. This has been such a great pleasure chatting with you.

Thanks for joining our conversation with Claudia Perlich about the evolving role of data science in the online advertising world. We discovered a lot about the predictability of humans, but also that our algorithms will often pick out the targets that are easiest to describe, such as online bots. We also saw the importance of an ongoing and increasing dialogue between data scientists and the population at large, in a world that is becoming increasingly defined by the data we all produce. Make sure to check out our next episode, a conversation with Ben Skrainka, a data scientist at Convoy, a company dedicated to revolutionizing the North American trucking industry with data science.
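
The nearest-neighbor workflow described in the conversation (show the five most similar cases, let the domain expert discard one, and continue with the remaining four) can be illustrated with a minimal sketch. This is not the model used at IBM or in any advertising system mentioned above. The function `nearest_cases`, the customer names, the two numeric features, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import math

def nearest_cases(query, cases, k=5):
    """Return the names of the k cases closest to `query` (Euclidean distance)."""
    ranked = sorted(cases.items(), key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical customers, each described by two made-up, already scaled features.
cases = {
    "A": (1.0, 1.0),
    "B": (1.1, 0.9),
    "C": (0.7, 1.2),
    "D": (1.5, 1.5),
    "E": (2.0, 0.5),
    "F": (5.0, 5.0),
}
query = (1.0, 1.1)

# Step 1: show the reviewer the five most similar cases.
shown = nearest_cases(query, cases, k=5)

# Step 2: the reviewer rejects one case as "completely different"
# (here, hypothetically, case "E"), and the work continues with the other four.
rejected = {"E"}
kept = [name for name in shown if name not in rejected]

print("shown:", shown)
print("kept:", kept)
```

The point of the sketch is the interaction rather than the accuracy: every prediction can be traced to a short list of concrete cases that a non-technical reviewer can inspect and veto.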