#63 The Past and Present of Data Science (with Sergey Fogelson)

The Challenges and Benefits of Data Democratization: A Conversation with Sergey Fogelson

Data democratization is a crucial aspect of making data science more accessible and impactful across organizations and industries. It involves providing access to data while also understanding its sources, structure, and limitations. In this article, we explore data democratization through a conversation with Sergey Fogelson, Vice President of Data Science and Modeling at ViacomCBS.

The Importance of Context in Data Democratization

Sergey emphasizes that appreciating how data was collected and stored is essential to understanding its value. For instance, he mentions that if sessions are stored separately from users, it is challenging to determine the number of unique users on a platform over the past month. This highlights the importance of context in data democratization. Senior executives who do not appreciate this may not understand the true implications of their decisions.
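To make the sessions-versus-users point concrete, here is a minimal Python sketch. It is an illustration only, not a description of any system discussed in the episode: the `sessions` records, the `device_to_user` mapping, and the `unique_users` helper are all hypothetical names and data. It assumes that sessions are logged per device and that the link from device to user lives in a separate table that may be incomplete.

```python
from datetime import date, timedelta

# Hypothetical session log: each record knows its device, but not its user.
sessions = [
    {"session_id": "s1", "device_id": "d1", "day": date(2021, 5, 3)},
    {"session_id": "s2", "device_id": "d2", "day": date(2021, 5, 10)},
    {"session_id": "s3", "device_id": "d1", "day": date(2021, 5, 20)},
    {"session_id": "s4", "device_id": "d3", "day": date(2021, 3, 1)},
    {"session_id": "s5", "device_id": "d4", "day": date(2021, 5, 25)},
]

# Hypothetical user table, stored separately: maps devices to known users.
# Device d4 has no entry, so its sessions cannot be attributed to anyone.
device_to_user = {"d1": "u1", "d2": "u1", "d3": "u2"}


def unique_users(sessions, device_to_user, as_of, window_days=30):
    """Count distinct users with a session in the window ending at `as_of`.

    Returns (known_user_count, unmapped_device_count). The second number
    shows how much of the answer depends on the missing device-to-user links.
    """
    start = as_of - timedelta(days=window_days)
    users, unmapped = set(), set()
    for s in sessions:
        if not (start < s["day"] <= as_of):
            continue  # session falls outside the reporting window
        user = device_to_user.get(s["device_id"])
        if user is None:
            unmapped.add(s["device_id"])
        else:
            users.add(user)
    return len(users), len(unmapped)


known, unmapped = unique_users(sessions, device_to_user, as_of=date(2021, 5, 31))
print(known, unmapped)
```

In this toy data, the month contains four sessions from three devices, yet only one known user plus one device that cannot be attributed at all. Someone counting sessions or devices without knowing how the data is stored would report a very different number.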

The Role of Humility in Data Science

Sergey stresses the significance of humility when it comes to data science and evangelizing for data science. He notes that most people have the best interests of their companies at heart but may not be aware of the complexities involved in data processing, cleaning, and storage. By practicing humility, individuals can recognize the need for growth and learning, which ultimately leads to a more meaningful life.

The Value of Contextual Understanding

Sergey believes that understanding how data was sourced and structured is crucial for effective evangelism. When people appreciate the complexity of data, they become more aware of potential gaps in the system. This awareness enables them to identify and address issues, making data democratization more successful. By providing context around data sourcing and structure, individuals can better understand its limitations and nuances.

The Formula for Data Democratization

Sergey emphasizes that democratizing data access and providing context around how data is sourced and structured are essential components of the formula for data democratization. This approach allows organizations to make informed decisions and unlock the full potential of their data.

Conclusion

In conclusion, Sergey's insights on data democratization provide a valuable perspective on the challenges and benefits of making data science more accessible and impactful across organizations and industries. By understanding the importance of context and humility, individuals can become more effective advocates for data democratization, leading to better decision-making and a more meaningful life.

A Call to Action: Practicing Humility in Data Science

Sergey's conversation with Adel highlights the importance of practicing humility in data science. He emphasizes that most people have their companies' best interests at heart but may not be aware of the complexities involved in data processing, cleaning, and storage. By recognizing this and practicing humility, individuals can develop a growth mindset, leading to a more meaningful life.

Learning is Key

Sergey's experience with learning new techniques, frameworks, and tools has been instrumental in making his work more impactful. He emphasizes that staying humble and committed to continuous learning are essential for success in data science and beyond. By embracing this approach, individuals can unlock their full potential and make a more meaningful contribution.

Extreme Gradient Boosting on DataCamp

Sergey recommends the Extreme Gradient Boosting course on DataCamp, which he teaches himself, as a resource for those looking to learn about data science. The course provides hands-on experience with gradient boosting, allowing individuals to develop their skills and become more effective advocates for data democratization.

Final Thoughts

Sergey's conversation with Adel offers valuable insights into data democratization. As this rapidly evolving field moves forward, prioritizing humility, continuous learning, and contextual understanding is essential to unlocking the full potential of data science.

"WEBVTTKind: captionsLanguage: enhello this is adele neme from datacamp and welcome to dataframed a podcast covering all things data and its impact on organizations across the world one thing we're looking forward to covering in more detail on the podcast is not only the latest insights on how data science is impacting organizations today but how the field has evolved and is evolving towards democratizing data science for all this is why i'm excited to have sergey fogleson on for today's episode sergey began his career as an academic at dartmouth college in hanover new hampshire where he researched the neural bases of visual category learning and obtained his phd in cognitive neuroscience after leaving academia sergey got into the rapidly growing startup scene in the new york city metro area where he has worked as a data scientist in digital advertising cyber security finance and media currently he's the vice president of data science and modeling at viacom cbs where he leads a team of data scientists and analysts that work on a variety of awesome use cases in this episode sergey and i discuss his background how data science has evolved since he got into the field the major challenges he thinks data teams and professionals face today his best practices gaining buy-in from business executives on data projects and his best practices when democratizing data science in the organization and more if you want to check out previous episodes of the podcast and show notes make sure to go to www.datacamp.com community podcast sergey i'm really excited to have you on the show i've been excited to have this chat on the state of data science your experiences leading data teams and democratizing data science but beforehand can you please give our listeners a background on how you got into data science sure would love to thank you for having me uh dell i'm really excited uh also to speak with you about all of this stuff so my academic background is in a i'm cognitive neuroscience 
so i got my graduate degree in cognitive neuroscience applying ml algorithms to functional neuro imaging data so basically what this means is put people into large scanners record their brain activity and then try to decode what's actually happening in their brains using machine learning algorithms uh and what i knew was probably about halfway through my phd i knew i didn't really want to stay in academia and i knew i wanted to work on interesting data related or data intensive problems and uh when i was at that point in my uh phd so this is around 2010 to i heard about this thing that that people were talking about called big data uh they didn't really have a term for data science at this at this moment in time and so i just knew that there was this field where you could you could sort of use you know still in its infancy but you could sort of use the same kinds of algorithms that i was using for neural imaging work but applied to uh real world data sets so data sets in in uh advertising in finance uh in quantitative analysis uh kind of all over the place and so basically i started looking into this stuff started reading about it and in my last year i really made a hard push to try to get into the industry and i wound up being able to kind of land a job in the world outside of academia and haven't really looked back since so what were some of the earlier data science projects that you worked on and how has that shaped your path leading data science to viacom yeah uh i've had i i would like to think that i've had pretty varied experiences but maybe not i think they're reasonably eclectic uh so i started my kind of the very beginning of my career i worked for a digital advertising startup and there the uh the big two problems i worked on one was a classification problem and i think it's still a pretty relevant problem i don't think this problem has really been solved yet and it's the idea of taking ip addresses and trying to understand what kind of a place that ip 
address represents so for example um is this an ip address associated with a home is this ip address associated with an airport or a starbucks or a you know some other business is it an educational uh ip address etc so there is some metadata associated with that information uh but it's not 100 accurate um so what you can do is you can take the signals that are coming out of that ip address uh to make probabilistic inferences about whether you think it's a home or not and that was really important for the work that we were doing because the way that we were building uh the main product that this company was selling is called a device graph basically it tells you whether any two devices belong to the same household or not being able to do that and being able to build those those those links across devices is really critical to understand whether something is a home or not whether that that device is living or is being seen within a home-based environment or not so that was kind of um one of the first projects i had i also worked a little bit on this on this graph building problem that i just briefly mentioned earlier so the idea here is again you're trying to figure out whether two distinct devices two two phones an iphone in a tablet or an android phone and a smart tv for example whether they uh belong to the same person or to the same household or not and again this relies on uh some uh some network analysis problems as well or techniques rather and then on um really thinking about how to be able to do this at very very large scales so back then this was again this was like 2014. 
uh there really weren't a lot of uh large-scale data analysis frameworks like spark was basically at like 0.1 or something i mean it was a completely new project so it was just very difficult to and interesting to tackle these kinds of problems that that involved working with data at scale so that was kind of my first my first foray into data science and then after that um i moved into uh cyber security so i worked at the cyber security startup for a little while and there the the most important problem that i tackled was really what we called hack predictions so the idea was hey given a company's cyber security footprint so uh you know the number of ip addresses that they have exposed to the public internet uh if you can snoop for example and see uh what kinds of uh software they're running on computers in on those ip addresses so on servers or on personal computers etc uh you can actually see if the software is all up to date we know that you know if that software is not up to date it can be hacked in various different ways but the idea is as you take all of these kinds of signals and then you assign a probability score to what the likelihood is that this company's going to get hacked within six months or within a year or within two or three years uh so we called that a hack prediction and then i moved and uh worked for a small data consultancy where we actually worked for a large investment bank and we worked on what's called an automated account reconciliation problem so this is not particularly attractive from like a data analysis of perspective but it's actually super critical from a back office perspective the idea here is you have two distinct accounting systems uh they occasionally do not line up with each other they need to be what's called reconciled and you need to basically assign a likelihood that they actually need to be manually reconciled or they can be uh basically dealt with by other downstream automated systems so this is almost like a health 
check uh that happens at one point in this massive reconciliation process that happens every day within within i would say every major investment bank in in in the world where you're you're trying to basically make sure that your books line up at the end of the day and this was something that had been done by thousands of people thousands at this when we first started this project there was over a thousand people that were actually hired explicitly to do this so to manually check all of these records and so we did was we basically took years and years worth of their manual checks and just put a machine learning algorithm on top of that right it was we built this ensemble model and you could say like look given this metadata associated with these trades what's the likelihood that they need to be reconciled manually or or basically surfaced up to a manual reconciler versus just pass it through the system anyway long story short using machine learning on past human performance actually worked surprisingly well we wound up being able to uh automate basically 90 of the reconciliation process in this way so only the most difficult to reconcile records wound up being actually validated by human beings which meant that those people that were hired to do this can now do other more meaningful more impactful stuff so i think that was that was an overall win yeah i think all of these projects that you've engaged in are super useful in the sense that it gives you this breadth of experience in this data space and this is one thing that i really want to pick your brains on is really reflecting on how data science has evolved over the past decade or so for example you mentioned that some of the problems you were working on spark was still in 0.1 0.2 there weren't really these mature data analysis frameworks i'd love to pick your thoughts on what are the some of the major changes that you've seen occur over the past decade and how do you really see it playing out within data 
science teams today yeah that's a i think that's a a really interesting thing to to talk about so when i first started again so i started in 2013 uh everything was on hadoop so basically my first work i was working in pig jobs then i was working in hive and then i was also working using a framework that came out of twitter called scalding so it was basically uh hadoop but using scala so writing scala jobs so hadoop basically only now exists from the way that i see it in the industry as a legacy system people are not going out and saying if i'm going to build a new state-of-the-art data architecture i'm going to use hadoop i don't know anybody that or any company that actually makes that a flashy thing that they describe uh now what would we have we have uh spark is basically completely built out and is almost completely taken over uh data science from like a data processing etl and even um ml kind of perspective so there's there's not really any need to think about about data from like a map reduced perspective uh and in fact like i haven't touched a mapreduce job in probably i want to say three and a half to four years or something uh and now so before you have most of your data in flat files right so again when i was working when i was working at uh uh the digital advertising startup uh we basically had you know petabytes of data in flat files that were either compressed csvs or back then what was this revolutionary new data format called parquet which now again is super standard across the industry but nowadays it's actually so cheap i mean it's still reasonably expensive if you have really really massive data sets but it's so cheap to put terabytes of data into structured data warehouses uh that you can actually query data sets that are on the scale of tens or maybe even hundreds of terabytes i don't i don't i haven't heard of anybody having petabyte scale data warehouses at least within my industry because we don't really have petabyte scale data yet but i 
assume that there's probably somebody in finance or in especially in web scale companies that are probably dealing with petabyte scale data warehouses where you can run basically structured sql queries that'll give you results within at most you know five minutes or something which is just completely unheard of uh eight years ago when i when i started so that's i think the first really big difference i think that that people have basically moved there's still people working with flat files especially with unstructured data right like you can't just put unstructured data into a relational database into a sql database so if you have text or you have images i think it's very hard to still work with those in a more structured environment you can definitely put the metadata associated with them into a database but i don't think you can actually put the raw the raw data itself into a database so there is a place still for unstructured data but for the most part if your data is tabular and structured it's going to be in a massive data warehouse it's not really going to be in a flat in flat files anymore unless it's completely unprocessed completely raw and you're just using putting it there for like legacy purposes or for for safekeeping purposes the next big thing that's that i think has happened and this is really the actual revolutionary thing i think the most revolutionary thing is really um but again it's not the sexy stuff it's the orchestration data pipelining frameworks for actually being able to automate data jobs on some periodic basis so we're talking about you know uh frameworks like luigi flame frameworks like airflow i mean there are obviously other ones um airflow is what i use now uh fairly regularly or folks on my team uh work or use fairly regularly but that whole idea of basically cron for data science again that didn't really mean there were people that were basically using cron uh in legacy you know uh uh linux or or unix-based frameworks for data 
pipelining for etl processes uh but they they really hadn't come into their own at that moment in time they were still in their infancy now basically everyone has some kind of a data uh pipelining framework that they use um on across some kinds of jobs within their within their um data teams right that i think is really really important that's really the stuff that allows you to increase the velocity of your data workflows right instead of having to figure out how to automate that stuff you can basically just build it once forget about it it's scheduled it's run there's alerting there's error reporting all of that stuff is kind of baked in you have a you know you have a front end all of that stuff it's just it's really really incredibly valuable okay so that's that's kind of the gut works or skunk works revolution that has happened in data science then on the more ml standard side of things there's the fact that we now have uh way more machine learning frameworks uh and the vast majority of common machine learning algorithms do not have to be written from scratch so again i'll i'll come back to my first gig working in digital advertising uh one thing we had to do was basically do uh connected components analysis so after you've built your graph you need to understand what the size is of all of your basically many connected clusters your households and the problem is is when you're dealing with a graph that contains hundreds of millions or billions of edges that was a non-trivial thing to actually be able to compute so i remember we actually had to write our own connected components algorithm on top of scala so it was like you wrote a scalding job that created this connected components graph and that took you know that was basically probably three months of work or something or doing that stuff now you take a network analysis package uh an off-the-shelf package and it's just baked in and it can handle graphs with hundreds of millions of edges and hundreds of 
millions of nodes without a problem uh other things so again when i was starting out i mean there was really only like one implementation of gradient boosting that i remember seeing and then all of a sudden basically everyone has a gradient boosting implementation and most uh gradient boosting frameworks now have gpu options and they're really really fast and they're really really robust i mean the point i'm trying to make here is that before you had machine learning algorithms that you basically had to implement from scratch or they were really really difficult to get them up and running at scale now it's not only that you don't have to implement them from scratch right i mean you have implementations in various languages they're very very fast there are robust communities across each of these ml frameworks and in many cases there are gpu versions of things so now they're way way faster than they used to be so again i mean you just have open source frameworks there's lots of them they're much much faster than they used to be and they're much more extensible than they used to be so i think there's been a crazy revolution there but again i don't know that that is nearly as important as i think the skunk works stuff that i talked about earlier the next things i think are visualization and explainability so uh when i started again there were very very limited explainability methods for non-linear algorithms so for linear algorithms common thing you can do right is you just look at regression coefficients and that gives you basically the full story when it comes to nonlinear methods it's much much harder to say what the actual impact of a given feature is on a specific prediction but that's changing right so now we have some very powerful explainability methods so i can think of shap or shapley value explanations we have lime which works for it's more of like a local based method for explanations for visual 
problems or problems in image recognition and these algorithms really have foundationally allowed uh practitioners to quickly see where signal is uh in your feature space and where it isn't right ultimately this really kind of it has significantly accelerated how quickly you can iterate on creating new features feature engineering right in ways that we really couldn't do in the kind of the non-linear algorithm space so again lime shap neither of these things existed when i started as an aside for that but i think something that is much bigger now at least within the past like i would say year to two years is this notion of feature stores so the idea here is that you can actually create basically databases or tables that contain the latest versions of the features that you're using for your machine learning algorithms and you can quickly update them you can quickly source them instead of having to recreate them in some process uh you can almost think of them as like yeah it's like tables that you regularly update with the latest versions of whatever features you're using to feed downstream models and what's cool about feature stores is that you can you know you can use them across multiple different uh different model spaces right if you have a new data scientist that comes in and you're like hey you're going to be working on this churn model you can immediately point them to a place where all of the latest greatest features exist for that churn model or from other models that have been built in the past and so this person can be very quickly brought up to speed with what seems to be working in the problem space that you're attempting to tackle right um that again feature stores were not something that existed when i started yeah thank you so much for this really interesting rundown of all of that perspective that you've seen evolved in data science over the years uh and i think really 
like a broad theme that you're talking about here is really the move from experimentation to operationalization and productionization right and definitely agree with you as well on the skunk revolution basically that you talked about about how orchestration platforms and really the move to centralized data warehousing at least for tabular data has really changed the game for data scientists and their ability to provide value quickly and to scale their work and i think this is a testament to how much the tooling stack for data scientists over the past years has evolved and i'd love your insights then on the flip side on where you think there is still room for improvement and where do you think that the data tooling stack is headed yeah so i i think we're really the the largest place for improvement isn't really on the side of you know greater better faster machine learning algorithms i think at this point i mean yes you can always have a better you know neural network model that captures 0.1 more performance for this specific task i think that's always going to be the case and i think there's always a place for that but i think it's the operationalization bit that still has the most legs to really grow and mature across all aspects of data science i mean i i talked about you know i talked about feature stores and these orchestration frameworks and pipelining frameworks but they're still i mean they're they they work but in in many respects they're still fairly brittle right the automated model performance monitoring isn't still really where i think it would be really great that it could get to right so another thing i think a place where there could still be some significant improvement or maybe i just i haven't found i haven't seen the right product but seamlessly updating uh different aspects of the ml pipeline right so if you think of like all aspects of the ml pipeline as as being as modular as possible so for example everything from data loading pre-processing 
feature generation or just the kind of the standard when you think of the jpeg of the data science uh typical data science pipeline all of those parts you treat them as if they're like individual completely isolated boxes that can just be popped in and popped out but they really can't i mean the current issues are that you know if you want to change uh or add a new data source to some to some to some process or tweak a data source for some process it's really not seamless it's not as simple as like hey point to this end point and then your model will just magically understand how this data is structured what needs to be done to convert it etc i mean you basically have to it's it's you basically have to touch every single box in that uh jpeg that you have for your typical data pipeline and change it in some specific way so it's really not completely modular in the way that you know we think of pure modularity and there's there are good things to that they're bad things to that but i think there's still some some uh some some improvements that can be done uh in that perspective and then in general more around this this modularity stuff but really more on the real-time side of things so right now uh you can you can get a lot of stuff done for batch machine learning models so what what i mean by that is you basically have a machine learning model that you build at some point time point in time a then it runs for some specified amount of time and while it's running you're getting new data coming into the system and after that time has passed you basically recreate a new model that you then reinsert at some pre-specified time hopefully at a time when it's not system critical that the model is performing at 100 capacity so you basically take it out and you put a new one in so this is what i think of as like batch model building and data science um the real time stuff is much trickier right so there is some you know there there is work there are algorithms uh that that do 
uh real-time uh updating right so you can do real-time gradient descent updates you can do some real-time updates on coefficients for you know in linear models whatever um but what i'm talking about is actually what if for example you could in real time add in a new feature or take out a feature that isn't performing currently the way you have to do that is you have to do it as if it's a batch process right you basically have to turn off that old machine learning model and create a new machine learning model where you remove the specific feature or add a new feature do whatever it is that you're doing and then put that in so obviously you can try to make that as seamless as possible inseam and make it seem as though it was the same model but really it's not right and so ultimately i think this idea of being able to in real time modify uh uh modify machine machine learning algorithms uh in in certain use cases that might be very very business critical for cert for for certain businesses thankfully that's uh that's not nearly as business critical in in some of the in many of the arenas that that my team uh tackles data science uh projects but but i can see that being being an issue um so i think basically to a long story short i think it's it's this combination of like hey the the pipelining and orchestration stuff still has lots of legs um the and then real-time uh model updating or real-time changes uh both across the pipeline and uh uh at the very end where we're actually talking about the model itself i think there's there's really lots of improvement that can be done there and with this evolution in mind by covering the uh relatively technical limitations that are still present within data teams today what do you think are some of the other biggest challenges affecting data science teams yeah so i i think there are uh two kind of overarching non-technical but still very important challenges that need to be tackled so i think the first one is just a lack of 
consistent kind of industry-wide processes for things and best practices so what would be really great is if you know there was a data science related i don't i want to say like a field guide or something where we know these are the things that work across the board or work 90 of the time in these kinds of problems i think right now what's happening what has happened that that information almost certainly exists in some very distributed disparate kind of in the ether on the internet across like random blog posts and uh across you know across random maybe as like tidbits in the documentation in certain frameworks and stuff right so you find those golden nuggets and you'll be like hey these people are saying that this works here and these people also said that this same thing works here maybe this is just a generally good thing to do right so one obvious thing that you could say about that is like hey standardizing your data is generally a good thing to do and you know that because you've heard people say that and you've seen that it's worked well but there's almost certainly a whole host of other kind of processes or best practices or what have you that don't just involve data pre-processing uh there might be or you know things around this orchestration and pipelining there might be things about like what's the best practice for serving up predictions how should they be done should it be done i mean i don't know there's just so many things that in that way that you really can only get by right now via exposure to those problems in exposure to those industry leaders that have actually done those specific things but really that's not the way that you grow the overall industry what you need to do is you need to disseminate that information and it really needs to be captured in some way almost like in a you know in like a wikipedia for data science where you know exactly kind of the way that those things are done i mean granted i sort of understand why that hasn't 
happened, right? Ultimately there's this belief that, hey, if that information is scarce, it makes what data scientists bring to the table inherently more valuable. I think to a certain extent, in a very narrow, "I want to keep being relevant in my job for the next however many years" way, yes, that probably makes sense. But I think at the larger scale of "hey, I want to make all of data science more productive," I don't think it makes sense. And I think that if you want to democratize data science, or get more people to be interested in this stuff or to be impactful and to grow, you really need to disseminate this information. I know that it's going to happen eventually, but I think this kind of consistent, industry-wide creation of something like a best practices wiki would be really, really important. So that's the first thing, and that's kind of a data science across-the-board critique, I would say, about where the challenges are.

I think, from a different perspective, the other really big challenge is buy-in by senior executives within legacy companies. Look, if you are a company that was built during the original heyday of the web, so I would say basically if you were built from, I don't know, 1995 until 2010, you're going to have data scientists, you're going to have a commitment to data and insights, because you had to have that to survive, especially when the going was really rough after the dot-com bubble of the early 2000s, right? You had to be principled and committed to data and insights from data in order to survive. So for those kinds of companies, I don't think there's any issue with technical buy-in. If you wind up working for a Google or a Facebook or an Airbnb or an Amazon or whatever, that's all solved. I don't want to
say it's all solved there, but they know the data is important, and they understand the scope of the technical challenges around data management and processing, all that stuff. They understand that it exists, they understand it's super important. I don't think that that's really fully happened in the same way at legacy companies. So companies that were founded before 1995, or that weren't founded with the internet in mind as their primary engine for creating value: among senior leadership, I do not think that they fully understand all of the aspects of what it means to be a fully data-driven, data-science-driven company. I think there are still lots of places where you have people making decisions based on their gut, based on intuition, based on industry knowledge. I think all of those things are super important. I'm not saying do not trust your gut, I'm not saying do not use your intuition, I'm not saying do not use the judgment that you've created and honed over however many years of being a senior leader at these companies. What I am saying is you need to understand that, in order to continue to thrive and survive, you need to start using data much more significantly in order to derive maximum value for your business. And I do not think that they are doing that as much as they should be at these legacy companies. I'm not just talking about where I work now or where I've worked in the past. It's just a general thing: when you start asking other data scientists or practitioners or data leaders at other organizations, hey, what's your biggest challenge, nine times out of ten they're going to tell you: look, I understand where the investments should lie, I understand what I need to be doing to make these things happen, but ultimately I need buy-in. I need people that are in the most senior leadership, you
know, the C-suite executives, to not only say every quarter during the board meetings, "we care about data science," but to actively say: hey, we are going to make investments in data warehousing, we are going to make investments in this monitoring, we're going to make investments in third-party data that we're going to purchase, because we understand that the more data we have that's relevant to our business processes, the more profitable or the more successful we will be as a business. So I think ultimately it's the fact that we have basically two cohorts, or at a minimum two different kinds of companies, that are operating within industry today. You have legacy companies that are saying that they're committed to data science but still have really not made the full plunge. And then you have the companies that account for the vast majority of what we would think of as growth and success and value creation, whether it's from a stock market perspective or whatever. We're talking about basically Amazon and the post-Amazon companies that have been created since then, that are full in on data, full in on technical buy-in, and have as a result unlocked whatever trillions in value for themselves and for the economy as a whole.

I think there's so much to unpack in both of these points. But the second point that you mentioned, the lack of committed buy-in by senior executives, especially in relatively legacy industries where the organizations have not really used the internet as their primary source of value creation, especially when you think about the Amazons and the Airbnbs and the Ubers of the world, I think that problem, the data culture problem, is to a certain extent one of the biggest problems affecting data science today. So I'd love to expand on that one. As a data science executive, I'm sure that you had
tons of experience with gaining buy-in from stakeholders, non-technical ones. Can you walk us through some of the pitfalls data science teams or data science team leaders often encounter when trying to gain buy-in?

Yeah, so I think there are a couple that I've noticed, anyway. The first is one that people do in general, where they're trying to seem helpful, but ultimately it can significantly hinder their long-term success within the company. It's kind of the wizard claim, where people ask you, "hey, what can you do for us?" and you say, "well, everything. I can literally do everything. I can answer every question, I can achieve operational success anywhere you put me." And yes, that's probably the case over a long enough timeline. If I had infinite time and infinite resources, I could answer any question, right? But that's not what a senior executive is looking for. What they're saying is: look, I have these specific things I care about and I want concrete improvements on these. When you make that claim, you basically over-promise and under-deliver. So my first axiom of being a data science manager is: you always under-promise and over-deliver. You never do the opposite. Okay? So you say you can do a little, but you wind up doing double that amount, so that people immediately come away impressed. And that fosters that buy-in, right? Because as soon as you way over-deliver on a given project, people ask, "hey, can you improve this one little thing?" and you go, "sure, oh, and by the way, I also did this, this, this and this." That's great. What you don't want to do is go, "oh yeah, I can do that thing, but also I can do this other thing, oh, and I can do this other thing, and this other thing on top of that." And then when you have your next check-in with that senior executive, you've basically gotten 10% of the way across all of those things, and you
have nothing actually concrete to show them. So basically, over-promising and under-delivering, which I think is a general problem for any of what I would call an innovation-type group in an organization, will doom your ability to get that longer-term buy-in, because now they just can't trust your word, right? You say that you can do X, Y, Z, A, B, C, and then you wind up only being able to do a third of X, a half of Y and a tenth of Z, and, oh, by the way, none of them is actually a full letter. So now what the hell can I do with that, right? So in general, over-promising and under-delivering is the best way for you to sink your ability to get technical buy-in. So definitely don't do that.

The second thing is part of this, but it's really in the actual execution aspect, and this is what people talk about as the perfect being the enemy of the good. Or I like to think of it as the perfect being the enemy of the good enough, or reasonable, where you basically just say: look, this works well enough to where it's better than chance. If you have any kind of improvement over what came before, it's good enough. Let's immediately start putting the ductwork around this, so this can be a repeated process, it can be productionized, et cetera. In general you hear this a lot, right? There are two things you hear from data scientists. Data scientists always say, "80% of my job is data cleaning." And then the other thing that they say is, when it comes to actually building a model, in the first 20% of the time you get 80% of the way on performance, and then in the remaining 80% of the time when you're building the model you get the remaining 20% of the performance. So what does that actually mean, right? If you think about that in terms of actual time, it means that if you want to get a reasonably good model
and you spend the first month doing that, then if you want to get even a 5, 10, 15 percent boost, it's going to take you months and months and months of additional work, right? So what the hell does that mean? What that means is those months and months of additional work, where you got marginal improvements in the overall quality of the model, have effectively taken precedence over actually productionizing the model. What should have happened is, as soon as you got to something that was better than nothing, or better than what came before, no matter how limited that improvement was, you should have immediately started building out the pipelines, the reporting, the monitoring, automating the ETL processes, automating all of that other stuff, to actually get it to a place where it's a data product, right? And so I think that's the second really salient point. Say that when you're meeting with senior executives you're always just going, "hey, we improved our model, we improved our model by five percent." Let's say you're not going to get a weekly meeting with a senior executive, let's say it's a monthly check-in. So month one you go, "hey, we got to 80 percent," and then month two you go, "hey, we got to 85 percent," and then month three you go, "hey, we got to 87," and then month four you go, "we got to 89." The executive at this point is going to be like, what are these people doing? Eighty percent three months ago was way better, in terms of potential revenue increases or decreases in losses, than 89 now without any actual data product built and no actual business-critical things driving it, right? So that, I think, is really, really critical. If you want to get solid technical buy-in from a senior leader, you really, really have to,
as soon as you meet your criterion, where again the criterion here has to be super low, where you literally just say it's better than what we had before, whatever that was, immediately build something on top of it. Build the non-ML parts, or non-core-ML parts, of that product, such that you can immediately show that this thing works: we can deploy it today, it's not going to be great, but it's going to be better than what we had, and hey, that means that you're saving money. As soon as the senior leader is like, "this is great, we didn't have anything before, now this thing works," that's where they're immediately going to be like, "okay, this is really awesome, how can we get more?" And at that point you can be like, "okay, well, you see we did this. We can do so much more, but you have to understand, here are all of the obstacles that we're facing." And so you basically use a quick win to then drive the more lasting, more difficult, longer-term change in that organization.

Yeah, getting a quick win is so essential, one, to bringing up an enthusiastic data culture within an organization, but also to getting organizational buy-in, because you need to be able to provide value fast. Otherwise there are going to be questions around the value of data science in general. So on the flip side, what are some of the best practices you've found that can help ease alignment there and ensure organizational buy-in around data projects?

Yeah, so it's basically: take everything I said and do kind of the opposite of that, with a little bit of other stuff thrown in. So first, I think the most critical business-related aspect of being a data scientist is operationalizing and converting a statement by a senior leader into something that's measurable, right? So this person says, hey, I want to reduce X by Y percent, or they say, I want to increase
revenue on this thing by this amount. You have to say: okay, increase revenue by this amount. What is tied to revenue? How can we measure that tie to revenue? What aspect of that revenue generation process is the least efficient currently, and how can we use ML to improve that efficiency in some way, right? So basically the idea is you have to operationalize whatever it is that the senior executive asked for, convert it into something that can be measured, that can be converted into tables and bits, et cetera, and do that as early as possible. Okay? And make sure that the senior executive is aware of what those criteria are and agrees that they make sense in their context. So for example, if I go back to this original question, where the senior executive said, "hey, we want to increase our revenue in this specific field by 10 percent." Okay, so let's say we want to increase our revenue for this specific product by 10 percent. Then you look at the product and you go: okay, well, one way you can do that is by growing your subscriber base. Another way you can do that is by limiting churn. So you go back to the person: "okay, you said you want to increase revenue by this much, so why don't we think of a project where we actually increase the subscriber base?" You tell the executive that and they go, "no, no, that's not going to work. We've already saturated the market, there are no more new subscribers that we can tackle, and our acquisition costs are going to be too high." So if you get to that point quickly, you can immediately say: okay, we're not going to start going down the rabbit hole of what we can do to acquire new customers, so let's look at the other thing. "Okay, what about churn? What if we reduce churn by this amount?" And then the executive goes, "yes, exactly, the way that I think we should increase our revenue is by limiting churn." And so now immediately you're like, okay,
great, I understand that it's a churn issue, so let me get to tackling that. So in general, I think what's really important is making explicit the things that are implicit in what the executive is saying. Drawing them out is really important for at least starting on the right foot, because otherwise you might assume that they're saying one thing when they're saying something totally different, and you really have to bring it out of them. So that's the first piece: this idea of operationalizing exactly what your success criteria are as early as possible. Basically, as soon as you have that first meeting, you make sure you understand exactly what it is that they are interested in tackling, and anything that is vague is made explicit as quickly as possible, because that means you can immediately get started.

Now, once you've operationalized what your criteria for success are, the next thing, and this is kind of obvious, but you would be shocked at how difficult this sometimes is, is getting access to the data that you need as early as possible in a project's lifetime. Basically, as soon as you have that first meeting, you need to get access to this data. And it doesn't actually mean you need the full database. You just need something. You need a sample, just anything that shows what the actual real data looks like. It doesn't have to be, like I said, access to a production data warehouse. It doesn't have to be all of the flat files that have ever existed. Just something, because ultimately that's the only way you can really measure the true amount of effort that is going to be necessary for this project to actually become viable. Okay? And the reason I say this is because this is the only way that I've ever seen anything
work in an organization. I don't think it's because of anything nefarious, it's just because people assume they know things that they don't. It's that you don't know the state of that data. Never, ever, ever trust anyone's claims about cleanliness, data structure, data frequency, just any assumptions that they have or what they say they think they know about the data, until you've done basic EDA on it, so exploratory data analysis. Until then you don't know anything about the data. It could be anything. They could tell you that it's in a pristine state, and then you get it and 90% of the columns have 50% missing values. One column might actually have several different data formats combined. I've seen plenty of cases where something that's a date is actually a timestamp, a month and a day, and then sometimes it's just Unix time. There are just so many things that you have to check and see, and the only way you can do that is by getting access to a snippet of what you're going to be working on as early as possible. Okay? Because the earlier you do that, the better you can understand what your true timelines are going to be.

Then the next part, I think, is just the inverse of what I had said earlier. So I'd said the perfect becoming the enemy of the good enough. What I'm saying here is: look, as soon as you have something that's at some reasonable criterion, stop focusing on the core ML parts of your product and immediately allocate as many resources as you can to actually standing this thing up. Because although it's fun to get incremental gains, there's a little dopamine rush, right, a little dopamine drip that happens any time you get a slight performance boost in your model, it's not the stuff that's actually going to get your
project across the finish line, right? What you need to do is be building out the non-core-ML parts of your project in order for it to succeed within the timelines that you've told people they're going to have. So the monitoring, the visualization, the pipelining stuff is the stuff that you need to build to get to the finish line. So start building it as soon as you possibly can, because that's ultimately what's going to provide true value to both your business and your senior stakeholders.

And then lastly, this is just a general resourcing, allocation and estimation thing that I've learned to do after being burned a couple of times. Basically, you have your internal estimate of how long you think a given project will take. Just double that, and then give that as your actual timeline to your senior stakeholder. So if you think that something will take you a month and a half to do internally, tell the senior executive that it's going to be three months. That way, when you do deliver it in two and a half months, or in two months, it's seen as a huge improvement and everyone's really, really happy. Again, this is the idea that you always want to under-promise and over-deliver. Doing that is the way that you get senior executives to buy in effectively and consistently on the projects that you begin.

Yeah, I think this is really solid advice, especially when applied to organizations that are still maturing their data science competencies. I would assume, for example, that access to data is not a major problem at the Ubers and Airbnbs of the world, but this is quite a ubiquitous problem throughout the industry, and I think this is super useful. Now, with that in mind, I would like to segue to data democratization, because I think a really important aspect of data science is not only producing data
products but really equipping the rest of the organization with the ability to work with data themselves. How do you view the importance of democratizing data for data science teams as a strategic imperative?

Yeah, I mean, I think the question almost answers itself, right? We know that data is really important. Data is what's going to unlock the largest amount of business value when applied correctly. So the more people in your organization who aren't pure data scientists or data engineers or data analysts that you can get to have access to that data and start thinking about it, the more successful you're going to become. So I think it's absolutely critical to provide people outside of data science organizations the tools to be able to do what I like to call fishing for themselves, right? If you do not have an expert background in a scripting or programming language, in statistics, or, I don't know, in SQL or something, it would be really great if you could still get to even half of the kinds of questions that you want to be able to answer but currently can't, because you just don't have the necessary skills, right? And ultimately, having data scientists do these things is a bad use of resources, right? That means getting people in marketing, or in product, or in some other part of your organization answers to questions that they should be able to get themselves. If the data was reasonably structured, or was placed somewhere where it could be easily accessed by non-technical people, it would just impact them so much more, right? Like I said, data scientists are expensive, and they spend the vast majority of their time organizing and cleaning data and much less of that time actually mining it, right? So if you think about it, if you
can get even a small fraction of your organization, but more than currently are like this, to be able to do that, and I think my dream would be if everyone could use SQL in the same way that they use Excel, right? If they could even get to that point, the entire organization would benefit just so immensely. The organization's ability to tackle questions quickly, I would bet, would grow by an order of magnitude. So yes, I think data democratization is super, super important.

Yeah, 100%. And even, for example, internally at DataCamp, we have a centralized data warehouse, and on top of it you can have Metabase or some sort of SQL database connection, and most people at DataCamp know how to use SQL. That has really enabled everyone to answer their data questions. If you're on the sales team and you want to see which account has the highest sales, you can check that out immediately. If you're on the marketing team and you want to optimize spend somehow, all of this analysis is done immediately through Metabase. Like any organization, there are so many things that we can do to become more data mature, but it has really changed how we interact with data in that sense. While this is highly use-case or industry specific, what are some of the low-hanging fruit that you find data science teams can quickly implement today to further democratize data science and, as you said, give people the ability to fish for themselves?

Yeah, so I think the quickest things that they can do are two things. One is you provide an aggregated data view at the level that business analysts would typically see data, and surface it up to people so they can plug away at it in a dashboard-like environment, so something like a Tableau. I think that would probably be the first thing. So basically you figure out some reasonable level of data granularity,
you surface up a table at that data granularity, and you provide updates to that table. It could be a materialized view, right? You just have a materialized view that's constantly being updated with fresh data at a set aggregation level, and then you surface it up in a dashboard. The other way, I think, is something like what you just talked about, like a Metabase. I know there are other tools that provide effectively something like an Excel-like connector on top of the data warehouse. So if you can do something like that, I think that would be a very easy quick win. If you can get something in place that people in the organization are reasonably comfortable with, so providing something like an Excel-like product, but one that actually connects to a data warehouse that has, like I said, terabytes of data, I think that would be one very quick way to unlock that value. Now, unfortunately, you will have to do some kind of socialization around the do's and don'ts for that, right? So I imagine, if you have a data warehouse and people are using SQL, that's well and good. If they're looking at the top X in Y, so the top salesperson or the top account across the organization, or who's generated the most sales over the past month or whatever it is, that's great. But if, for example, they're trying to get larger segments of the data, knowing that there are certain things you shouldn't do because it could brick your database is really important, right? So for example, people doing a SELECT * without a LIMIT, or basic pitfalls like that which they should avoid when performing these kinds of queries, right? And
I assume that those same kinds of things would happen even if you had an Excel-like connector. But I think those are the two things you can do: one, provide a reasonably aggregated view that can then be accessed by people, and then, once you have that reasonably aggregated view, have something like an Excel-like connector that people can connect with. I've never used Metabase, but now I'm very interested in what this is, so I'm going to have to check that out. Thank you for that intel. But I know there are other tools like that that provide basically a query layer on top of your data warehouses, so I would suggest that as well.

Yeah, exactly. And in your experience as well, what do you think are the obstacles standing in the way of enabling really mature, robust data democratization, and what do you think are some of the tactics that can alleviate them?

Yeah, so this is going to be my plug for DataCamp here in this podcast. The first thing that I would say is: look, there's fundamentally a skills disconnect between what's necessary in the modern data-based company and what people actually possess. So I think the first thing that people should learn is SQL. Take some courses on DataCamp and learn SQL. If you don't want to use DataCamp, go somewhere else and learn SQL, but learn some SQL. I think that is absolutely the best way for a person to level up their data abilities nowadays. It's older than Excel and more powerful than Excel, at least from a data munging perspective, maybe not from a bunch of other perspectives, but I think that's really important. I think anybody that is comfortable in Excel should learn to become comfortable in a SQL environment. And I mean, you think about a
business analyst, I mean, they live and breathe Excel, right? They know it really well. They do all sorts of crazy things. I've seen people do stuff in Excel where I was just like, what is this? This is not Excel, this is some weird abstract thing, you have scripts in here. A cell has literally enough text inside of it to fill an entire page, and it's like, dude, you're literally creating a Mario clone inside of this Excel spreadsheet, what are you doing? So clearly, if you can create Tetris inside of an Excel window, you can learn to use SQL. Okay? You are very, very good at manipulating data. It's a little bit of a different way to think, but you should totally be able to do that, right? And in general, that's kind of the way that you're going to have to mature technically, yourself and your entire data org as a whole, because data now lives in the cloud, right? The time of people passing around Excel spreadsheets and saving them to some J drive somewhere, I mean, that's going to keep existing, but that's not where your golden records are going to live. If you as an organization are still living in a world where all of your master data sits across 300 or 500 different Excel spreadsheets, maybe that's sustainable in the near term, but that's just not going to be sustainable over the longer term. That stuff's going to be in a cloud data warehouse, it's going to be in a cloud database of some sort. In order to be able to access it, in order to be able to perform any kind of non-trivial query on it, any kind of non-trivial aggregation on it, you're going to need to use SQL, so you might as well learn it, right? So I think that's just absolutely important. I think you
can do that in lots of places. I know that DataCamp has some excellent SQL courses, so those of you out there that want to learn more should totally check that stuff out. I didn't write any of them, so this is not a plug for any of my courses. Anyway, that's the first thing I would say.

Then the next is, I think, more of a description of the entire data enterprise as a whole, and it's really this lack of understanding or appreciation of how the way in which your source data, wherever it's coming from, those signals, is collected or stored impacts how quickly or easily a given question can be answered. I think this is more of a problem, again, for senior executives, where they just assume: hey, we have this data, right? So why can't you just answer this question? We have every possible interaction that's ever happened on our website or on our app or on our service, whatever it is. Why can't you just tell me how many X's are in Y? And the reason that I can't tell you that is because of, one, the way that the data is stored, or two, the way that the data was collected. And so it's having an appreciation that, hey, if for example the way that we store sessions is separate from the way that we store users, that means it's very difficult to figure out how many unique users were on your platform over the past month. And if senior executives don't appreciate that, they don't understand that, hey, the way this data is actually sourced makes it so that what you think is a simple question to answer is actually kind of a difficult question to answer. It's not nearly as trivial as you thought. So basically, getting people to understand this, and I've definitely had to do some of this myself, has paid dividends, because they then can push back on others and be like, why the hell was this built
like this? Why did we not do it this way? And that immediately makes the accountability for how data is processed, cleaned, and stored a lot more visible, a lot more transparent. And when you identify why there are these kinds of breaks or gaps in the system, it immediately makes it so that everyone has to talk to each other a lot more, and the more that they talk, the more, hopefully, the issues that they're having are going to be solved. So I think that's really the second obstacle to this data democratization. It's really this notion of, hey, being able to understand how the data was sourced, and being able to effectively evangelize how the data is structured, immediately lets people understand two things. One is that they have more of an appreciation for how difficult some things are, but they also now have a much better understanding that, hey, if something takes a while, it means there are probably some significant gaps in the way that things are happening. I know this isn't really a point on democratization of data access; I think it's more of a point on democratization of the challenges around data, if you will.

No, I think democratizing data access and providing extreme context around how data is sourced, and how it impacts a given function, is also super important in the formula of data democratization. Sergey, it was a huge pleasure chatting, and before we let you go, do you have any other call to action to make?

You know, one, I wanted to say thank you very much, Adel. This was really, really lovely. This was a great way for me to relive my entire data journey up to this point, so this was cool. I think really the only call to action I have is that everyone should practice a little bit more humility
when it comes to data science. I mean, this stuff is hard, and I think that most people have the best interests of their companies, or whatever it is that they're doing, in mind. I don't think everyone is out there nefariously trying to ruin data projects. So I think practicing some humility when it comes to both practicing data science and evangelizing for data science is really, really important. And as part of that, when you're humble, it means that you always have room for growth; you have more opportunities to learn. Ultimately, learning is the way that you make the largest impact within any organization that you work for, but it also makes for a much more meaningful life. I've enjoyed learning about new techniques, new frameworks, and all of that stuff, and I think as long as you keep learning, as a data scientist or really just across your entire life, you'll find that your work and the things that you do are going to become a lot more meaningful. So that's my hope for everyone: stay humble and stay learning.

That's awesome. I would highly recommend that you check out Extreme Gradient Boosting on DataCamp, taught by Sergey Fogelson. And with that in mind, thank you so much, Sergey.

Thank you again, Adel.

That's it for today's episode of DataFramed. Thanks for being with us. I really enjoyed Sergey's insights on how data science has evolved over the years and what still remains to be done to really scale the impact of data science across organizations and industries at large. If you enjoyed this episode, make sure to leave a review on iTunes. Our next episode will be with Barr Moses, CEO of Monte Carlo Data, on the data quality challenges data teams face and the importance of data observability and reliability. I hope it will be useful for you, and we hope to catch you next time on DataFramed.

Hello, this is Adel Nehme from DataCamp, and welcome to DataFramed, a podcast covering all things data and its impact on organizations across the world. One thing we're looking forward to covering in more detail on the podcast is not only the latest insights on how data science is impacting organizations today, but how the field has evolved and is evolving towards democratizing data science for all. This is why I'm excited to have Sergey Fogelson on for today's episode. Sergey began his career as an academic at Dartmouth College in Hanover, New Hampshire, where he researched the neural bases of visual category learning and obtained his PhD in cognitive neuroscience. After leaving academia, Sergey got into the rapidly growing startup scene in the New York City metro area, where he has worked as a data scientist in digital advertising, cybersecurity, finance, and media. Currently, he's the Vice President of Data Science and Modeling at ViacomCBS, where he leads a team of data scientists and analysts that work on a variety of awesome use cases. In this episode, Sergey and I discuss his background, how data science has evolved since he got into the field, the major challenges he thinks data teams and professionals face today, his best practices for gaining buy-in from business executives on data projects, his best practices when democratizing data science in the organization, and more. If you want to check out previous episodes of the podcast and show notes, make sure to go to www.datacamp.com/community/podcast.

Sergey, I'm really excited to have you on the show. I've been excited to have this chat on the state of data science, your experiences leading data teams, and democratizing data science. But beforehand, can you please give our listeners a background on how you got into data science?

Sure, would love to. Thank you for having me, Adel. I'm really excited also to speak with you about all of this stuff. So my academic background is in cognitive neuroscience, so I got my graduate degree
in cognitive neuroscience, applying ML algorithms to functional neuroimaging data. Basically, what this means is you put people into large scanners, record their brain activity, and then try to decode what's actually happening in their brains using machine learning algorithms. Probably about halfway through my PhD, I knew I didn't really want to stay in academia, and I knew I wanted to work on interesting data-related or data-intensive problems. When I was at that point in my PhD, so this is around 2010, I heard about this thing that people were talking about called big data. They didn't really have a term for data science at that moment in time. So I just knew that there was this field, still in its infancy, where you could use the same kinds of algorithms that I was using for neuroimaging work, but applied to real-world datasets: datasets in advertising, in finance, in quantitative analysis, kind of all over the place. So I started looking into this stuff, started reading about it, and in my last year I really made a hard push to try to get into industry. I wound up being able to land a job in the world outside of academia, and I haven't really looked back since.

So what were some of the earlier data science projects that you worked on, and how has that shaped your path to leading data science at Viacom?

Yeah, I would like to think that I've had pretty varied experiences, but maybe not. I think they're reasonably eclectic. At the very beginning of my career, I worked for a digital advertising startup, and there were two big problems I worked on. One was a classification problem, and I think it's still a pretty relevant problem; I don't think it has really been solved yet. It's the idea of taking IP addresses and trying to understand what kind of place that IP address represents. So, for
example, is this an IP address associated with a home? Is it associated with an airport, or a Starbucks, or some other business? Is it an educational IP address, et cetera? There is some metadata associated with that information, but it's not 100% accurate. So what you can do is take the signals that are coming out of that IP address to make probabilistic inferences about whether you think it's a home or not. That was really important for the work that we were doing, because the main product that this company was selling is called a device graph. Basically, it tells you whether any two devices belong to the same household or not. To be able to do that, and to build those links across devices, it's really critical to understand whether something is a home or not, whether that device is being seen within a home-based environment or not. So that was one of the first projects I had. I also worked a little bit on this graph-building problem that I just briefly mentioned. The idea here is, again, you're trying to figure out whether two distinct devices, say an iPhone and a tablet, or an Android phone and a smart TV, belong to the same person or to the same household. This relies on some network analysis techniques, and then on really thinking about how to do this at very, very large scales. Back then, again, this was like 2014.
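
As a rough illustration of the device graph idea described here, the following is a minimal Python sketch. It is not the startup's actual method: the input format, the `home_probability` scores, the `threshold` cutoff, and the function names are all assumptions made up for the example. It keeps only IP addresses that are probably homes, links devices seen on the same such IP, and groups linked devices into households with a simple union-find.

```python
from collections import defaultdict


class UnionFind:
    """Minimal disjoint-set structure used to merge linked devices."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        # Path halving keeps later lookups short.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def build_households(sightings, home_probability, threshold=0.8):
    """Group devices that were seen on the same likely-home IP address.

    sightings: iterable of (device_id, ip_address) pairs (hypothetical format).
    home_probability: dict of ip_address -> estimated P(ip is a home).
    threshold: hypothetical cutoff above which an IP is treated as a home.
    """
    devices_by_ip = defaultdict(list)
    for device, ip in sightings:
        # Ignore IPs that are probably airports, coffee shops, offices, etc.
        if home_probability.get(ip, 0.0) >= threshold:
            devices_by_ip[ip].append(device)

    uf = UnionFind()
    for devices in devices_by_ip.values():
        first = devices[0]
        uf.find(first)  # register single-device households too
        for other in devices[1:]:
            uf.union(first, other)

    households = defaultdict(set)
    for device in list(uf.parent):
        households[uf.find(device)].add(device)
    return sorted(sorted(group) for group in households.values())


# Toy data: all identifiers and probabilities below are invented.
sightings = [
    ("iphone-1", "10.0.0.1"),
    ("tablet-1", "10.0.0.1"),
    ("smart-tv-1", "10.0.0.1"),
    ("android-2", "10.0.0.2"),
    ("iphone-1", "10.0.0.9"),   # a public place both phones visited
    ("android-2", "10.0.0.9"),
]
home_probability = {"10.0.0.1": 0.95, "10.0.0.2": 0.90, "10.0.0.9": 0.05}
print(build_households(sightings, home_probability))
```

In this toy data the shared public IP is filtered out, so the two phones that met there are not merged into one household. That is the reason the home-versus-not-home classification matters before any links are built.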
There really weren't a lot of large-scale data analysis frameworks. Spark was basically at, like, 0.1 or something; it was a completely new project. So it was just very difficult, and interesting, to tackle these kinds of problems that involved working with data at scale. That was my first foray into data science. After that, I moved into cybersecurity. I worked at a cybersecurity startup for a little while, and there the most important problem that I tackled was what we called hack prediction. The idea was, given a company's cybersecurity footprint, so the number of IP addresses that they have exposed to the public internet, if you can snoop, for example, and see what kinds of software they're running on those IP addresses, on servers or on personal computers, et cetera, you can actually see if the software is all up to date. We know that if that software is not up to date, it can be hacked in various ways. The idea is that you take all of these kinds of signals and then assign a probability score to the likelihood that this company is going to get hacked within six months, or within a year, or within two or three years. So we called that a hack prediction. Then I moved on and worked for a small data consultancy, where we worked for a large investment bank on what's called an automated account reconciliation problem. This is not particularly attractive from a data analysis perspective, but it's actually super critical from a back-office perspective. The idea here is you have two distinct accounting systems, and they occasionally do not line up with each other. They need to be what's called reconciled, and you need to basically assign a likelihood that they need to be manually reconciled, versus being dealt with by other downstream automated systems. So this is almost like a health
check that happens at one point in this massive reconciliation process that happens every day within, I would say, every major investment bank in the world, where you're trying to make sure that your books line up at the end of the day. And this was something that had been done by thousands of people: when we first started this project, there were over a thousand people who were hired explicitly to do this, to manually check all of these records. So what we did was we basically took years and years' worth of their manual checks and put a machine learning algorithm on top of that. We built this ensemble model, and you could say, look, given this metadata associated with these trades, what's the likelihood that they need to be reconciled manually, or surfaced up to a manual reconciler, versus just passed through the system? Anyway, long story short, using machine learning on past human performance actually worked surprisingly well. We wound up being able to automate basically 90% of the reconciliation process in this way, so only the most difficult-to-reconcile records wound up being validated by human beings, which meant that the people who were hired to do this could now do other, more meaningful, more impactful stuff. So I think that was an overall win.

Yeah, I think all of these projects that you've engaged in are super useful in the sense that they give you this breadth of experience in the data space. One thing that I really want to pick your brain on is reflecting on how data science has evolved over the past decade or so. For example, you mentioned that for some of the problems you were working on, Spark was still at 0.1 or 0.2, and there weren't really these mature data analysis frameworks. I'd love your thoughts on some of the major changes that you've seen occur over the past decade, and how you see them playing out within data
science teams today.

Yeah, I think that's a really interesting thing to talk about. When I first started, again, I started in 2013, everything was on Hadoop. In my first job I was working on Pig jobs, then I was working in Hive, and then I was also working with a framework that came out of Twitter called Scalding, which was basically Hadoop but using Scala, so writing Scala jobs. Hadoop basically only exists now, from the way that I see it in the industry, as a legacy system. People are not going out and saying, if I'm going to build a new state-of-the-art data architecture, I'm going to use Hadoop. I don't know anybody, or any company, that makes that a flashy thing that they describe. Now what do we have? Spark is basically completely built out and has almost completely taken over data science from a data processing, ETL, and even ML perspective. So there's not really any need to think about data from a MapReduce perspective, and in fact I haven't touched a MapReduce job in probably, I want to say, three and a half to four years or something. Also, before, you had most of your data in flat files, right? When I was working at the digital advertising startup, we basically had petabytes of data in flat files that were either compressed CSVs or, back then, this revolutionary new data format called Parquet, which is now super standard across the industry. But nowadays it's actually so cheap, I mean, it's still reasonably expensive if you have really, really massive datasets, but it's so cheap to put terabytes of data into structured data warehouses that you can actually query datasets on the scale of tens or maybe even hundreds of terabytes. I haven't heard of anybody having petabyte-scale data warehouses, at least within my industry, because we don't really have petabyte-scale data yet, but I
assume that there's probably somebody in finance, or especially in web-scale companies, dealing with petabyte-scale data warehouses where you can run structured SQL queries that'll give you results within, at most, five minutes or something, which was just completely unheard of eight years ago when I started. So that's, I think, the first really big difference. There are still people working with flat files, especially with unstructured data, right? You can't just put unstructured data into a relational database, into a SQL database. If you have text or you have images, I think it's still very hard to work with those in a more structured environment. You can definitely put the metadata associated with them into a database, but I don't think you can put the raw data itself into a database. So there is still a place for unstructured data, but for the most part, if your data is tabular and structured, it's going to be in a massive data warehouse. It's not really going to be in flat files anymore, unless it's completely unprocessed, completely raw, and you're just putting it there for legacy purposes or for safekeeping. The next big thing that I think has happened, and I think this is the most revolutionary thing, though again it's not the sexy stuff, is the orchestration and data pipelining frameworks for automating data jobs on some periodic basis. We're talking about frameworks like Luigi, frameworks like Airflow. There are obviously other ones. Airflow is what I use now fairly regularly, or what folks on my team use fairly regularly. But that whole idea of basically cron for data science, that didn't really exist. I mean, there were people who were basically using cron in legacy Linux or Unix-based frameworks for data
pipelining, for ETL processes, but these frameworks really hadn't come into their own at that moment in time. They were still in their infancy. Now basically everyone has some kind of data pipelining framework that they use across some kinds of jobs within their data teams. That, I think, is really, really important. That's the stuff that allows you to increase the velocity of your data workflows. Instead of having to figure out how to automate that stuff, you can basically just build it once and forget about it. It's scheduled, it's run, there's alerting, there's error reporting, all of that stuff is kind of baked in, you have a front end. It's just really incredibly valuable. Okay, so that's kind of the gut works, or skunkworks, revolution that has happened in data science. Then, on the more standard ML side of things, there's the fact that we now have way more machine learning frameworks, and the vast majority of common machine learning algorithms do not have to be written from scratch. Again, I'll come back to my first gig working in digital advertising. One thing we had to do was connected components analysis. After you've built your graph, you need to understand the size of all of your many connected clusters, your households. And the problem is, when you're dealing with a graph that contains hundreds of millions or billions of edges, that was a non-trivial thing to be able to compute. I remember we actually had to write our own connected components algorithm on top of Scala. You wrote a Scalding job that created this connected components graph, and that was probably three months of work or something. Now you take an off-the-shelf network analysis package and it's just baked in, and it can handle graphs with hundreds of millions of edges and hundreds of
millions of nodes without a problem. Other things: when I was starting out, there was really only one implementation of gradient boosting that I remember seeing, and then all of a sudden basically everyone has a gradient boosting implementation. Most gradient boosting frameworks now have GPU options, and they're really fast and really robust. The point I'm trying to make is that before, you had machine learning algorithms that you basically had to implement from scratch, or that were really difficult to get up and running at scale. Now it's not only that you don't have to implement them from scratch; you have implementations in various languages, they're very fast, there are robust communities around each of these ML frameworks, and in many cases there are GPU versions of things, so they're way faster than they used to be. So again, you have open source frameworks, there are lots of them, they're much faster than they used to be, and they're much more extensible than they used to be. I think there's been a crazy revolution there, but I don't know that it's nearly as important as the skunkworks stuff that I talked about earlier. The next things, I think, are visualization and explainability. When I started, there were very limited explainability methods for nonlinear algorithms. For linear algorithms, a common thing you can do is just look at the regression coefficients, and that gives you basically the full story. When it comes to nonlinear methods, it's much harder to say what the actual impact of a given feature is on a specific prediction. But that's changing, right? Now we have some very powerful explainability methods. I can think of SHAP, or Shapley value explanations. We have LIME, which is more of a local method for explanations, for visual
problems, or problems in image recognition. These algorithms have really, foundationally, allowed practitioners to quickly see where signal is in your feature space and where it isn't. Ultimately, this has significantly accelerated how quickly you can iterate on creating new features, on feature engineering, in ways that we really couldn't before in the nonlinear algorithm space. So again, LIME and SHAP, neither of these things existed when I started. As an aside, something that is much bigger now, at least within the past, I would say, year to two years, is this notion of feature stores. The idea here is that you can create databases or tables that contain the latest versions of the features that you're using for your machine learning algorithms, and you can quickly update them and quickly source them instead of having to recreate them in some process. You can almost think of them as tables that you regularly update with the latest versions of whatever features you're using to feed downstream models. What's cool about feature stores is that you can use them across multiple different model spaces. If you have a new data scientist who comes in and you're like, hey, you're going to be working on this churn model, you can immediately point them to a place where all of the latest, greatest features exist for that churn model, or from other models that have been built in the past. So this person can be very quickly brought up to speed with what seems to be working in the problem space that you're attempting to tackle. Again, feature stores were not something that existed when I started.

Yeah, thank you so much for this really interesting rundown of everything that you've seen evolve in data science over the years. And I think really
like a broad theme that you're talking about here is the move from experimentation to operationalization and productionization, right? I definitely agree with you as well on the skunkworks revolution that you talked about, on how orchestration platforms and the move to centralized data warehousing, at least for tabular data, have really changed the game for data scientists and their ability to provide value quickly and to scale their work. I think this is a testament to how much the tooling stack for data scientists has evolved over the past years. I'd love your insights, then, on the flip side: where do you think there is still room for improvement, and where do you think the data tooling stack is headed?

Yeah, so I think the largest place for improvement isn't really on the side of greater, better, faster machine learning algorithms. At this point, yes, you can always have a better neural network model that captures 0.1% more performance for a specific task. I think that's always going to be the case, and there's always a place for that. But I think it's the operationalization bit that still has the most legs to really grow and mature across all aspects of data science. I talked about feature stores and these orchestration and pipelining frameworks, and they work, but in many respects they're still fairly brittle. Automated model performance monitoring still isn't where I think it could get to. Another place where there could still be some significant improvement, or maybe I just haven't seen the right product, is seamlessly updating different aspects of the ML pipeline. Think of all aspects of the ML pipeline as being as modular as possible, so, for example, everything from data loading, pre-processing,
feature generation, just the standard JPEG of the typical data science pipeline. You treat all of those parts as if they're individual, completely isolated boxes that can just be popped in and popped out. But they really can't. The current issue is that if you want to add a new data source to some process, or tweak a data source for some process, it's really not seamless. It's not as simple as, hey, point to this endpoint and then your model will just magically understand how this data is structured, what needs to be done to convert it, et cetera. You basically have to touch every single box in that JPEG that you have for your typical data pipeline and change it in some specific way. So it's really not completely modular in the way that we think of pure modularity. There are good things to that and there are bad things to that, but I think there are still some improvements that can be made from that perspective. And then, more on the real-time side of things: right now, you can get a lot of stuff done with batch machine learning models. What I mean by that is you have a machine learning model that you build at some point in time, time A. Then it runs for some specified amount of time, and while it's running, you're getting new data coming into the system. After that time has passed, you create a new model that you then reinsert at some pre-specified time, hopefully at a time when it's not system-critical that the model is performing at 100% capacity. So you basically take the old one out and you put a new one in. This is what I think of as batch model building in data science. The real-time stuff is much trickier, right? There is work, there are algorithms that do
real-time updating, right? You can do real-time gradient descent updates, you can do real-time updates on coefficients in linear models, whatever. But what I'm talking about is, what if, for example, you could add in a new feature in real time, or take out a feature that isn't performing? Currently, the way you have to do that is as if it's a batch process. You basically have to turn off that old machine learning model and create a new machine learning model where you remove the specific feature or add a new feature, whatever it is that you're doing, and then put that in. Obviously you can try to make that as seamless as possible and make it seem as though it was the same model, but really it's not. So ultimately, I think this idea of being able to modify machine learning algorithms in real time might be very business-critical in certain use cases, for certain businesses. Thankfully, that's not nearly as business-critical in many of the arenas where my team tackles data science projects, but I can see that being an issue. So, long story short, I think it's this combination of, hey, the pipelining and orchestration stuff still has lots of legs, and then real-time model updating, or real-time changes, both across the pipeline and at the very end, where we're actually talking about the model itself. I think there's really lots of improvement that can be done there.

And with this evolution in mind, having covered the relatively technical limitations that are still present within data teams today, what do you think are some of the other biggest challenges affecting data science teams?

Yeah, so I think there are two overarching, non-technical but still very important challenges that need to be tackled. I think the first one is just a lack of
consistent, industry-wide processes and best practices. What would be really great is if there was a data science, I want to say, field guide or something, where we know these are the things that work across the board, or work 90% of the time in these kinds of problems. Right now, that information almost certainly exists, but in a very distributed, disparate way, kind of in the ether on the internet, across random blog posts and maybe as tidbits in the documentation of certain frameworks. So you find those golden nuggets and you'll be like, hey, these people are saying that this works here, and these people also said that this same thing works here; maybe this is just a generally good thing to do. One obvious example is, hey, standardizing your data is generally a good thing to do, and you know that because you've heard people say it and you've seen that it's worked well. But there's almost certainly a whole host of other processes or best practices that don't just involve data pre-processing. There might be things around orchestration and pipelining; there might be things about the best practice for serving up predictions, how it should be done. There are just so many things like that that you can really only get right now via exposure to those problems and exposure to those industry leaders who have actually done those specific things. But that's not the way that you grow the overall industry. What you need to do is disseminate that information, and it really needs to be captured in some way, almost like a Wikipedia for data science, where you know exactly the way that those things are done. Granted, I sort of understand why that hasn't
happened, right? Ultimately, there's this belief that if that information is scarce, it makes what data scientists bring to the table inherently more valuable. To a certain extent, in a very narrow, I-want-to-keep-being-relevant-in-my-job-for-the-next-however-many-years way, yes, that probably makes sense. But in the larger scope of, hey, I want to make all of data science more productive, I don't think it makes sense. If you want to democratize data science, or get more people to be interested in this stuff, to be impactful, and to grow, you really need to disseminate this information. I know that it's going to happen eventually, but I think this kind of consistent, industry-wide creation of something like a best practices wiki would be really, really important. So that's the first thing; that's kind of an across-the-board critique of data science, I would say, about where the challenges are. From a different perspective, the other really big challenge is buy-in by senior executives within legacy companies. Look, if you are a company that was built during the original heyday of the web, so I would say basically from, I don't know, '95 until 2010, you're going to have data scientists, you're going to have a commitment to data and insights, because you had to have that to survive, especially when the going was really rough after the dot-com bubble of the early 2000s. You had to be principled and committed to data, and insights from data, in order to survive. So for those kinds of companies, I don't think there's any issue with technical buy-in. If you wind up working for a Google, or a Facebook, or an Airbnb, or an Amazon, or whatever, that's all solved. I don't want to
say it's all solved there, but they know that data is important, and they understand the scope of the technical challenges around data management and processing. They understand that it exists and that it's super important. I don't think that has fully happened in the same way at legacy companies, meaning companies that were founded before 1995, or that weren't founded with the internet in mind as their primary engine for creating value. Among senior leadership at those companies, I do not think they fully understand all of the aspects of what it means to be a fully data-driven, data-science-driven company. There are still lots of places where you have people making decisions based on their gut, on intuition, on industry knowledge. I think all of those things are super important. I'm not saying do not trust your gut, do not use your intuition, or do not use the judgment that you've honed over however many years of being a senior leader at these companies. What I am saying is that in order to continue to survive and thrive, you need to start using data much more significantly to derive maximum value for your business, and I do not think they are doing that as much as they should at these legacy companies.

I'm not just talking about where I work now or where I've worked in the past. It's a general thing: when you ask other data scientists, practitioners, or data leaders at other organizations what their biggest challenge is, nine times out of ten they're going to tell you: look, I understand where the investments should lie, I understand what I need to be doing to make these things happen, but ultimately I need buy-in. I need people in the most senior leadership, you
know, the C-suite executives, to not only say every quarter during their board meetings that we care about data science, but to actively say: we are going to make investments in data warehousing, we are going to make investments in monitoring, we are going to make investments in third-party data that we're going to purchase, because we understand that the more data we have that's relevant to our business processes, the more profitable and the more successful we will be as a business. So ultimately it's the fact that we have at a minimum two different kinds of companies operating within industry today. You have legacy companies that say they're committed to data science but still have not really taken the full plunge. And then you have the companies that account for the vast majority of what we would think of as growth, success, and value creation, whether from a stock market perspective or whatever. We're talking about basically Amazon and the post-Amazon companies created since then, which are all in on data, all in on technical buy-in, and have as a result unlocked trillions in value for themselves and for the economy as a whole.

I think there's so much to unpack in both of these points. The second point that you mentioned, the lack of committed buy-in by senior executives, especially in relatively legacy industries where organizations have not used the internet as their primary source of value creation, especially compared with the Amazons, the Airbnbs, and the Ubers of the world: I think that problem, the data culture problem, is to a certain extent one of the biggest problems affecting data science today. So I'd love to expand on that one. As a data science executive, I'm sure that you had
tons of experience with gaining buy-in from stakeholders, non-technical ones especially. Can you walk us through some of the pitfalls data science teams or data science team leaders often encounter when trying to gain buy-in?

Yeah, so I think there are a couple that I've noticed anyway. The first is one where people are trying to seem helpful, but it can ultimately significantly hinder their long-term success within the company. This is the wizard claim, where people ask you, "Hey, what can you do for us?" and you say, "Well, everything. I can literally do everything. I can answer every question, I can achieve operational success anywhere you put me." And yes, that's probably the case over a long enough timeline. If I had infinite time and infinite resources, I could answer any question, right? But that's not what a senior executive is looking for. What they're saying is, "Look, I have these specific things I care about, and I want concrete improvements on these." The problem comes when you over-promise and under-deliver. My first axiom of being a data science manager is that you always under-promise and over-deliver. You never do the opposite. So you say you can do a little, but you wind up doing double that amount, so that people immediately come away impressed. Doing it that way fosters that buy-in, because as soon as you way over-deliver on a given project, people ask, "Hey, can you improve this one little thing?" and you go, "Sure, oh, and by the way, I also did this, this, this, and this." That's great. What you don't want to do is go, "Oh yeah, I can do that thing, but also I can do this other thing, and this other thing, and this other thing on top of that." And then when you have your next check-in with that senior executive, you've basically gotten 10% of the way across all of those things. You
have nothing actually concrete to show them. So over-promising and under-delivering, which I think is a general problem for any of what I would call innovation-type groups in an organization, will doom your ability to get that longer-term buy-in, because now they can't trust your word. You say that you can do X, Y, Z, A, B, C, and then you wind up only being able to do a third of X, a half of Y, and a tenth of Z, and by the way, none of them is actually a full letter. So now what the hell can I do with that? Over-promising and under-delivering is the best way to sink your ability to get technical buy-in, so definitely don't do that.

The second thing is part of this, but it's really in the actual execution, and this is what people talk about as the perfect being the enemy of the good. Or, as I like to think of it, the perfect being the enemy of the good enough, or the reasonable. That's where you basically say: look, this works well enough that it's better than chance. If you have any kind of improvement over what came before, it's good enough, so let's immediately start putting the ductwork around this so it can be a repeatable process, it can be productionized, et cetera. You hear this a lot. There are two things data scientists always say. One is that 80% of my job is data cleaning. The other is that when it comes to actually building a model, in the first 20% of the time you get 80% of the performance, and then in the remaining 80% of the time you get the remaining 20% of the performance. So what does that actually mean? If you think about it in terms of actual time, it means that if you want to get a reasonably good model
and you spend the first month doing that, then if you want to get even a 5, 10, or 15% boost, it's going to take you months and months of additional work. So what does that mean? It means those months of additional work, where you got marginal improvements in the overall quality of the model, have effectively taken precedence over actually productionizing the model. What should have happened is that as soon as you got to something that was better than nothing, or better than what came before, no matter how limited that improvement was, you should have immediately started building out the pipelines, the reporting, the monitoring, automating the ETL processes, and all of that other stuff needed to get it to a place where it's actually a data product.

That's the second really salient point. Say you're meeting with senior executives and you're always just going, "Hey, we improved our model by 5%." You're not going to get a weekly meeting with a senior executive, so let's say it's a monthly check-in. Month one you go, "Hey, we got to 80%." Month two you go, "Hey, we got to 85%." Month three you go, "Hey, we got to 87%." Month four you go, "We got to 89%." The executive at this point is going to be like, "What are these people doing?" 80% three months ago was way better, in terms of potential revenue increases or decreases in losses, than 89% now without any actual data product built and nothing business-critical running on it. So that, I think, is really critical. If you want to get solid technical buy-in from a senior leader, you really have to,
as soon as you meet criterion, where criterion here has to be super low, where you literally just say it's better than what we had before, whatever that was, immediately build something on top of it. Build the non-ML, or non-core-ML, parts of that product so that you can immediately show that this thing works: we can deploy it today, it's not going to be great, but it's going to be better than what we had, and that means you're saving money. As soon as the senior leader says, "This is great, we didn't have anything before and now this thing works," that's where they're immediately going to be like, "OK, this is really awesome, how can we get more?" And at that point you can say, "Well, you see we did this. We can do so much more, but you have to understand, here are all of the obstacles that we're facing." So you basically use a quick win to then drive the more lasting, more difficult, longer-term change in that organization.

Yeah, getting a quick win is so essential, first to bringing up an enthusiastic data culture within an organization, but also to getting organizational buy-in, because you need to be able to provide value fast. Otherwise there are going to be questions around the value of data science in general. So on the flip side, what are some of the best practices you've found that can help ease alignment there and ensure organizational buy-in around data projects?

Yeah, so it's basically take everything I said and do kind of the opposite, with a little bit of other stuff thrown in. First, I think the most critical business-related aspect of being a data scientist is operationalizing, converting a statement by a senior leader into something that's measurable. So this person says, hey, I want to reduce X by Y percent, or they say, I want to increase
revenue on this thing by this amount. You have to say: OK, increase revenue by this amount. What is tied to revenue? How can we measure that tie to revenue? What aspect of that revenue generation process is currently the least efficient, and how can we use ML to improve that efficiency in some way? So the idea is that you have to operationalize whatever it is that senior executive asked for, convert it into something that can be measured, that can be converted into tables and bits, and do that as early as possible. And make sure the senior executive is aware of what those criteria are and agrees that they make sense in their context.

For example, go back to that original question, where the senior executive said, "We want to increase our revenue for this specific product by 10%." You look at the product and you go: OK, one way you can do that is by growing your subscriber base, and another way is by limiting churn. So you go back to the person: "You said you want to increase revenue by this much, so why don't we think of a project where we actually increase the subscriber base?" You tell the executive that, and they go, "No, no, that's not going to work. We've already saturated the market, there are no more new subscribers we can go after, and our acquisition costs are going to be too high." If you get to that point quickly, you can immediately say: OK, we're not going to go down the rabbit hole of what we can do to acquire new customers, so let's look at the other thing. "What about churn? What if we reduce churn by this amount?" And the executive goes, "Yes, exactly. The way I think we should increase our revenue is by limiting churn." And so now immediately you're like: OK,
great, I understand that it's a churn issue, so let me get to tackling that. In general, what's really important is making explicit the things that are implicit in what the executive is saying. Drawing them out is really important for at least starting on the right foot, because otherwise you might assume that they're saying one thing when they're saying something totally different, and you really have to bring it out of them. So that's the first piece: operationalizing exactly what your success criteria are as early as possible. Basically, as soon as you have that first meeting, you make sure you understand exactly what it is that they are interested in tackling, and anything that is vague is made explicit as quickly as possible, because that means you can immediately get started.

Now, once you've operationalized what your criteria for success are, the next thing, and this is kind of obvious, but you would be shocked at how difficult it sometimes is, is getting access to the data that you need as early as possible in a project's lifetime. Basically, as soon as you have that first meeting, you need to get access to this data. That doesn't mean you need the full database. You just need something: a sample, anything that shows what the actual real data looks like. It doesn't have to be access to a production data warehouse, and it doesn't have to be all of the flat files that have ever existed. Just something, because ultimately that's the only way you can really measure the true amount of effort that is going to be necessary for this project to become viable. And the reason I say this, and this is the only way that I've ever seen anything
work in an organization, and I don't think it's because of anything nefarious, it's just because people assume they know things that they don't, is that you don't know the state of that data. Never, ever trust anyone's claims about cleanliness, data structure, data frequency, or any assumptions they have or things they think they know about the data until you've done basic EDA, so exploratory data analysis, on it. Until then you don't know anything about the data. It could be anything. They could tell you that it's in a pristine state, and then you get it and 90% of the columns have 50% missing values. One column might actually combine several different data formats. I've seen plenty of cases where something that's a date is a timestamp in some rows, a month and a day in others, and sometimes just Unix time. There are just so many things that you have to check, and the only way you can do that is by getting access to a snippet of what you're going to be working on as early as possible, because the earlier you do that, the better you can understand what your true timelines are going to be.

The next part is just the inverse of what I said earlier about the perfect becoming the enemy of the good enough. What I'm saying here is: as soon as you have something that's at some reasonable criterion, stop focusing on the core ML parts of your product and immediately allocate as many resources as you can to actually standing this thing up. It's fun to get incremental gains. There's a little dopamine drip that happens any time you get a slight performance boost in your model. But it's not the stuff that's actually going to get your
project across the finish line. What you need to do is build out the non-core-ML parts of your project in order for it to succeed within the timelines that you've given people. The monitoring, the visualization, the pipelining is the stuff that you need to build to get to the finish line, so start building it as soon as you possibly can, because that's ultimately what's going to provide true value to both your business and your senior stakeholders.

And then lastly, this is a general resourcing, allocation, and estimation thing that I've learned to do after being burned a couple of times. You have your internal estimate of how long you think a given project will take. Just double that, and give that as your actual timeline to your senior stakeholder. So if you think something will take you a month and a half internally, tell the senior executive that it's going to be three months. That way, when you do deliver it in two or two and a half months, it's seen as a huge improvement, and everyone's really happy. Again, this is the idea that you always want to under-promise and over-deliver. Doing that is the way you get senior executives to buy in effectively and consistently on the projects that you begin.

Yeah, I think this is really solid advice, especially when applied to organizations that are still maturing their data science competencies. I would assume, for example, that access to data is not a major problem at the Ubers and Airbnbs of the world, but it is quite a ubiquitous problem throughout the industry, so I think this is super useful. With that in mind, I would like to segue to data democratization, because I think a really important aspect of data science is not only producing data
products but really equipping the rest of the organization with the ability to work with data themselves. How do you view the importance of democratizing data for data science teams as a strategic imperative?

Yeah, I think the question almost answers itself. We know that data is really important. Data is what's going to unlock the largest amount of business value when applied correctly. So the more people in your organization who aren't pure data scientists, data engineers, or data analysts that you can get to have access to that data and start thinking about it, the more successful you're going to become. I think it's absolutely critical to give people outside of data science organizations the tools to do what I like to call fishing for themselves. If you do not have an expert background in a scripting or programming language, in statistics, or, I don't know, in SQL, it would be really great if you could still get to even half of the kinds of questions that you want to answer but currently can't because you don't have the necessary skills. And ultimately, having data scientists do these things is a bad use of resources: getting people in marketing, or in product, or in some other part of your organization answers to questions that they should be able to get themselves. If the data was reasonably structured, or placed somewhere where it could be easily accessed by non-technical people, it would impact them so much more. Like I said, data scientists are expensive, and they spend the vast majority of their time organizing and cleaning data and much less of that time actually mining it. So if you think about it, if you
can get even a small fraction of your organization, but more than currently are like this, to be able to do that... I think my dream would be if everyone could use SQL in the same way that they use Excel. If they could even get to that point, the entire organization would benefit immensely. The organization's ability to tackle questions quickly would, I would bet, grow by an order of magnitude. So yes, I think data democratization is super important.

Yeah, 100%. Even internally at DataCamp, for example, we have a centralized data warehouse, and on top of it you can have something like Metabase, or some sort of connection to a SQL database. Most people at DataCamp know how to use SQL, and that has really enabled everyone to answer their own data questions. If you're on the sales team and you want to see which account has the highest sales, you can check that immediately. If you're on the marketing team and you want to optimize spend somehow, all of this analysis is done immediately through Metabase. Like in any organization, there are so many things we can do to become more data mature, but it has really changed how we interact with data in that sense. While this is highly use-case or industry specific, what are some of the low-hanging fruit that data science teams can quickly implement today to further democratize data science and, as you said, give people the ability to fish for themselves?

Yeah, so I think there are two quick things they can do. One is to provide an aggregated data view, at the level at which business analysts would typically see data, and surface it up to people so they can plug away at it in a dashboard-like environment, something like a Tableau. I think that would probably be the first thing. So basically you figure out some reasonable level of data granularity,
you surface up a table at that data granularity, and you provide updates to that table. It could be a materialized view that's constantly being updated with fresh data at a set aggregation level. And then you surface it up either in a dashboard or, really, I think the other way is something like what you just talked about, like a Metabase. I know there are other tools that provide effectively an Excel-like connector on top of the data warehouse. If you can do something like that, I think it would be a very easy quick win: getting something in place that people in the organization are reasonably comfortable with, an Excel-like product that actually connects to a data warehouse holding, like I said, terabytes of data. That would be one very quick way to unlock that value.

Now, unfortunately, you will have to do some socialization around the dos and don'ts. If you have a data warehouse and people are using SQL, that's well and good, but you have to let them know what to watch for. If they're looking at the top X in Y, so the top salesperson, or the top account across the organization, or who has generated the most sales over the past month, that's great. But if they're trying to pull larger segments of the data, it's really important that they know there are certain things they shouldn't do because it could brick your database. For example, doing a SELECT * without a LIMIT, or other basic pitfalls they should avoid when performing these kinds of queries.
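
As a rough illustration of the two ideas in this answer (an aggregated view at a set granularity, and bounded "top X in Y" queries instead of an unbounded SELECT *), here is a minimal Python sketch using the standard-library sqlite3 module. The `sales` table, its columns, the view name `sales_by_account_month`, the `top_accounts` helper, and the rows are all hypothetical and invented for this example. SQLite has no materialized views, so an ordinary view stands in for one; a real warehouse would likely use its own materialized view or a scheduled summary table.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (account TEXT, rep TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("acme", "dana", "2021-05", 1200.0),
        ("acme", "dana", "2021-05", 300.0),
        ("globex", "lee", "2021-05", 900.0),
        ("initech", "lee", "2021-05", 450.0),
    ],
)

# Assumption: SQLite lacks materialized views, so a plain view stands in here.
# The view fixes the granularity at one row per account per month.
conn.execute(
    """
    CREATE VIEW sales_by_account_month AS
    SELECT account, month, SUM(amount) AS total_amount, COUNT(*) AS n_orders
    FROM sales
    GROUP BY account, month
    """
)


def top_accounts(connection, month, n=10):
    """Return the n highest-grossing accounts for a month, always bounded by LIMIT."""
    query = """
        SELECT account, total_amount
        FROM sales_by_account_month
        WHERE month = ?
        ORDER BY total_amount DESC
        LIMIT ?
    """
    return connection.execute(query, (month, n)).fetchall()


top = top_accounts(conn, "2021-05", n=2)
print(top)
```

Routing analysts to the aggregated view and always passing an explicit LIMIT is one way to keep exploratory queries cheap; it is a sketch of the habit being described, not a description of any particular team's setup.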
I assume that those same kinds of things would happen even if you had an Excel-like connector. But I think those are the two things you can do: one, provide a reasonably aggregated view that can be accessed by people, and two, once you have that reasonably aggregated view, have something like an Excel-like connector that people can connect to it. I've never used Metabase, but now I'm very interested in what this is, so I'm going to have to check that out. Thank you for that intel. But I know there are other tools like that, which provide basically a query layer on top of your data warehouses, so I would suggest that as well.

Yeah, exactly. And in your experience as well, what do you think are the obstacles standing in the way of enabling really mature, robust data democratization, and what are some of the tactics that can alleviate them?

Yeah, so this is going to be my plug for DataCamp in this podcast. The first thing I would say is that there's fundamentally a skills disconnect between what's necessary in the modern data-based company and what people actually possess. So the first thing people should learn is SQL. Take some courses on DataCamp and learn SQL. If you don't want to use DataCamp, go somewhere else and learn SQL, but learn some SQL. I think that is absolutely the best way for a person to level up their data abilities nowadays. SQL is older than Excel and more powerful than Excel, at least from a data munging perspective, maybe not from a bunch of other perspectives. I think anybody who is comfortable in Excel should learn to become comfortable in a SQL environment. And I mean, you think about a
business analyst: they live and breathe Excel. They know it really well, and they do all sorts of crazy things. I've seen people do stuff in Excel where I was just like, what is this? This is not Excel. You have scripts in here. There's a cell with literally enough text inside of it to fill an entire page. And it's like, dude, you're literally creating a Mario clone inside of this Excel spreadsheet, what are you doing? So clearly, if you can create Tetris inside of an Excel window, you can learn to use SQL. You are very, very good at manipulating data. It's a slightly different way to think, but you should totally be able to do that.

In general, that's the way you're going to have to mature technically, both yourself and your entire data org as a whole, because data now lives in the cloud. The time of people passing around Excel spreadsheets and saving them to some J: drive somewhere, I mean, that's going to keep existing, but that's not where your golden records are going to live. If you as an organization are still living in a world where all of your master data sits across 300 or 500 different Excel spreadsheets, maybe that's sustainable in the near term, but it's not going to be sustainable over the longer term. That stuff is going to be in a cloud data warehouse or a cloud database of some sort, and in order to access it, to perform any kind of non-trivial query or non-trivial aggregation on it, you're going to need to use SQL. So you might as well learn. I think that's just absolutely important. I think you
can do that in lots of places. I know that DataCamp has some excellent SQL courses, so those of you out there who want to learn more should totally check that out. I didn't write any of them, so this is not a plug for any of my courses. Anyway, that's the first thing I would say.

The next is more of a description of the entire data enterprise as a whole, and it's really this lack of understanding or appreciation of how the way your source data is collected and stored, wherever those signals are coming from, impacts how quickly or easily a given question can be answered. This is more of a problem, again, for senior executives, where they just assume: hey, we have this data, right? So why can't you just answer this question? We have every possible interaction that's ever happened on our website, or our app, or our service, whatever it is. Why can't you just tell me how many X's there are and why? And the reason I can't tell you is because of, one, the way the data is stored, or two, the way the data was collected. It's about having an appreciation that, for example, if the way we store sessions is separate from the way we store users, it's very difficult to figure out how many unique users were on your platform over the past month. Senior executives often don't appreciate that the way this data is actually sourced makes what you think is a simple question into a fairly difficult one. It's not nearly as trivial as you thought. So it comes down to getting people to understand this. I've definitely had to do some of this myself, and it's paid dividends, because they then can push back on others and ask, why the hell was this built
like this? Why did we not do it this way? And that immediately makes the accountability for how data is processed, cleaned, and stored a lot more visible, a lot more transparent. When you identify why there are these kinds of breaks or gaps in the system, it immediately makes it so that everyone has to talk to each other a lot more, and the more they talk, the more, hopefully, the issues they're having are going to be solved. So I think that's the second obstacle to data democratization. It's this notion that being able to understand how the data was sourced, and being able to effectively evangelize how the data is structured, immediately lets people understand two things. One, they have more of an appreciation for how difficult some things are. Two, they also have a much better understanding that if something takes a while, it means there are probably some significant gaps in the way things are happening. I know this isn't really a point on democratization of data access. It's more of a point on democratization of the challenges around data, if you will.

No, I think democratizing data access and providing extreme context around how data is sourced and how it impacts a given function are both super important in the formula of data democratization. Sergey, it was a huge pleasure chatting. Before we let you go, do you have any other call to action to make?

You know, one, I wanted to say thank you very much, Adel. This was really lovely. This was a great way for me to relive my entire data journey up to this point, so this was cool. I think the only call to action I really have is that everyone should practice a little bit more humility
when it comes to data science i think i mean this stuff is hard and i think that everyone most people have their they're kind of their best uh um or the your companies or whatever it is that you're doing that you have the best interests in mind i don't think everyone is out there nefariously trying to ruin data projects uh so i think practicing some humility when it comes to both you know practicing data science and also evangelizing for data science is really really important and i think as part of that like when you're when you're humble it means that you always have room for growth you have you have more opportunities to learn i think you know ultimately um learning is the way that that you make both the largest impact within within any organization that you work for but it also makes for a much more meaningful life i mean i've i've enjoyed learning uh about new techniques news new uh frameworks and all of that stuff and i think as long as you keep learning uh as a data scientist or really just across your entire life uh i think you'll find that your work and the things that you do are gonna become a lot more meaningful so that's that's my hope for everyone stay humble and stay stay learning that's awesome and i would highly recommend that you check out extreme gradient boosting on data camp taught by sergey fokelson and with that in mind thank you so much sergey thank you thank you again adele that's it for today's episode of dataframed thanks for being with us i really enjoyed sergey's insights on how data science has evolved over the years and what is still remaining to really scale the impact of data science across organizations and industries at large if you enjoyed this episode make sure to leave a review on itunes our next episode will be with bar moses ceo of monte carlo data on the data quality challenges data teams face and the importance of data observability and reliability i hope it will be useful for you and we hope to catch you next time on data 
framed\n"
