#15 Building Data Science Teams (Drew Conway)

The Emergence of Data Science Teams and the Challenges of Building Industrial-Grade Data Products

Drew, founder and CEO of Alluvium, recently joined me on the show to discuss how he builds data science teams and the unique challenges of building data science products for industrial users. Drew believes that data science teams should cover the full arc of the work, from raw data through insights to actionable recommendations, and he emphasized that communication across these different functions is just as crucial as technical expertise.

Drew also discussed his team's approach to recruitment, noting that the recruiting process should mirror the day-to-day job as closely as possible. In practice, this means looking for individuals who not only have strong technical skills but also a passion for data science and a willingness to learn and adapt. He emphasized the importance of building a diverse team with a range of skill sets and experiences.

One of the key challenges that Drew's team faces is the sheer volume of data that industrial users generate; by his estimate, a single oil refinery can produce more data in a day than all of Twitter over the same period. That makes it essential for data scientists to have tools and methodologies that can scale to meet these demands, and Drew highlighted the importance of developing techniques that do so.
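To make that concrete, here is a minimal, hypothetical sketch (not Alluvium's actual tooling) of one common way to tame high-volume sensor streams in Python: summarize readings over a rolling window so that downstream models consume compact statistics rather than every raw reading. The sensor values and window size are illustrative assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingWindow:
    """Fixed-size rolling window over a single sensor's readings."""

    def __init__(self, size=1000):
        self.values = deque(maxlen=size)

    def add(self, value):
        """Record one raw reading, evicting the oldest once the window is full."""
        self.values.append(value)

    def summary(self):
        """Return a compact summary that downstream models can consume."""
        vals = list(self.values)
        return {
            "n": len(vals),
            "mean": mean(vals) if vals else None,
            "std": pstdev(vals) if len(vals) > 1 else None,
        }


if __name__ == "__main__":
    # Hypothetical turbine temperature readings; in practice these would
    # arrive continuously from a sensor stream.
    window = RollingWindow(size=5)
    for reading in [98.1, 98.4, 99.0, 120.5, 98.7]:
        window.add(reading)
    print(window.summary())
```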

Drew also discussed the role of development tools and methodology in building data science teams. He emphasized the need for collaboration between different departments, including engineering, product management, and business stakeholders. This requires a deep understanding of both technical and non-technical aspects of the business, as well as a willingness to adapt to changing requirements.

In addition to the technical challenges, Drew also highlighted the importance of managing data-driven systems in production. This involves ensuring that models are properly instrumented and measured, and that they can be compared and contrasted to determine which ones are performing better. He noted that this requires a different set of skills than those required for building individual models.
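As an illustration of the kind of instrumentation Drew describes, here is a minimal, hypothetical sketch (not his team's actual system): each model's predictions are logged alongside the eventual outcomes, and a rolling error metric lets you compare which model is performing better in production. The model names and the choice of mean absolute error are assumptions made for the example.

```python
import random
from collections import defaultdict

# Log of (prediction, observed outcome) pairs per model name.
prediction_log = defaultdict(list)


def record(model_name, prediction, actual):
    """Instrument a model: log every prediction alongside the observed outcome."""
    prediction_log[model_name].append((prediction, actual))


def mean_absolute_error(model_name, last_n=100):
    """Score a model on its most recent predictions so models can be compared."""
    pairs = prediction_log[model_name][-last_n:]
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


if __name__ == "__main__":
    random.seed(0)
    # Simulate two models predicting the same hypothetical sensor target:
    # "model_a" tracks the target closely, "model_b" is noisier.
    for _ in range(200):
        actual = random.gauss(50.0, 5.0)
        record("model_a", actual + random.gauss(0.0, 1.0), actual)
        record("model_b", actual + random.gauss(0.0, 3.0), actual)
    for name in ("model_a", "model_b"):
        print(name, round(mean_absolute_error(name), 2))
```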

The conversation also touched on DevOps practices for data science teams. Drew emphasized the need for clear communication between different stakeholders, including developers, product managers, and business leaders. This involves defining clear goals and metrics, as well as establishing processes for deploying and maintaining data-driven systems.

Finally, Drew discussed his vision for the future of data science teams. He believes that these teams will continue to evolve and adapt to changing requirements, requiring a deep understanding of both technical and non-technical aspects of the business. He emphasized the importance of collaboration, communication, and continuous learning in building successful data science teams.

Actionable Advice from Drew

If you're interested in building a career as a data scientist or joining an existing team, Drew offers some actionable advice. First, he recommends developing your skills and experience by writing and speaking publicly about what you do. This helps you grow a network of contacts and gain confidence communicating complex ideas.

Drew also emphasizes the importance of being willing to learn and adapt. As data science continues to evolve and adapt to changing requirements, it's essential to stay up-to-date with the latest tools, techniques, and methodologies.

For those just starting out, Drew recommends taking a baby step towards teaching others about what you do. This can be as simple as giving a presentation at a meet-up or writing a blog post about your experiences. By doing so, you'll gain a deeper understanding of your own skills and ideas, as well as build a network of contacts who can provide support and guidance.

Recruitment at Alluvium

Alluvium is currently hiring data scientists at the mid and senior levels, as well as for entry-level positions. The company also has opportunities available for back-end engineers, DevOps engineers, and product engineers. If you're interested in joining the team, be sure to check out their careers page at alluvium.io/careers.

Conclusion

Drew's conversation with me highlighted the unique challenges of building data science teams and products for industrial users. From handling massive volumes of data to managing complex systems in production, these teams require a deep understanding of both technical and non-technical aspects of the business. By embracing collaboration, communication, and continuous learning, data science teams can build successful products that drive real value for their organizations.

As we look to the future of data science, it's essential to stay adaptable and willing to learn. Whether you're just starting out or looking to make a career shift, there are many opportunities available in this exciting field. By following Drew's advice and staying up-to-date with the latest tools and methodologies, you can build a successful career as a data scientist.

Data Science for Social Good

In our next episode, we'll be joined by Mara Averick, data nerd at large and tidyverse developer advocate at RStudio. We'll discuss the role of data science in social good, including civic tech and sports analytics. We'll also explore where data science paradigms like the tidyverse fit in the data science ecosystem as a whole.

This conversation promises to be informative and thought-provoking, and we can't wait to dive in with Mara. Make sure to tune in for our next episode!

"WEBVTTKind: captionsLanguage: enin this episode of data framed a data count podcast I'll be speaking with drew Conway world renowned data scientist entrepreneur author speaker and creator of the data science Venn diagram drew and I will be talking about how to build data science teams along with the unique challenges of building data science products for industrial users how does drew now view the Venn circles he created those of hacking skills mathematical and statistical knowledge and substantive expertise when building out data science teams stick around for this and much more to set the scene the first half of the show will focus on what data science looks like today through the lens of the evolution of the data science Venn diagram and the unique place data science holds in an industrial setting the second half will use all of this knowledge to focus on data science team building and recruiting I'm Hugo von Anderson a data scientist that data camp and this is data frame welcome to data frame a weekly data count podcast exploring what data science looks like on the ground for working data scientists and what problem is it consult I'm your host Hugo von Anderson you can follow me on Twitter as you go down and data cam at data count you can find all our episodes and show notes at data camp comm slash community slash podcast hi drew and welcome to data framed hey Hugo it's great to be here it's great to have you here and I'm really looking forward to chatting with you about how to build a data science team along with the unique challenges of building data science products for industrial users but first I'd like to find out a bit about you what are you known for in the data science community right so a long time ago I was probably known for being one of the earliest bloggers and data science so as the story goes when I was first admitted to graduate school NYU I was really excited about finally having the opportunity to speak publicly about you know the interest that I had and some of the work that I was doing so I started this little tiny blog called zero intelligence agents and and really just used it kind of as a public notebook the things that I was working on you know code that I was writing things that were interesting to me and then eventually kind of combined that with very early social media days on Twitter and and found that there was this small community of folks who were you know writing and tweeting about the interesting data stuff that they were doing even predating a kind of generally agreed-upon term of data science and then of course the thing that I'm actually most known for in the data science community is the data signs been diagram of course and then you were heavily and you've been heavily involved in the NYC data science community as well right yeah and so the you know probably the thing that that I'm most proud of in terms of my contribution back to the data science community has been my ability to or at least my concentration from being in graduate school at NYU and then and then building companies and building teams in New York is really kind of planting a flag for New York City as a great place to be a data scientist and great place to do data sighing yeah and you actually have a strong argument that New York City was a place where data science was being done even before data science was a discipline right yeah that's exactly right I mean you know we think about what the anchor industries have always been in New York and you know you can take tech out of that 
and just think about the financial services and banking industries the media industries and the advertising industries and all three of those are really data centric and so I think what what we see now in New York City is is that it's it's sort of always been this place where data and the robust analysis of data has been central to business and to profit-making and now with technology and its dissemination and movement into other verticals those industries themselves you know financial services and and media and ad tech have become a big part of what makes New York City unique and and then beyond that now I think with the university system and the the amount of startups that are in the space New York has become a really great place to do this kind of work I think there's a bunch of other things that are really unique about New York that that tend to enhance that one of which is just the geography of the city right and you know it's easy to always make comparisons to the West Coast but you know New York City is a tiny little island and we're all we're all crammed on it and if I want to go have a conversation with you I can just jump on the subway go downtown and there you are and if I want to go to a meet-up I can go there I can go speak to a professor at a university and everything is within you know five or six square miles of where I'm sitting right now and I think that really has changed how people in in this city can can do work that's right and there's such a strong sense of community around data science here agreed in and I think again part is just kind of the the culture of New York I mean people you know for better or worse you know we're loud we're brash we like to talk about what we're doing and that means that that a lot of ideas get shared that makes all the difference in the world and you've spoken to a number number of topics say that will we'll get back to your early days as a as a blogger really contributing to the evolution of data science and and and defining it initially as a career the role of community our data science in in New York City as well and also the famous Venn diagram so I thought maybe you could tell us a bit about that the Venn diagram I'm sure you don't want to talk about I'm sure you've been talking about it for years so I'm sure you don't want to talk about it too much but where where it came from and where you see its evolution has gone since then yeah sure so the origin story of the Venn diagram is is actually intimately tied to the the data science community in New York City you know about I guess a little less than than ten years ago when I when I first arrived in New York I kind of inserted myself into what was then this again kind of nascent community of folks in academia and Industry and the startups who were doing this work and so eventually there formed a kind of not an almost working group of folks who every month we would meet for a potluck brunch at the top of the New York Times building in the R&D floor which is all the way at the top of the New York Times building and just sit around on a Sunday morning and kind of talk about what this data science thing was you know we would have topics around you know how would you think about teaching it and you know names that are that I think are now basically associated with data science or most venerable names were there just thinking about it so you know folks like mark Hanson and Chris Wiggins from from the academic side and Hilary Mason and Mike the kiddies on the on the commercial side you know 
I was there and when was this this was back in you know 2009-2010 we were having these these morning conversations and you know sometimes they're mostly just for fun I mean we were all friends we knew each other from from various walks of life and we would just come together and chat and so you know one at one day we were having this conversation around you know what is data science like how would we think about defining it what are the requirements to be a good one and we have this wonderful conversation you know Chris Wiggins and Hilary Mason were we're kind of leading the chat and and I kind of walked away from that that discussion with a whole bunch of ideas in my head as to you know well okay this is this is what these guys think and here's how I might interpret this and so the following week this is when I was still in graduate school so on one particular class I sat all the way in the back of the lecture hall and just opened up my laptop and started kind of thinking about how I would define data science based on the ideas that that had been discussed at this potluck breakfast which ultimately led me to you know firing up my you know open source illustrator and creating what is now the data science Ben diagram and that went and then ultimately wrote a blog post about it that went as phi role as a data science post could go viral in you know circa 2010 and will definitely link to that blog post in the show notes and of course in the middle of that Venn diagram you have data science but maybe you can tell us the things that revolve around it that a necessary skill yeah so you know the the the central des part of the debate I think that the that we were having back in 2010 and honestly it seems like many folks are still having although it you know it's shattered into many more dimensions now is you know what are the constituent pieces that some person should have if they want to actually be a data scientist and so you know I broke this into three big groups one is you have to be competent in using and developing software what I refer to as hacking skills and what I really meant by that is this is not someone who is a professional software engineer hacking skills means someone who is you know able to fire up the command line can manipulate text knows how to work with a scripting language so that they can produce repeatable maybe shareable and reproducible pieces of code that could be used to analyze data you know again there there wasn't a sense of professional application it was just you know do you know some stuff can you actually code code enough to be able to build kind of an MVP of something the other piece of it was kind of the academic side so if you're going to be building these things you should have some real kind of grounding in the statistics and the mathematics that go into the models and the methods that you're using alright if you don't have that then you may then you may simply be kind of pointing a very powerful technical weapon at data and not actually know what's going on and then the third piece which I think ultimately becomes kind of the the the glue that brings it all together is what I called substantive expertise or really kind of subject matter expertise and this has nothing to do with your skills as a coder or your competency as a statistician but more do you know how to ask good questions right because at the end of the day and again thinking about this back in 2010 ultimately what I real what I was observing how in the in the kind of intellectual marketplace 
so to speak is that there tended to be a lot of people or most people were good at coding and a lot of people had or could get training in statistics in math but they didn't really know how to ask good questions and if you don't have a kind of point of view on a problem or point of view on a data set then you're kind of starting with nothing because no matter how much data analysis you do if you're asking the wrong questions you're kind of just you know treading water and so we combine all three of those to create data science and of course there's the the kind of secondary overlaps that occur between all of them and and I think the one that for a lot of people was was most satisfying was you know people were trying to make this distinction between well as data science machine learning and what's the difference and so what seemed obvious to me is if you know if you have hacking skills and you know about statistics and math you put those together that's really what machine learning is and certainly was those many years ago and so we you know we kind of built that and I wanted to balance that with what I viewed again at the time as a PhD student as what kind of traditional research is right so if you have this methodological grounding in statistics in math but you also have substantive expertise as always in a political science department so people who have you know who are working on American politics questions or looking at international relations and conflicts questions they know a lot about those subjects and oftentimes they can apply specific mathematical to try to estimate what some of what what they're seeing in data but that's not data science that's traditional research and so I kind of had this overlap of traditional research and then maybe the the other overlap that was that was I don't know if it was controversial but but people had a lot to say about it was his idea that if you had substantive expertise that is he knew you knew a subject well enough to ask good questions but then also we're able to kind of hack and write code and get some answer if you didn't know the statistical and mathematical rounding and what those answers meant then you're in the danger zone right often referred to this as kind of you know you knew enough to be dangerous and that's kind of the worst place to be because then you could create very misleading results and I was trying to use that as a guard against what what I would hope would be folks not you know putting data science on a path to you know snake oil yeah absolutely and I actually that the danger zone is is very interesting to me I love that you put an exclamation point at the end of it to draw even even more attention to it but I think the fact that the idea that people can especially with you know the the raging success of all these new fantastic api's that allow people to fit and predict a variety of models after importing data without necessarily knowing a lot about the models they're they're using it is actually incredibly dangerous yeah and I think you know there's there's there's a whole bunch of dimensions to this right I think you know take this take adverse from a like-kind even from a commercial perspective the the thing that I found most interesting about how data science tools and even platforms and applications have grown over the course of the last you know five to ten years for a while there there was this real attraction to say building data science in a box tool so it's like you know take your data set and stick it into this 
tool and it would it would predict for you all the possible outcomes for something and like wallah there you have a you know a you have a model you can put that model in production and it's great and so these tools were targeted people that had substantive expertise right so I was like I can build this tool and I can go sell it to someone in an insurance company and that insurance company will have better actuarial tables it doesn't matter that the you know the folks in the insurance company might not exactly know what my random forest is doing and why it's making those distinctions it just matters that they're getting better results then I remember being around and hearing about a lot of those companies and quite honestly at the time thing it seemed like a reasonable idea but having this kind of sinking feeling that they were building tools that were kind of in this danger zone realm and what my observation has been since then is that the reality is that a lot of those tools don't really fit into a real use case right they're sort of in this valley between two real use cases one being okay you have you know no hacking skills or methodological skills and your primary tool is like Excel and you're good at making charts but you don't really you don't really know how to how to build things and so you have a specific kind of service-oriented need and then there's the other folks who are actually are really good at the substantive stuff and the and the mathematical stuff and so they need they need really granular tools to do their work and so they're the ones that are going to ultimately probably learn or or learn Python and become data scientists and then but there's nobody really in between and so ultimately I almost saw tools being built that fell into that danger zone and ultimately didn't have a lot of success yeah and I think that also speaks to the fact that pursuing let's say accuracy or model performance at the expense of other other qualities is also inherently dangerous right and now we're seeing kind of a rise in a desire need in society for machine learning interpretability which brings us back to more substantive expertise I think that's a really good point you know there's these things can to Heaven flow and so there's there's kind of this natural I think early attraction to black box tools because in a lot of this I think follows almost from a lot and sometimes almost the negative downside of the Venn diagram or folks kind of viewing these holistic definitions of things as being these kind of almost unicorn like individuals and so what that does is it casts a shadow over the discipline that says okay well only a very specific kind of person can do this somebody that has all of these things and if you don't have that well then you need a very specific tool to do it and of course you know history of data science and misery of many technical craft has really you know is not about finding one person who does everything right it's always about finding a group of people who know about who know a lot about some parts of it and can work together and so I think there's always as natural early inclination to say okay I can build I can build this tool that does this thing really well and I will try to find somebody to use it and again I think the the results in the market have not been not been great for those approaches for sure and you've actually preempted my next question in some sense I was gonna play devil's advocate and ask you whether you thought the entire Venn diagram was a 
danger zone in in itself in terms of it its potential for being misinterpreted and the search for the Unicorn yeah I think you know certainly that was I guess it is still true if if there's one thing that I wish I could have been clear around when I introduced it is that it wasn't you know it's called the data science Venn diagram it's not called the data scientist Venn diagram and I think a lot of people you know when they looked at it they said oh this is these are all the skills I need to hire for if I'm gonna hire a data scientist and really the idea that the diagram is that this is what the discipline is right and so if we think about other disciplines like software engineering we don't think that there's a canonical software engineer that does everything right and again I think it's the same for data science and and unfortunately in those days you know in to some extent still today I think people view these kind of holistic definitions and as you know the Venn diagram I think is become useful shorthand because it's you know it's an image it's easy to share but there's still a lot of you know pixels in ink that gets filled trying to holistically define what the what this career path is and ultimately I think that's that's that's in some sense a waste of time because we sort of know we know this movie's gonna end we've seen it many many other times and so we should be thinking about it in the context of teams and how people work together yeah absolutely and that's something we'll get to the the future of of data science and I will you know make it clear that your Venn diagram doesn't say data scientist in the middle it specifically says data science and in fact I mean it's not necess surely just a unicorn I think you had a great slide at Jared Landers conference nyr which had a unicorn with a cat with a laser gun already right yeah yeah and I wish I could credit whatever the artist was that created that because I think it's wonderful yeah I mean you know if I recall from that talk you know what I was thinking back on then was was actually part of these early days and you know kind of right at the turn of the decade in kind of 2010-2011 where I was in New York City I was having lots of conversations with people at various companies from early-stage startups to you know fortune 500 companies and when they heard that I was a data scientist or that I could you know help them find data scientists the the punchline of that joke is you know that is that is who they thought they were meeting with right this this cat riding a unicorn with a handgun and a you know flamethrower or something and of course we know that's not true and and the the further the further we get to a professionalized disappoint of data scientists the further that becomes true I couldn't agree more and something that I'm looking forward to chatting about later in this in this episode is about how you'd go about building a data science team from this from this Venn diagram but before we get there I'd like to know a bit about what what you do these days what do you spend most of your time doing I do almost no data science in fact so I'm the founder and CEO of a company called alluvium what we do is we build data data and products for men and women working in complex industrial settings so what I do most of the time is I think really hard about what their problems are and in particular how we how we can build tools that help them better leverage data to make decisions what that means is I spend a whole lot of time listening 
to our customers and asking them those kinds of questions of course I also spend a whole lot of time listening to my team you know my teammates and answering their questions and learning about what kind of techniques you know pretty on the data science side they think would be most applicable to solving these problems and then I also have an opportunity to speak to two folks who might want to think about working for us and so I do I still do a fair amount of recruiting and thinking about you know how to best explain what we do to folks and and how to get them excited about working with us fantastic are you currently hiring oh yeah yeah yeah that was gonna bet that'll be my call to action at the end of you kind of fantastic and so it actually sounds like in some respects you're acting as an interface between the substantive expertise of the industries that you're working with and the hacking hacking skills and and mathematical and statistical knowledge actually that's a great way to think about it you know when when I found it alluvium there was there was no no question in my mind how little I knew about the day-to-day lives of someone you know working in a in a in a power plant or working in an oil refinery but in is in fact one of the core values at alluvium is is about learning and learning firsthand we call seek the first-hand we always want everyone in our company to think about how they can go out and through first-hand knowledge learn about something new and in particular learn about how our customers do their work and and so that's substantive expertise and how industrial operations work how they you know what kind of data gets generated how that data how that data generation process gets instrumented who the actual people who are you know standing on the front lines making decisions with that data who are the people standing in the control room who are observing that data and you know who are the folks back in the headquarters building making business decisions from that data we we want to seek and learn about all of that work so that we can go about building products that actually you know support them in their day-to-day yeah and if I recall correctly your web page says that a lot of data scientists will put on hard hats and go out there in the field oh yeah and not just the data scientist and you know the whole team gets out there we have yet to get alluvium branded hard hat so we're often relying on our hosts to provide us with them but it's it's probably one of the most exciting parts of the job a year ago we went to the the large recycling center out in Brooklyn near in New York fascinating to see how the city of New York Campbell handles its waste and how they how they try to improve the efficient use and recycling of that and then this past winter we we went to a robotics consortium in the Navy Yard and learned about how they're using robotics for for art and for industry and for and for startups as well and I think you know learning has become such a core part of how we do our business that I'm I'm always excited to get a chance to go out and see how people do their work it sounds like an incredible opportunity and when you said you haven't got alluvium branded hard hats yet I just wondered whether you've tried to put alluvium branded laptop stickers on them at any point yeah it's a good idea well you know the nice thing is that I guess we could just you know maybe your your your pre-empting me or we could just buy the hard hats and then just put the stickers on that's 
that's the easiest thing to do yeah exactly so I'd love to talk a bit more about alluvium and I'd like to kind of motivate my question by by quoting you are paraphrasing you you've said that much of data science is a stack of tools developed to deal with big data and designed for the web but what can data science do for non-digital industries so that's really to frame my question which is what are the major challenges that you're trying to solve with your work at alluvia yeah so you know to kind of go back a little bit to that context when I founded the company I was I was coming off of having worked at a at another startup here in the city which which was which was a consumer health company and we were building a product where we were trying to kind of use real-time streaming physiological and telemetry data to help people understand kind of their their overall health and and part of what really attracted me to that opportunity is is actually working with streaming data from the real world in the early early part of my career I'd worked in the international security world and field and I dealt with a lot of data from sensors whether it was telecommunication sensors or measurement and signals out in the field and then you know using that and combining it with highly unstructured data like text reports or images of maps and things like that and and one of things that really stuck for me with from that early experience is that even in those days in those days I mean you know kind of mid 2000s when a lot of what we think of now is this kind of commoditized stack of big data tools really didn't work well for dealing with that that we just had to develop a lot of ad-hoc methods for dealing with that data and so fast-forward to you know where I was working for our two alluvium I I returned back you know almost a decade later to find that it's mostly still the same right that we it's we and by that I mean we as a kind of technology and data community had had seen that there was a ton of value in you know web block server files and search results and clickstream data and all these things that were were produced by and used with in kind of digital platforms but we hadn't thought a lot about how do we how do we do the same sort of stuff with data that's generated outside the web right it's sort of like these these physical systems are so complex and there's so many things that are hard to observe and we also have poor ways of measuring them and we also don't have good software tools for dealing with them right it's sort of like example that I like to say is it's still really hard to predict the weather and you know more than a day out right and part of the reason for that is you know the earth is an extraordinarily complex system and we don't really have good ways of measuring it and we certainly don't have good ways of measuring it and doing analysis on it so when when I had the opportunity to start thinking about my own company and the kinds of problem that I would want to solve the thing that I realized is that this technical problem was still highly present that there was just not a good way of doing kind of distributed real-time unsupervised learning from data from these kind of physical sensors in any unified way right if you had multiple assets across multiple physical locations and you wanted to have a kind of uniform view of how all those things were operating and you want to do that learning without any training data how would you think about doing that so that was kind of the technical 
spark for a lluvia m-- and then ultimately the you know the the founding spark granted the commercial spark for it was was realizing where that problem was most acute so kind again reaching back to this kind of seek the first hand idea I just got out and started talking to folks and quickly it became clear to me that you know the industrial space which has for hundreds of years been in a highly data-driven set of industries whether it's the oil and gas industry the manufacturing industry both for you know discrete and process manufacturing all of which were really good at collecting data but none of which had really matured and in what they would refer to as kind of their digital transformation right these are these are processes and systems that are still in many cases highly analog and even when heavy investments have been made in in generating data there's just not a lot of good tools for doing anything with them so those two ideas kind of slam together then we got to work could you give us an example or a case study of some of the work you've done sure so probably the the best example that I can think of that more sort of you know able to talk about right now was actually an early early pilot that we did with the New Orleans Police Department so this actually Ulta ultimately ended up not being a path that we decided to take commercially but at in the early days we were really interested in what our technology could do say in inside a vehicle you know modern car is basically a motorized computer and so it generates a tremendous amount of data but you want to be able to have both a kind of local view of what wow that vehicle is operating and then kind of a global view and the way that we talk about that at alluvium is through this idea of stability and stability kind of forms the central you know not only language that we use around our products but you know in some sense kind of the core value proposition of the business we right we want we want to provide our customers with a view of the overall stability of their operation and then when those things change we want to be able to quickly alert and guide an operator to where in a system that instability may be coming from so that they can very quickly make an evaluation of that and ultimately take an action if they need to and so in the case of the police department we built a prototype in a pilot for them based on vehicle operation the idea was we wanted to be able to to show how police vehicles were operating in the city and actually do putting software inside the vehicles on the on the vehicle laptops to stream data from the the the OBD sensors the onboard diagnostic sensor which has you know a huge amount of information that you can draw from it to give a kind of global view of the stability of the the vehicles out in the field and so you know we built that ultimately we decided that there was much bigger opportunity in plants and factories and that's where we we focused our attention but we were able to build this this this prototype in this this pilot for them we're able to see you know changes in vehicle operation how that changes stability and we can ultimately you know produce some real interesting insights great so what essentially we're also talking about is not data science standing alone by itself but actually building data products as well yeah and that's that's kind of the whole gig right I think we have a particular point of view at alluvium that you know data science machine learning AI whatever whatever you want to call 
it it only takes you so far right ultimately if you're building a product that is there to support someone making a decision then you need to think about where is the point in which their knowledge their context their expertise need to take over and ultimately make some you know decision adjudication based on what you're presenting them I often you know when I'm introducing what we do to you know say folks in in the industrial space and or potential customers we kind of talk about this tension between data discovery and then data reasoning or reasoning about data and so you know kind of put yourself in the role of an industrial engineer who's standing inside a refinery alright they are basically beset on all sides by this kind of wave of information I think I forget what the exact statistic is but I think you know the average oil refinery will produce more data in a day than you know all of Twitter right in the same time period and so if you're that person standing there and your job is to mitigate any any problems and to track how this process is working there's just no possible way that you you as an individual or even a highly competent and highly trained team of mechanical and industrial engineers could could could do that high dimensional math problem in their heads or even use tools to do it but a computer is really good at that right a computer is really good at taking in lots of information performing lots of of analyses on it and applying lots of dimensionality reduction methods to that data to try to identify you know what our overall or systematic changes in it and so we believe that well-designed data tools should really be pulling the cognitive responsibilities away from this data discovery to data reasoning because computers are really bad at reasoning about data and they don't really know why something changes they might just know that it does change but a person radicular Lee a highly trained industrial engineer knows exactly why something might be changing if they're presented with the right information at the right time and so we think about you know what is the equilibrium point or the perfect optimal point in which we can kind of handoff and automatically generate it finding about these kinds of changes to an operator who can quickly you know move through that information and make an evaluation or take an action and then based on that action have the system learn from that and get smarter and get better at identifying important changes or as it changes that aren't important because at the end of the day we want that experience for that human to be as good as possible because we really want to respect their time because you know didn't justify nothing I'll say in this is that you know our customers are a little atypical for data science products because I don't really care about software at all right sometimes software is like the thing that they have to do or the thing that they go to when they really need help with a very specific kind but their job is to run a plant and that is a that is a physically intense job not one that that typically requires staring at a computer screen for very long but when they do look at a computer screen we want that to be a really high value interaction exactly and software is a tool to help them answer questions and to get deliverables exactly and it's in particular it's a tool that they're that they tend to be pretty love right so you have to really be able to show value quickly yeah I mean I've been doing this longer than self weighs 
being around right exactly yeah and I you know one of the things I talk about with the team is you know if you even generously if you think about all right well what is the history of kind of modern Big Data as we think about it today you know it's roughly ten years old maybe fifteen years old if you assume that you know Google and Yahoo were developing these things four years before they were released but folks who've been working in the industrials place have seen hundreds of years of technology revolution change the way that they do their businesses and so our little drop in the bucket barely makes it waves after a short segment we'll jump right back into our interview with Drew to use everything we've just discovered to focus our attention squarely on Drew's approach to data science team building and recruiting let's now jump into a segment called rich famous and popular with Greg Wilson who wrangles instructor training at data camp hey Greg g'day so Greg what do you have for us today well as a follow-up to our discussion last time about using empirical methods to guide the design of programming languages I'd like to see someone use data science to find actual design patterns in software if you haven't run into the term before a design pattern is something that comes up often enough to be worth giving a name but isn't precise enough to turn into a single general-purpose library the term originated in architecture for example I think most people know what a porch is but if you try to pin down a precise definition it turns out to be surprisingly slippery similarly programmers use terms like pipeline or plugin in more or less the same way to describe more or less the same high-level concept but there are enough different ways to turn those concepts into code that there's never going to be one definitive implementation that everyone uses so how can data science help well data science is pretty good at finding patterns so I think would be cool to see if we could use clustering techniques to find patterns in the way code is structured and used you see the design patterns we have now are all the product of experienced programmers eyeballing their which is great as a starting point but you know different experts will see different patterns or put particular instances into different clusters and so on if we throw the tools we have against hundreds of thousands of source files will we find the same patterns in how classes are structured what if we look at traces of those programs execution will information about when objects are created and how they call each other's methods and so on give us more insight I can see how you would get source code from somewhere like github but where will you get traces from running programs I mean I'm all for helping science but I probably wouldn't let you install something on my computer to look at what I was doing I don't think we'd have to go that far there are enough software projects out there now with decent test suites that we could look at how the code runs itself and we could start small if you look at the work of people like your misogyny ma there's actually a surprising richness in how single variables are used a loop index isn't the same as a state flag which isn't the same as an accumulator but you might need dynamic analysis to tell them apart I don't know and I think the only way to find out is to have someone take a crack at it on software engineering research is already doing this some but the work I've seen has taken the hand-rolled categories as 
a given rather than trying to validate them or discover new ones and I think we've learned a lot by having fresh eyes thank you very much Greg if anyone in the audience is interested in giving this a try please get in touch we'd love to hear from you thanks Greg once again and looking forward to speaking with you soon thanks you go after that interlude it's time to jump back into our chat with Drew Conway as we discussed earlier your work that you personally do is for a large part of it is building and managing data science team so I'd like to really get into that now so hypothetically I'm gonna give you a million dollars to build a data science team how would you spend a million dollars to build a data science team and what would the team look like you know it's interesting the the amount of money is is obviously great because I could I could hire a bunch of data scientists maybe not as many as I would hope with a million dollars but you know the first thing that I that I always talk to folks about when they're thinking about building a data science team is I do you need data science fantastic do you actually need a data scientist right there's I mean I think there's a lot of companies that are data companies that confuse that for I'm a data science company all right so there's lots of great examples you know some of the ones I like the best there's there's a lot of opportunity in the world for taking old difficult to navigate data sets and making them easier to navigate and and easier to draw conclusions from but does that need data science you know do you need a model do you need some prediction or some classification to be good at that probably not so you actually need to hire data scientists and so the first thing that I think about is okay do we actually need that now okay if if we've if we've convinced ourselves that we do then the first question you need the next question you need to ask is you know what what kind of data science is important to my company in my company that wants to do research and develop things at the cutting edge and and I'm willing to put a tremendous amount of resources and a tremendous amount of risk into commercializing academic pursuits well if that's true then you know there's a certain kind of data scientist that you would want to think about hiring and building a team around right folks who actually have experience in independent kind of basic research around methodologies and and data you know might look more like a statistician than an engineer and so thinking about that is really important on the other side and this is we'll be more relevant to you know your listeners or or certainly is more relevant to folks in businesses you know is data science core to your product is it the thing that that actually gets people using it and buying it and coming back is it is it that prediction is it that classification is it that finding that you can provide them with data that makes them use the product is if that's true then you want to think about hiring for a different set of skills you do need software engineers you do need folks who understand how to collaboratively build software and know the that there's always tension between kind of purity of results and functionality of something in a production system and then you know then the place where were we're happy to be talking even more is is that okay let's think about building a recruiting process that actually supports the answers to all of those questions I'd love to talk about that more in in just a 
minute I do think the initial question does the company need data science is is incredible cuz I was actually speaking with a data scientist that at Google a while ago who said to me at an analogy that stuck with me she said I might mess up the analogy slightly but she said a data scientist to a company is like a tiger to a drug dealer and I said I said tell me more and right and she said well if if you're a drug dealer and there's another drug dealer down the road who has a tiger on a leash you're gonna get a tiger on a leash and she said similarly if you're a company and your competition has data scientists you're gonna you're gonna want to get data scientists without actually thinking about whether I need one to to build up my team or to you know meet whatever metrics I'm trying to make I like that a lot I also need to get a tiger now absolutely the the other thing I I found interesting about what what you're saying is that we discussed earlier that um the substantive expertise in in your line of work at alluvium comes from your customers and also your data scientists on the ground that they get out into the field along with everyone else in the company as much as possible how do you think about hiring around the other parts of the Venn diagram I mean presumably you don't require that all your your data scientists have a like a really strong background in in the mathematics and statistics from the academic view nor do they have you know computer science degrees yeah we take our you know what I often say to folks and this is certainly true for alluvium is you know your recruiting process is the first product that you're likely to build to you know to kind of MVP and completion you know if you're hiring someone to do a job then the recruiting process should reflect as closely as possible what how they're going to work in that job every day or at least a close approximation of it right the recruiting process is a naturally asymmetrical event right you know and your team if you're bringing something in to interview them is going to get a ton of information out of that person as to how you think they may perform based on the questions that you ask them right that's entirely one direction the other person that you're bringing in to recruit depending on how you how you how you build that recruitment product will get little to no or a lot of information about what it's going to be like to do this job with you so if I have a recruiting process that starts with you know a puzzle moves to a whiteboarding session and ends with a culture fit interview and maybe some live coding that hasn't that reflects absolutely not at all what my job would be like when I get to get to work so what we do at alluvium is is is we we try to build the whole process start to finish basically as you as you might imagine going from zero to a piece of code deployed into our production system as a data scientist through the whole process right so the first set of the the the kind of first round interview is mostly a you know a get to know you a little bit learn a little bit about what interest you learn a little bit about why we might be interesting to you in particular what is it about this sort of unique intersection of industry data science and product building that's really exciting to you and then you know a little bit of a technical screen just to just have a little bit of a baseline to ask okay you know what kind of tools do you like to use now what are some examples of problems that you've solved with them and ten 
walk us through you know maybe a piece of software that you know you put into production or if you're just coming out of an academic position you know a piece of software that you used in an in a paper and how that process went and then we have a you know what kind of take-home coding exercise but the the coding exercise reflects exactly the kind of problems that that we would use and so you know in our case we actually you know put a lot of effort into building the exam it's a you know we even built our own streaming service it's kind of a stylized problem but it's based on a real problem that we worked on having to do with wind turbine data so we created a streaming service that streams and emits real data from a wind turbine and you're asked to you know kind of do some simple exploratory analysis and then try to try to predict some values coming off of that wind turbine and then you you know you submit that as a pull request to a github repo that we give you as part of this this exercise and then we as a team will look at the results of that and you know if we like what we see and we're interested to learn more about you and how you might fit in we enter we invite you for for an on-site interview and the very first inner you know the very first session in that on-site interview is actually a code review of what you did because if you're working for us and you submit code your code will be reviewed and if you're a data scientist that code will be reviewed both for the the actual technical code itself but also the methods you use and so we we like to we like to be pretty skeptical with folks as to the choices they made even if they are perfectly reasonable and makes sense to get a sense of you know why someone chose to use one method versus the other and a perfectly acceptable answer is this is what fit into the into the time allotted for the exercise and I think you know if you're a professional soft you know professional data scientist and you work in a business that's a perfectly reasonable answer and so you know that's you know just one example of how we do that and then the next the next meeting that you have is we talk about how you know how you might think about kind of expanding this into a larger project you know this is a kind of toy example of something so how we actually think about building this into a product that was designed to do this and we we actually have folks at the company play the role of a customer who would be asking questions about this product and then a software engineer at the company who would who would think about production izing system and then we do have our own you know kind of culture fit interview and we talk about our company values and and how someone might think about what what matters to them and working at a company and then then we mostly just listen we we bring folks around and let them ask questions of me or of anyone else in the company and we really try to you know give someone a sense by the time they've left the interview they really know what it would be like to walk in day one and start working then the last thing I'll say on that and really not you know not to kind of go through at length at the process but if we do alternately get to a place where we want to extend somebody an interview I think it's really important that people know what they're gonna be working on when they get here right if you're hiring someone then you can't articulate in at least bullet form the work that you want them to do when they get here you know why are you 
hiring them do you not have work for them to do so we actually will send someone with their offer letter a list of tasks will say you know here's here's the stuff we'd like you to work on in your first week when you get here here's the things that we think you'll probably grow into working on in your first safe 30 or 45 days and here are some big projects that we'd really like you to be part of and say the first quarter or you know four to six months of you working here and and that's really work well for us that that kind of end-to-end process is a this is what it would be like to work at alluvium oh it sounds like you've developed a very thoughtful approach to recruitment and spend a lot of time perfecting it as well there are a lot of things that that spring to mind there that are firstly the the fact that you have code review that's that's fantastic because not enough places do write but also the fact that you get people in to talk about it and the fact that communication is such an essential aspect of the process the recruiting process and not just communication like see how people talk in terms of culture fit but talking about code explaining their own code and also the point you made of explaining code around the constraints that you have in a business whether it be time or or or and time is you know our most important resource the last thing that really sprung to mind when you explained the recruiting process is how much empathy that there's involved towards some people applying for for the job the fact that you're not only attempting to discover a lot about them which you do in this process but that they're actually discovering with whether this is the right job for them so what are the biggest challenges facing data Sciences as a holistic discipline these days and as a career path yeah so you know we touched on it a little bit before I think the you know there there there remains this challenge of this kind of myth of the unicorn or the rock star or the superhero I think that it is a persistent problem and in fact it may I may be somewhat biased now because I you know are working in a set of verticals that are that are sort of coming coming of age to a certain extent or at least entering they're kind of adolescents in thinking about data science and so they're kind of starting from a position where a lot of more mature industries were you know five more years ago where they were really focused on finding us specific data scientists but that is a problem you know we don't want people focusing on finding a perfect person who hits all of these different dimensions I think you know so I think the one that actually affects all industries kind of equally is there really isn't a kind of theory of management or theory of product development for data science yet you know on the one hand we have lots of really good theories of product development for you know software products you know we have agile and we have we have scrums and we have and then some companies do it with they have waterfall releases and they have all of these you know there's lots of great tools and lots of great theories of prong development that you can use if you're building a piece of software we really lack that for a kind of professionally designed data science software product there's lots of competing opinions there's lots people will borrow things from from different theories of product development and they experiment with them and then that's really important and I think it makes sense that we're at that kind of 
But I also think we've been through this process long enough, and there are a lot of people who have been data scientists for the last four or five years, that the question is: what do we do next? What do data science managers do? What do they care about? How do we measure their success, and the success of their teams? We don't really have a good set of metrics or a good theory for that, and that only serves to hurt us, because we don't want to lose folks, and we certainly don't want companies and industries to lack success because we didn't spend enough time thinking about how to bring up the next set of data science teams and make them successful.

Thinking about professional development in that light is really interesting, because I've noticed this in other aspects of data science as well, in terms of professional development and support for junior data scientists, for example.

That's a great point. I don't actually have the exact data on this, which is sort of shameful given the context, but you almost never see a job opening explicitly defined as a junior data scientist. You could probably do a scrape of Indeed or Greenhouse or any of these websites, and my guess would be that a huge majority of data science postings explicitly require three to four years of experience. So what does that mean? How does anybody get started? Does everybody have to be an intern, or have some academic credential that is treated as roughly equivalent to that experience, even though we know in practice that it's not? That's really problematic. This is a mistake that we've made at Alluvium too, and it wasn't until more recently, when I sat down with the folks on the data science team and started talking about this, that we really decided to break this up and recognize that even if, on paper, someone might not be ready to contribute at a high level immediately, there's a tremendous amount of value in bringing in someone who's really smart, has an aptitude for this stuff, and can learn directly from us how we think that process should work. I only wish that more folks who are hiring data scientists would think about that.

Yeah, that's right, and it seems, as we've discussed at length, that Alluvium is a place where you're learning on all fronts, constantly.

Yes, for better or worse.

So what does the future of data science look like to you?

I think the future is quite bright. The future of data science may, in some sense, follow the path of lots of other technical careers. We've mentioned it a few times, but to take a different example: how many folks do you know running around these days with the title of webmaster? How many webmasters are there out in the world who are the single point of failure for some big enterprise website? Hopefully very few. But there are still lots and lots of folks running around with the title data scientist who are the single point of failure for all data analytics in an organization. For the future to be successful, you have to start breaking that up and thinking about what the core competencies are of a company that has data and data analysis at the core of its product.
Obviously data science is going to be part of that, but we have to figure out, again, what we mean by data science and what kind of work that person is doing. I think there's also a huge amount of work to be done in terms of development tools, and again development methodologies, for doing that. There are emergent titles like data engineer: people who are explicitly focused on making sure data scientists have the right data at their fingertips at the right time, so they can ask and answer the questions that make a difference for the business. But I think there's also an emerging, related set of skills around DevOps, or development ops, for data science. What does it mean to keep data-driven systems running? How do we instrument them and measure them and make sure they're working properly? How do we actually know that one model in production is meaningfully performing better than another, and who actually builds that stuff? Because it's not a data scientist; it's somebody else. And, as we already mentioned, who is the person actually managing this team, what does their career path look like, and how do we measure their success?

So, as a final question, I'd like to know if you have a final call to action for our listeners.

Yeah, I'll say that if the topics we mentioned in terms of what we work on at Alluvium, or that interview process, sound appealing to you, please drop us a line at alluvium.io/careers. We're actively hiring for data scientists at both the mid and senior positions, as well as entry-level folks, and there's a whole swath of other opportunities for back-end engineers, DevOps engineers, and product engineers. So that's definitely one. The other one is this: if you're at the very beginning of your career, or you're not even ready for a career yet, you're a college student, or you're thinking about making a career shift, the one thing that I think I did really well when I was in your position was to just start writing and talking publicly about what you're doing. A lot of folks talk to me and say, oh, I don't like public speaking, I don't like putting myself out there, I'm kind of introverted. I think the best hack for an introvert is to actually get yourself out there, because then you don't have to spend a lot of energy going and talking to people; they'll come and talk to you. So start a blog, volunteer to speak at a meetup. Maybe it'll be a little scary the first time you do it, but it'll give you a sense of what the opportunities are, you'll build a network, you'll meet people, and I think you'll have a lot more success.

I couldn't agree more, and you'll notice how your approach, your ideas, and your conceptions change when you try to formulate them to communicate to others.

Absolutely. The old platitude is that you don't really know something until you have to teach it, and a baby step toward that is that you don't really know something until you have to present it at a meetup.

I like it, and I will put a link to the Alluvium careers page in the show notes as well.

Great, thank you.

Awesome, Drew, it's been an absolute pleasure having you on the show.

The pleasure was mine. Thanks for the great conversation.
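One way to ground the monitoring question Drew raises above, how to tell whether one model in production is meaningfully outperforming another, is a paired comparison of logged predictions against observed outcomes. The sketch below is only a minimal illustration, not anything Alluvium has described: the arrays are simulated stand-ins for production logs, and the bootstrap is just one reasonable choice of method.

    import numpy as np

    rng = np.random.default_rng(42)

    # Simulated stand-ins for logged production data: observed outcomes and
    # each model's prediction for the same events.
    y_true = rng.normal(100, 10, size=5_000)
    pred_a = y_true + rng.normal(0, 5.0, size=5_000)   # current model
    pred_b = y_true + rng.normal(0, 4.5, size=5_000)   # candidate model

    err_a = np.abs(y_true - pred_a)
    err_b = np.abs(y_true - pred_b)

    # Paired bootstrap of the difference in mean absolute error (A minus B).
    diffs = []
    for _ in range(2_000):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        diffs.append(err_a[idx].mean() - err_b[idx].mean())
    lo, hi = np.percentile(diffs, [2.5, 97.5])

    print(f"MAE A: {err_a.mean():.3f}, MAE B: {err_b.mean():.3f}")
    print(f"95% CI for improvement (A minus B): [{lo:.3f}, {hi:.3f}]")
    # If the interval sits clearly above zero, model B is meaningfully better on
    # this metric; if it straddles zero, the difference may just be noise.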
Thanks for joining our conversation with Drew about how to build data science teams, along with the unique challenges of building data science products for industrial users. We saw Drew's vision of building data science teams as a set of individuals who collectively cover all aspects of his data science Venn diagram and can communicate across it, to go from data to insights and actionable recommendations. We also got tremendous insight into Alluvium's recruitment process, which reflects the job itself as closely as possible. On top of this, we saw just how much Alluvium's work of building data science products for industry requires a combination of existing data science stack tools and new methodologies to deal with streaming data that is higher in volume than that created by Twitter on a daily basis. Make sure to check out our next episode, a conversation with Mara Averick, data nerd at large and tidyverse development advocate at RStudio. Mara and I will talk about exactly what it means to be a data nerd, the role of data science in sports, data for social good, civic tech, and the role of data science paradigms such as the tidyverse in the data science ecosystem as a whole. Do not miss it. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter as @hugobowne and DataCamp as @datacamp. You can find all our episodes and show notes at datacamp.com/community/podcast.

In this episode of DataFramed, a DataCamp podcast, I'll be speaking with Drew Conway: world-renowned data scientist, entrepreneur, author, speaker, and creator of the data science Venn diagram. Drew and I will be talking about how to build data science teams, along with the unique challenges of building data science products for industrial users. How does Drew now view the Venn circles he created, those of hacking skills, mathematical and statistical knowledge, and substantive expertise, when building out data science teams? Stick around for this and much more. To set the scene, the first half of the show will focus on what data science looks like today, through the lens of the evolution of the data science Venn diagram and the unique place data science holds in an industrial setting. The second half will use all of this knowledge to focus on data science team building and recruiting. I'm Hugo Bowne-Anderson, a data scientist at DataCamp, and this is DataFramed.

Welcome to DataFramed, a weekly DataCamp podcast exploring what data science looks like on the ground for working data scientists and what problems it can solve. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter as @hugobowne and DataCamp as @datacamp. You can find all our episodes and show notes at datacamp.com/community/podcast.

Hi, Drew, and welcome to DataFramed.

Hey, Hugo, it's great to be here.

It's great to have you here, and I'm really looking forward to chatting with you about how to build a data science team, along with the unique challenges of building data science products for industrial users. But first I'd like to find out a bit about you. What are you known for in the data science community?

Right, so a long time ago I was probably known for being one of the earliest bloggers in data science. As the story goes, when I was first admitted to graduate school at NYU, I was really excited about finally having the opportunity to speak publicly about the interests that I had and some of the work that I was doing. So I started this tiny little blog called Zero Intelligence Agents and really just used it as a public notebook for the things I was working on: code that I was writing, things that were interesting to me.
Then I eventually combined that with the very early days of social media on Twitter, and found that there was this small community of folks who were writing and tweeting about the interesting data work they were doing, even predating a generally agreed-upon term for data science. And then, of course, the thing that I'm actually most known for in the data science community is the data science Venn diagram.

Of course. And you have been heavily involved in the NYC data science community as well, right?

Yeah. Probably the thing I'm most proud of in terms of my contribution back to the data science community, from being in graduate school at NYU and then building companies and building teams in New York, has been planting a flag for New York City as a great place to be a data scientist and a great place to do data science.

Yeah, and you actually have a strong argument that New York City was a place where data science was being done even before data science was a discipline, right?

That's exactly right. Think about what the anchor industries have always been in New York. You can take tech out of it and just think about the financial services and banking industries, the media industries, and the advertising industries; all three of those are really data-centric. So I think what we see now in New York City is that it has sort of always been this place where data, and the robust analysis of data, has been central to business and to profit-making, and now, with technology and its dissemination and movement into other verticals, those industries themselves, financial services and media and ad tech, have become a big part of what makes New York City unique. Beyond that, with the university system and the number of startups in the space, New York has become a really great place to do this kind of work. There are a bunch of other things that are really unique about New York that tend to enhance that, one of which is just the geography of the city. It's easy to make comparisons to the West Coast, but New York City is a tiny little island and we're all crammed onto it. If I want to go have a conversation with you, I can just jump on the subway, go downtown, and there you are. If I want to go to a meetup, I can go there; I can go speak to a professor at a university; and everything is within five or six square miles of where I'm sitting right now. I think that has really changed how people in this city can do work.

That's right, and there's such a strong sense of community around data science here.

Agreed, and I think part of that is just the culture of New York. For better or worse, we're loud, we're brash, we like to talk about what we're doing, and that means a lot of ideas get shared.

That makes all the difference in the world. And you've spoken to a number of topics that we'll get back to: your early days as a blogger, really contributing to the evolution of data science and defining it initially as a career; the role of community around data science in New York City; and also the famous Venn diagram. So I thought maybe you could tell us a bit about that.
I'm sure you've been talking about it for years, so I'm sure you don't want to talk about it too much, but tell us where it came from and where you see its evolution having gone since then.

Sure. The origin story of the Venn diagram is actually intimately tied to the data science community in New York City. A little less than ten years ago, when I first arrived in New York, I inserted myself into what was then a nascent community of folks in academia, industry, and startups who were doing this work. Eventually there formed an almost-working-group of folks who would meet every month for a potluck brunch on the R&D floor, all the way at the top of the New York Times building, and just sit around on a Sunday morning and talk about what this data science thing was. We would have topics around, say, how you would think about teaching it, and names that I think are now among the most venerable associated with data science were there just thinking about it: folks like Mark Hansen and Chris Wiggins from the academic side, and Hilary Mason and Mike on the commercial side, and I was there.

And when was this?

This was back in 2009-2010 that we were having these morning conversations, and sometimes they were mostly just for fun; we were all friends, we knew each other from various walks of life, and we would just come together and chat. One day we were having this conversation around: what is data science, how would we think about defining it, what are the requirements to be a good data scientist? We had this wonderful conversation, with Chris Wiggins and Hilary Mason kind of leading the chat, and I walked away from that discussion with a whole bunch of ideas in my head: OK, this is what these folks think, and here's how I might interpret it. So the following week, this was when I was still in graduate school, in one particular class I sat all the way in the back of the lecture hall, opened up my laptop, and started thinking about how I would define data science based on the ideas that had been discussed at this potluck breakfast. That ultimately led me to firing up my open-source illustration tool and creating what is now the data science Venn diagram. I then wrote a blog post about it that went about as viral as a data science post could go circa 2010.

And we will definitely link to that blog post in the show notes. Of course, in the middle of that Venn diagram you have data science, but maybe you can tell us about the things that revolve around it, the necessary skills.

Yeah. The central part of the debate that we were having back in 2010, and honestly it seems like many folks are still having it, although it's shattered into many more dimensions now, is: what are the constituent pieces a person should have if they want to actually be a data scientist? I broke this into three big groups. One is that you have to be competent in using and developing software, what I refer to as hacking skills, and what I really meant by that is that this is not someone who is a professional software engineer.
Hacking skills means someone who is able to fire up the command line, can manipulate text, and knows how to work with a scripting language, so that they can produce repeatable, maybe shareable and reproducible, pieces of code that can be used to analyze data. Again, there wasn't a sense of professional application; it was just: do you know some stuff, can you actually code enough to build an MVP of something? The other piece of it was the academic side: if you're going to be building these things, you should have some real grounding in the statistics and the mathematics that go into the models and the methods you're using. If you don't have that, you may simply be pointing a very powerful technical weapon at data and not actually know what's going on. And the third piece, which I think ultimately becomes the glue that brings it all together, is what I called substantive expertise, really subject-matter expertise. This has nothing to do with your skills as a coder or your competency as a statistician; it's more: do you know how to ask good questions? Because at the end of the day, and again thinking about this back in 2010, what I was observing in the intellectual marketplace, so to speak, is that there tended to be a lot of people who were good at coding, and a lot of people who had or could get training in statistics and math, but they didn't really know how to ask good questions. If you don't have a point of view on a problem, or a point of view on a data set, then you're starting with nothing, because no matter how much data analysis you do, if you're asking the wrong questions you're just treading water. So we combine all three of those to create data science, and of course there are the secondary overlaps that occur between them. The one that for a lot of people was most satisfying: people were trying to make a distinction between data science and machine learning, and what's the difference? What seemed obvious to me is that if you have hacking skills and you know about statistics and math, you put those together and that's really what machine learning is, or certainly was those many years ago. I wanted to balance that with what I viewed, at the time as a PhD student, as traditional research: if you have the methodological grounding in statistics and math and you also have substantive expertise, as I always saw in a political science department, people working on American politics questions or on international relations and conflict questions, they know a lot about those subjects and can often apply specific mathematical models to try to estimate what they're seeing in data, but that's not data science, that's traditional research. So I had this overlap of traditional research. And then the other overlap, which I don't know if it was controversial but people had a lot to say about it, was this idea that if you had substantive expertise, that is, you knew a subject well enough to ask good questions, and you were also able to hack and write code and get some answer, but you didn't have the statistical and mathematical grounding to know what those answers meant, then you're in the danger zone.
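For readers who want to see the structure Drew describes laid out as code, here is a small sketch that redraws the diagram's regions using the third-party matplotlib-venn package. The package choice, equal region weights, and label placement are ours for illustration; this is not Drew's original artwork or code.

    import matplotlib.pyplot as plt
    from matplotlib_venn import venn3

    # The three circles Drew describes, with equal weights purely for layout.
    v = venn3(subsets=(1, 1, 1, 1, 1, 1, 1),
              set_labels=("Hacking skills",
                          "Math & statistics knowledge",
                          "Substantive expertise"))

    # Label the pairwise overlaps and the centre the way the original diagram does.
    v.get_label_by_id("110").set_text("Machine\nlearning")
    v.get_label_by_id("011").set_text("Traditional\nresearch")
    v.get_label_by_id("101").set_text("Danger\nzone!")
    v.get_label_by_id("111").set_text("Data\nscience")
    for region in ("100", "010", "001"):
        v.get_label_by_id(region).set_text("")   # hide the placeholder counts

    plt.title("The data science Venn diagram (after Conway, 2010)")
    plt.show()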
I often referred to this as knowing enough to be dangerous, and that's kind of the worst place to be, because then you can create very misleading results. I was trying to use that as a guard, in the hope that folks wouldn't put data science on a path to snake oil.

Absolutely. The danger zone is very interesting to me; I love that you put an exclamation point at the end of it to draw even more attention to it. The idea that people can fit and predict a variety of models after importing data, especially with the raging success of all these fantastic new APIs, without necessarily knowing a lot about the models they're using, is actually incredibly dangerous.

Yeah, and there are a whole bunch of dimensions to this, even from a commercial perspective. The thing I found most interesting about how data science tools, and even platforms and applications, have grown over the last five to ten years is that, for a while, there was this real attraction to building, say, a data-science-in-a-box tool: take your data set, stick it into this tool, it will predict for you all the possible outcomes of something, and voila, you have a model, you can put that model in production, and it's great. These tools were targeted at people who had substantive expertise: I can build this tool and go sell it to someone at an insurance company, and that insurance company will have better actuarial tables. It doesn't matter that the folks at the insurance company might not exactly know what my random forest is doing and why it's making those distinctions; it just matters that they're getting better results. I remember being around and hearing about a lot of those companies and, quite honestly, at the time thinking it seemed like a reasonable idea, but having this sinking feeling that they were building tools that were in this danger zone realm. My observation since then is that a lot of those tools don't really fit a real use case. They sit in a valley between two real use cases. One: you have no hacking or methodological skills, your primary tool is something like Excel, you're good at making charts, but you don't really know how to build things, so you have a specific, service-oriented need. And then there are the folks who are actually really good at the substantive stuff and the mathematical stuff, so they need really granular tools to do their work, and they're the ones who are ultimately going to learn Python and become data scientists. There's nobody really in between, so I saw tools being built that fell into that danger zone and ultimately didn't have a lot of success.

Yeah, and I think that also speaks to the fact that pursuing, let's say, accuracy or model performance at the expense of other qualities is inherently dangerous, and now we're seeing a rising desire and need in society for machine learning interpretability, which brings us back to more substantive expertise.

That's a really good point. These things kind of ebb and flow.
There's this natural early attraction to black-box tools, and a lot of it follows from what is sometimes the negative downside of the Venn diagram: folks viewing these holistic definitions as describing almost unicorn-like individuals. What that does is cast a shadow over the discipline that says, OK, only a very specific kind of person can do this, somebody who has all of these things, and if you don't have that, well, then you need a very specific tool to do it. Of course, the history of data science, and the history of many technical crafts, is not about finding one person who does everything; it's always about finding a group of people who know a lot about some parts of it and can work together. So I think there's always a natural early inclination to say, I can build this tool that does this one thing really well and I will try to find somebody to use it, and again, I think the results in the market have not been great for those approaches.

For sure. And you've actually preempted my next question in some sense. I was going to play devil's advocate and ask you whether you thought the entire Venn diagram was a danger zone in itself, in terms of its potential for being misinterpreted and the search for the unicorn.

I think that certainly was, and I guess still is, true. If there's one thing I wish I could have been clearer about when I introduced it, it's that it's called the data science Venn diagram, not the data scientist Venn diagram. I think a lot of people looked at it and said, oh, these are all the skills I need to hire for if I'm going to hire a data scientist, when really the idea of the diagram is that this is what the discipline is. If we think about other disciplines like software engineering, we don't think there's a canonical software engineer who does everything, and I think it's the same for data science. Unfortunately, in those days, and to some extent still today, people view these kinds of holistic definitions that way. The Venn diagram has become useful shorthand because it's an image and it's easy to share, but there's still a lot of pixels and ink spilled trying to holistically define what this career path is, and ultimately I think that's in some sense a waste of time, because we know how this movie is going to end, we've seen it many other times, and so we should be thinking about it in the context of teams and how people work together.

Absolutely, and that's something we'll get to with the future of data science. I will make it clear that your Venn diagram doesn't say data scientist in the middle; it specifically says data science. And in fact it's not necessarily just a unicorn: I think you had a great slide at Jared Lander's conference, the New York R Conference, which had a cat with a laser gun riding a unicorn, right?

Yeah, and I wish I could credit whoever the artist was that created that, because I think it's wonderful.
If I recall from that talk, I was thinking back on those early days, right at the turn of the decade, around 2010-2011, when I was in New York City having lots of conversations with people at various companies, from early-stage startups to Fortune 500 companies, and when they heard that I was a data scientist, or that I could help them find data scientists, the punchline of the joke is that that is who they thought they were meeting with: this cat riding a unicorn with a handgun and a flamethrower or something. Of course we know that's not true, and the further we get toward a professionalized discipline of data science, the less true it becomes.

I couldn't agree more. Something I'm looking forward to chatting about later in this episode is how you'd go about building a data science team from this Venn diagram, but before we get there, I'd like to know a bit about what you do these days. What do you spend most of your time doing?

I do almost no data science, in fact. I'm the founder and CEO of a company called Alluvium. What we do is build data products for the men and women working in complex industrial settings, so what I do most of the time is think really hard about what their problems are and, in particular, how we can build tools that help them better leverage data to make decisions. That means I spend a whole lot of time listening to our customers and asking them those kinds of questions. Of course, I also spend a whole lot of time listening to my teammates, answering their questions, and learning about what kinds of techniques, particularly on the data science side, they think would be most applicable to solving these problems. And then I also have the opportunity to speak to folks who might want to think about working for us, so I still do a fair amount of recruiting and thinking about how best to explain what we do and how to get people excited about working with us.

Fantastic. Are you currently hiring?

Oh yeah; you can bet that'll be my call to action at the end.

Fantastic. So it actually sounds like, in some respects, you're acting as an interface between the substantive expertise of the industries you're working with and the hacking skills and mathematical and statistical knowledge.

Actually, that's a great way to think about it. When I founded Alluvium, there was no question in my mind how little I knew about the day-to-day lives of someone working in a power plant or an oil refinery. In fact, one of the core values at Alluvium is about learning, and learning firsthand; we call it seek the firsthand. We always want everyone in our company to think about how they can go out and, through firsthand knowledge, learn about something new, and in particular learn about how our customers do their work. That's substantive expertise: how industrial operations work, what kind of data gets generated, how that data generation process gets instrumented, who the actual people standing on the front lines making decisions with that data are, who the people standing in the control room observing that data are, and who the folks back in the headquarters building making business decisions from that data are. We want to seek out and learn about all of that work so that we can build products that actually support them in their day-to-day.

Yeah, and if I recall correctly, your web page says that a lot of your data scientists will put on hard hats and go out there in the field.
Oh yeah, and not just the data scientists; the whole team gets out there. We have yet to get Alluvium-branded hard hats, so we're often relying on our hosts to provide them, but it's probably one of the most exciting parts of the job. A year ago we went to the large recycling center out in Brooklyn; it was fascinating to see how the city of New York handles its waste and how they try to improve the efficient reuse and recycling of it. And then this past winter we went to a robotics consortium in the Navy Yard and learned about how they're using robotics for art, for industry, and for startups as well. Learning has become such a core part of how we do our business that I'm always excited to get a chance to go out and see how people do their work.

It sounds like an incredible opportunity. And when you said you haven't got Alluvium-branded hard hats yet, I just wondered whether you've tried to put Alluvium-branded laptop stickers on them at any point.

That's a good idea. The nice thing is, and maybe you're preempting me here, we could just buy the hard hats and put the stickers on them; that's the easiest thing to do.

Yeah, exactly. So I'd love to talk a bit more about Alluvium, and I'd like to motivate my question by quoting, or paraphrasing, you: you've said that much of data science is a stack of tools developed to deal with big data and designed for the web, but what can data science do for non-digital industries? That's really to frame my question, which is: what are the major challenges that you're trying to solve with your work at Alluvium?

To go back a little to that context: when I founded the company, I was coming off having worked at another startup here in the city, a consumer health company, where we were building a product that tried to use real-time streaming physiological and telemetry data to help people understand their overall health. Part of what really attracted me to that opportunity was actually working with streaming data from the real world. In the early part of my career I had worked in the international security field, and I dealt with a lot of data from sensors, whether telecommunications sensors or measurements and signals out in the field, and then combined it with highly unstructured data like text reports or images of maps. One of the things that really stuck with me from that early experience is that even in those days, and by those days I mean the mid-2000s, a lot of what we now think of as the commoditized stack of big data tools really didn't work well for dealing with that kind of data, and we just had to develop a lot of ad hoc methods for it. Fast-forward almost a decade, to when I was starting Alluvium, and I returned to find that it's mostly still the same. We, and by that I mean we as a technology and data community, had seen that there was a ton of value in web server log files and search results and clickstream data, all these things that are produced by and used within digital platforms, but we hadn't thought a lot about how to do the same sort of thing with data that's generated outside the web.
These physical systems are so complex, there are so many things that are hard to observe, we have poor ways of measuring them, and we don't have good software tools for dealing with them. The example I like to give is that it's still really hard to predict the weather more than a day or so out, and part of the reason is that the earth is an extraordinarily complex system: we don't really have good ways of measuring it, and we certainly don't have good ways of doing analysis on those measurements. So when I had the opportunity to start thinking about my own company and the kinds of problems I would want to solve, the thing I realized is that this technical problem was still highly present. There was just not a good way of doing distributed, real-time, unsupervised learning from data coming off these kinds of physical sensors in any unified way. If you had multiple assets across multiple physical locations, and you wanted a uniform view of how all of those things were operating, and you wanted to do that learning without any training data, how would you even think about doing that? That was the technical spark for Alluvium. The founding spark, or rather the commercial spark, was realizing where that problem was most acute. Again reaching back to this seek-the-firsthand idea, I just got out and started talking to folks, and it quickly became clear to me that the industrial space, which has for hundreds of years been a highly data-driven set of industries, whether it's oil and gas or manufacturing, both discrete and process manufacturing, was really good at collecting data but had not really matured in what they would refer to as their digital transformation. These are processes and systems that are still, in many cases, highly analog, and even where heavy investments have been made in generating data, there are just not a lot of good tools for doing anything with it. So those two ideas slammed together, and then we got to work.

Could you give us an example, or a case study, of some of the work you've done?

Sure. Probably the best example I can think of that I'm able to talk about right now was an early pilot that we did with the New Orleans Police Department. This ultimately ended up not being a path we decided to take commercially, but in the early days we were really interested in what our technology could do, say, inside a vehicle. A modern car is basically a motorized computer, so it generates a tremendous amount of data, but you want to have both a local view of how that vehicle is operating and a global view. The way we talk about that at Alluvium is through this idea of stability, and stability forms not only the central language we use around our products but, in some sense, the core value proposition of the business. We want to provide our customers with a view of the overall stability of their operation, and then, when that changes, we want to be able to quickly alert and guide an operator to where in the system that instability may be coming from, so that they can very quickly make an evaluation and ultimately take an action if they need to.
So in the case of the police department, we built a prototype and a pilot for them based on vehicle operation. The idea was that we wanted to show how police vehicles were operating in the city, by actually putting software inside the vehicles, on the vehicle laptops, to stream data from the OBD sensors, the onboard diagnostics sensors, which carry a huge amount of information you can draw from, to give a global view of the stability of the vehicles out in the field. We built that, and ultimately we decided there was a much bigger opportunity in plants and factories, and that's where we focused our attention, but we were able to build this prototype and pilot for them, see changes in vehicle operation and how that changed stability, and produce some really interesting insights.

Great. So essentially what we're also talking about is not data science standing alone by itself, but actually building data products as well.

Yeah, and that's kind of the whole gig. We have a particular point of view at Alluvium that data science, machine learning, AI, whatever you want to call it, only takes you so far. Ultimately, if you're building a product that is there to support someone making a decision, then you need to think about the point at which their knowledge, their context, and their expertise need to take over, where they ultimately make some decision or adjudication based on what you're presenting to them. When I'm introducing what we do to folks in the industrial space, or to potential customers, we talk about this tension between data discovery and data reasoning, or reasoning about data. Put yourself in the role of an industrial engineer standing inside a refinery: they are basically beset on all sides by a wave of information. I forget the exact statistic, but I think the average oil refinery will produce more data in a day than all of Twitter does in the same time period. If you're the person standing there, and your job is to mitigate any problems and track how the process is working, there's just no possible way that you as an individual, or even a highly competent, highly trained team of mechanical and industrial engineers, could do that high-dimensional math problem in your heads, or even with tools. But a computer is really good at that: taking in lots of information, performing lots of analyses on it, and applying dimensionality reduction methods to that data to try to identify the overall or systematic changes in it. So we believe that well-designed data tools should really be pulling the cognitive responsibility away from data discovery and toward data reasoning, because computers are really bad at reasoning about data: they don't really know why something changes, they might just know that it does change. But a person, particularly a highly trained industrial engineer, knows exactly why something might be changing, if they're presented with the right information at the right time.
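Drew does not spell out Alluvium's algorithms here, but the flavour of the idea, collapsing a high-dimensional sensor stream into a single stability signal an operator can reason about, can be sketched roughly as follows. The sensor data below is simulated, and incremental PCA reconstruction error is just one illustrative choice of dimensionality reduction, not Alluvium's actual method.

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    rng = np.random.default_rng(0)
    n_sensors = 50

    def sensor_batch(t, size=200):
        """Simulated batch of multivariate sensor readings; drifts after batch 10."""
        base = rng.normal(0, 1, size=(size, n_sensors))
        if t >= 10:
            base[:, :5] += 3.0   # a few sensors shift, mimicking emerging instability
        return base

    ipca = IncrementalPCA(n_components=5)

    # Warm up the low-dimensional model on the first few "healthy" batches.
    for t in range(3):
        ipca.partial_fit(sensor_batch(t))

    # Score each new batch by how poorly the model reconstructs it; a rising
    # score is the "instability" signal handed to the operator.
    for t in range(3, 15):
        X = sensor_batch(t)
        X_hat = ipca.inverse_transform(ipca.transform(X))
        score = float(np.mean((X - X_hat) ** 2))
        print(f"batch {t:2d}  instability score = {score:.2f}")
        if t < 10:
            ipca.partial_fit(X)   # keep learning while things look normal

The point of the sketch is the handoff Drew describes: the computer does the high-dimensional bookkeeping and surfaces a score, and the operator decides what, if anything, the change means.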
So we think about what the equilibrium point is, the optimal point at which we can hand off an automatically generated finding about these kinds of changes to an operator, who can quickly move through that information and make an evaluation or take an action, and then, based on that action, have the system learn and get smarter, get better at identifying important changes, or changes that aren't important. Because at the end of the day we want that experience to be as good as possible for that human; we really want to respect their time. The other thing I'll say on this is that our customers are a little atypical for data science products, because they don't really care about software at all. Sometimes software is the thing they have to use, or the thing they go to when they really need help with something very specific, but their job is to run a plant, and that is a physically intense job, not one that typically requires staring at a computer screen for very long. So when they do look at a computer screen, we want that to be a really high-value interaction.

Exactly, and software is a tool to help them answer questions and get to deliverables.

Exactly, and in particular it's a tool they tend to be pretty leery of, so you have to really be able to show value quickly.

Yeah: I've been doing this longer than software's been around, right?

Exactly. One of the things I talk about with the team is that if you think, even generously, about the history of modern big data as we think about it today, it's roughly ten years old, maybe fifteen if you assume that Google and Yahoo were developing these things for years before they were released. But folks who've been working in the industrial space have seen hundreds of years of technological revolution change the way they do their businesses, so our little drop in the bucket barely makes waves.

After a short segment, we'll jump right back into our interview with Drew to use everything we've just discovered to focus our attention squarely on Drew's approach to data science team building and recruiting. Let's now jump into a segment called Rich, Famous and Popular, with Greg Wilson, who wrangles instructor training at DataCamp.

Hey, Greg.

G'day.

So Greg, what do you have for us today?

Well, as a follow-up to our discussion last time about using empirical methods to guide the design of programming languages, I'd like to see someone use data science to find actual design patterns in software. If you haven't run into the term before, a design pattern is something that comes up often enough to be worth giving a name, but isn't precise enough to turn into a single general-purpose library. The term originated in architecture: for example, I think most people know what a porch is, but if you try to pin down a precise definition, it turns out to be surprisingly slippery. Similarly, programmers use terms like pipeline or plugin in more or less the same way to describe more or less the same high-level concept, but there are enough different ways to turn those concepts into code that there's never going to be one definitive implementation that everyone uses.

So how can data science help?

Well, data science is pretty good at finding patterns, so I think it would be cool to see if we could use clustering techniques to find patterns in the way code is structured and used. You see, the design patterns we have now are all the product of experienced programmers eyeballing code, which is great as a starting point, but different experts will see different patterns, or put particular instances into different clusters, and so on.
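As a toy illustration of the static half of the experiment Greg is proposing, one could extract a few coarse structural counts from Python source files with the standard-library ast module and cluster them. The feature choices, cluster count, and the some_project directory below are made up for the example, and the resulting clusters would only be candidate patterns for a human to inspect, not validated design patterns.

    import ast
    from pathlib import Path

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    def structural_features(source: str) -> list[int]:
        """Count a handful of coarse structural elements in one Python file."""
        tree = ast.parse(source)
        counts = {"classes": 0, "functions": 0, "loops": 0, "imports": 0, "calls": 0}
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                counts["classes"] += 1
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                counts["functions"] += 1
            elif isinstance(node, (ast.For, ast.While)):
                counts["loops"] += 1
            elif isinstance(node, (ast.Import, ast.ImportFrom)):
                counts["imports"] += 1
            elif isinstance(node, ast.Call):
                counts["calls"] += 1
        return list(counts.values())

    # Point this at any local directory of Python files you want to explore.
    files = sorted(Path("some_project").rglob("*.py"))
    X = np.array([structural_features(f.read_text(encoding="utf-8")) for f in files])

    # Group files with similar structure; each cluster is a candidate "pattern".
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
        StandardScaler().fit_transform(X)
    )
    for f, label in zip(files, labels):
        print(label, f)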
If we throw the tools we have against hundreds of thousands of source files, will we find the same patterns in how classes are structured? What if we look at traces of those programs' execution: will information about when objects are created, how they call each other's methods, and so on give us more insight?

I can see how you would get source code from somewhere like GitHub, but where will you get traces from running programs? I mean, I'm all for helping science, but I probably wouldn't let you install something on my computer to look at what I was doing.

I don't think we'd have to go that far. There are enough software projects out there now with decent test suites that we could look at how the code runs itself. And we could start small: if you look at the work of people like Jorma Sajaniemi, there's actually a surprising richness in how single variables are used. A loop index isn't the same as a state flag, which isn't the same as an accumulator, but you might need dynamic analysis to tell them apart. I don't know, and I think the only way to find out is to have someone take a crack at it. Now, software engineering research is already doing some of this, but the work I've seen has taken the hand-rolled categories as a given rather than trying to validate them or discover new ones, and I think we'd learn a lot by having fresh eyes.

Thank you very much, Greg. If anyone in the audience is interested in giving this a try, please get in touch; we'd love to hear from you. Thanks again, Greg, and looking forward to speaking with you soon.

Thanks, Hugo.

After that interlude, it's time to jump back into our chat with Drew Conway. As we discussed earlier, a large part of the work you personally do is building and managing data science teams, so I'd like to really get into that now. Hypothetically, I'm going to give you a million dollars to build a data science team. How would you spend that million dollars, and what would the team look like?

It's interesting. The amount of money is obviously great, because I could hire a bunch of data scientists, maybe not as many as I would hope with a million dollars. But the first thing I always talk to folks about when they're thinking about building a data science team is: do you need data science? Do you actually need a data scientist? I think there are a lot of companies that are data companies that confuse that with being a data science company. There are lots of great examples; some of the ones I like best involve the fact that there's a lot of opportunity in the world for taking old, difficult-to-navigate data sets and making them easier to navigate and easier to draw conclusions from. But does that need data science? Do you need a model, do you need some prediction or some classification, to be good at that? Probably not. So do you actually need to hire data scientists? The first thing I think about is whether we actually need that. OK, if we've convinced ourselves that we do, then the next question you need to ask is: what kind of data science is important to my company? Am I a company that wants to do research and develop things at the cutting edge, and am I willing to put a tremendous amount of resources, and take on a tremendous amount of risk, to commercialize academic pursuits?
Well, if that's true, then there's a certain kind of data scientist you would want to think about hiring and building a team around: folks who actually have experience in independent, basic research around methodologies and data, and who might look more like a statistician than an engineer. Thinking about that is really important. On the other side, and this will be more relevant to your listeners, or certainly to folks in businesses: is data science core to your product? Is it the thing that actually gets people using it and buying it and coming back? Is it the prediction, the classification, the finding you can provide them with data that makes them use the product? If that's true, then you want to hire for a different set of skills. You do need software engineers; you do need folks who understand how to collaboratively build software and who know that there's always tension between purity of results and the functionality of something in a production system. And then, the place where I'm happy to talk even more, is: OK, let's think about building a recruiting process that actually supports the answers to all of those questions.

I'd love to talk about that more in just a minute. I do think the initial question, does the company need data science, is incredible, because I was actually speaking with a data scientist at Google a while ago who shared an analogy that stuck with me. I might mess up the analogy slightly, but she said a data scientist is to a company what a tiger is to a drug dealer. I said, tell me more, and she said, well, if you're a drug dealer and there's another drug dealer down the road who has a tiger on a leash, you're going to get a tiger on a leash. Similarly, if you're a company and your competition has data scientists, you're going to want to get data scientists, without actually thinking about whether you need one to build up your team or to meet whatever metrics you're trying to hit.

I like that a lot. I also need to get a tiger now.

Absolutely. The other thing I found interesting about what you're saying is that, as we discussed earlier, the substantive expertise in your line of work at Alluvium comes from your customers and also from your data scientists on the ground, who get out into the field along with everyone else in the company as much as possible. How do you think about hiring around the other parts of the Venn diagram? Presumably you don't require that all your data scientists have a really strong background in mathematics and statistics from the academic side, nor that they have computer science degrees.

What I often say to folks, and this is certainly true for Alluvium, is that your recruiting process is the first product you're likely to build to a kind of MVP and completion. If you're hiring someone to do a job, then the recruiting process should reflect as closely as possible how they're going to work in that job every day, or at least a close approximation of it. The recruiting process is a naturally asymmetrical event: your team, if you're bringing someone in to interview them, is going to get a ton of information about how you think they may perform, based on the questions you ask them. That's entirely one direction.
The other person, the one you're bringing in to recruit, will get little to no, or a lot of, information about what it's going to be like to do this job with you, depending on how you build that recruitment product. If I have a recruiting process that starts with a puzzle, moves to a whiteboarding session, and ends with a culture-fit interview and maybe some live coding, that reflects absolutely nothing about what my job would be like when I actually get to work. So what we do at Alluvium is try to build the whole process, start to finish, basically as you might imagine going from zero to a piece of code deployed into our production system as a data scientist. The first-round interview is mostly a get-to-know-you: learn a little about what interests you, learn a little about why we might be interesting to you, and in particular what it is about this unique intersection of industry, data science, and product building that's exciting to you. Then there's a bit of a technical screen, just to have a baseline: what kinds of tools do you like to use now, what are some examples of problems you've solved with them, and walk us through a piece of software you put into production or, if you're coming out of an academic position, a piece of software you used in a paper and how that process went. Then we have a take-home coding exercise, and the exercise reflects exactly the kind of problems we work on. We actually put a lot of effort into building it; we even built our own streaming service. It's a somewhat stylized problem, but it's based on a real problem we worked on having to do with wind turbine data. We created a streaming service that emits real data from a wind turbine, and you're asked to do some simple exploratory analysis and then try to predict some values coming off that turbine. You submit that as a pull request to a GitHub repo we give you as part of the exercise, and we as a team look at the results. If we like what we see, and we're interested to learn more about you and how you might fit in, we invite you for an on-site interview, and the very first session of that on-site is actually a code review of what you did, because if you're working for us and you submit code, your code will be reviewed, and if you're a data scientist, it will be reviewed both for the technical code itself and for the methods you used. We like to be pretty skeptical with folks about the choices they made, even if those choices are perfectly reasonable and make sense, to get a sense of why someone chose one method versus another. And a perfectly acceptable answer is: this is what fit into the time allotted for the exercise. If you're a professional data scientist working in a business, that's a perfectly reasonable answer. So that's just one example of how we do it. The next session is where we talk about how you might think about expanding this into a larger project.
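The actual Alluvium take-home and its streaming service are not public, so the following is only a stripped-down stand-in that mimics the shape of the exercise Drew describes: pull a window of turbine readings (simulated here), do a quick exploratory summary, and fit a deliberately simple baseline model that predicts power output. The field names and the data-generating process are invented for the illustration.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(7)

    # Simulated stand-in for a window of readings pulled from a turbine stream.
    n = 1_000
    wind_speed = rng.uniform(3, 25, size=n)              # m/s
    ambient_temp = rng.normal(12, 6, size=n)             # degrees C
    power = 0.6 * wind_speed ** 2 - 0.4 * ambient_temp + rng.normal(0, 5, size=n)
    df = pd.DataFrame({"wind_speed": wind_speed,
                       "ambient_temp": ambient_temp,
                       "power_kw": power})

    # Quick exploratory look, the sort of thing such an exercise asks for first.
    print(df.describe().round(2))
    print(df.corr().round(2))

    # A deliberately simple baseline predicting power from the other readings.
    X_train, X_test, y_train, y_test = train_test_split(
        df[["wind_speed", "ambient_temp"]], df["power_kw"], random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    print(f"R^2 on held-out readings: {model.score(X_test, y_test):.2f}")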
The exercise is a kind of toy example, so how would we actually think about building it into a product designed to do this? We have folks at the company play the role of a customer who would be asking questions about this product, and a software engineer at the company who would think about productionizing the system. Then we have our own culture-fit interview, where we talk about our company values and how someone thinks about what matters to them in working at a company. And then we mostly just listen: we bring folks around and let them ask questions of me or of anyone else in the company, and we really try to give someone a sense, by the time they've left the interview, of what it would be like to walk in on day one and start working. The last thing I'll say on that, not to go through the process at too great a length, is that if we do ultimately get to a place where we want to extend somebody an offer, I think it's really important that people know what they're going to be working on when they get here. If you're hiring someone and you can't articulate, at least in bullet form, the work you want them to do when they get there, why are you hiring them? Do you not have work for them to do?
So they're starting from a position where a lot of more mature industries were five or more years ago, where they were really focused on finding that specific data scientist. That is a problem: we don't want people focusing on finding a perfect person who hits all of these different dimensions. But I think the challenge that actually affects all industries equally is that there really isn't a theory of management, or a theory of product development, for data science yet. On the one hand, we have lots of really good theories of product development for software products: we have Agile, we have Scrum, some companies do waterfall releases. There are lots of great tools and theories of product development you can use if you're building a piece of software. We really lack that for a professionally designed data science product. There are lots of competing opinions; people borrow things from different theories of product development and experiment with them, and that's really important. I think it makes sense that we're at that experimental phase, but I also think we've been through this process long enough that there are a lot of people who have been data scientists for the last four or five years, and so the question is: what do we do next? What do data science managers do? What do they care about? How do we measure their success, and the success of their teams? We don't really have a good set of metrics or a good theory for that, and that only serves to hurt us, because we don't want to lose folks, and we certainly don't want companies and industries to lack success because we didn't spend enough time thinking about how to bring up the next set of data science teams and make them most successful.

Thinking about professional development in that light is really interesting, because I've noticed this in other aspects of data science as well, in terms of professional development and support for junior data scientists, for example.

Yeah, actually, that's a great point. I don't have the exact data on this, which is sort of shameful given the context, but you almost never see a job opening for a junior data scientist, explicitly defined. You could probably do a scrape of Indeed, or Greenhouse, or any of these websites, and my guess would be that a huge majority of data science postings explicitly require three to four years of experience. So what does that mean? How does anybody get started? Does everybody have to be an intern, or have some academic credential that is roughly equivalent to that experience, even though we know in practice that it's not? That's really problematic. I even think of this as a mistake we've made at Alluvium, and it wasn't until more recently, when I sat down with the folks on the data science team and started talking about this, that we really decided to break this up and recognize that, even though on paper someone might not be ready to contribute at a high level immediately, there's a tremendous amount of value in bringing in someone who's really smart, has an aptitude for this stuff, and can learn directly from us how we think that process should work. I only say that because I wish more folks who are hiring data scientists would think about that.
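Drew's aside about scraping job boards to check experience requirements is only a guess in the conversation, but the tally itself is easy to sketch. The snippet below is purely illustrative and not something Drew described: it assumes you have already collected posting descriptions by whatever means the job boards' terms allow, and the pattern, function name, threshold, and sample postings are all invented for the example.

```python
# Hypothetical sketch only -- counts postings that demand prior experience.
import re

EXPERIENCE_PATTERN = re.compile(
    r"(\d+)\s*(?:\+|\s*-\s*\d+)?\s*years?['’]?\s*(?:of\s+)?experience",
    re.IGNORECASE,
)

def requires_experience(description: str, min_years: int = 3) -> bool:
    """True if the posting asks for at least `min_years` years of experience."""
    for match in EXPERIENCE_PATTERN.finditer(description):
        if int(match.group(1)) >= min_years:
            return True
    return False

# Stand-in for descriptions gathered from a job board.
postings = [
    "Data Scientist: 3+ years of experience with Python and SQL required.",
    "Junior Data Scientist: no prior experience needed, strong stats background.",
    "Senior Data Scientist: 5-7 years experience building ML systems.",
]

n_senior = sum(requires_experience(p) for p in postings)
print(f"{n_senior} of {len(postings)} postings require 3+ years of experience.")
```

On a real corpus you would also want to handle phrasings like "experience of three years" and requirements listed in structured fields, but the basic count is this simple.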
Yeah, that's right, and it seems, as we've discussed at length, that Alluvium is a place where you're learning on all fronts, constantly.

Yes, for better or worse.

So what does the future of data science look like to you?

I think the future is quite bright. The future of data science may, in some sense, follow the path of lots of other technical careers. We've mentioned it a few times, but to take a different example: how many folks do you know running around these days with the title "webmaster"? How many webmasters are there out in the world who are the single point of failure for some big enterprise website? Hopefully very few. But there are still lots and lots of folks running around with the title "data scientist" who are the single point of failure for all data analytics in an organization. For the future to be successful, you have to start breaking that up and thinking about the core competencies of a company that has data and data analysis at the core of its product. Obviously data science is going to be part of that, but we have to figure out, again, what we mean by data science and what kind of work that person is doing. I think there's also a huge amount of work to be done on development tools and, again, development methodologies for doing that. There are emergent titles like data engineer: people who are explicitly focused on making sure data scientists have the right data at their fingertips at the right time, so they can ask and answer the questions that make a difference for the business. But there's also an emerging and related set of skills around DevOps, a development ops for data science: what does it mean to keep data-driven systems running, and how do we instrument them and measure them and make sure they're working properly? How do we actually know that one model in production is meaningfully performing better than another, and who actually builds that stuff? Because it's not a data scientist; it's somebody else. And, as we already mentioned, who is the person actually managing this team, what does their career path look like, and how do we measure their success?
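Drew's question about knowing whether one model in production is meaningfully performing better than another can be made concrete with a small sketch. The following is purely illustrative and not Alluvium's stack or Drew's method: it assumes two models scoring the same labeled production traffic, a sliding window of outcomes, and a rough two-proportion z-test; the class name, window size, and threshold are invented for the example.

```python
# Hypothetical sketch only: a champion model and a challenger model both
# score the same production requests; once the true outcome is known, we
# record whether each prediction was correct and ask if the challenger's
# windowed accuracy beats the champion's by more than sampling noise.
import math
from collections import deque


class ChampionChallengerMonitor:
    """Rolling comparison of two models over the last `window` requests."""

    def __init__(self, window: int = 5000):
        self.champion = deque(maxlen=window)    # 1 = correct, 0 = incorrect
        self.challenger = deque(maxlen=window)

    def record(self, champion_correct: bool, challenger_correct: bool) -> None:
        """Call once per scored request, after the ground truth arrives."""
        self.champion.append(int(champion_correct))
        self.challenger.append(int(challenger_correct))

    def challenger_is_better(self, z_threshold: float = 2.0) -> bool:
        """Crude two-proportion z-test on windowed accuracy.

        Because both models score the same requests, the samples are paired;
        a paired test such as McNemar's would be sharper. This is a sketch.
        """
        n = len(self.champion)
        if n < 100:  # not enough traffic yet to say anything
            return False
        p_champ = sum(self.champion) / n
        p_chall = sum(self.challenger) / n
        pooled = (p_champ + p_chall) / 2
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        if se == 0:
            return p_chall > p_champ
        return (p_chall - p_champ) / se > z_threshold
```

A monitor like this would typically sit next to whatever service logs predictions and ground-truth outcomes, with the decision to promote a challenger still made by a person rather than automatically.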
So, as a final question, I'd like to know if you have a final call to action for our listeners.

Yeah. I'll say: if the topics we've mentioned in terms of what we work on at Alluvium, or that interview process, sound appealing to you, please drop us a line at alluvium.io/careers. We're actively hiring for data scientists, both at the mid and senior levels as well as entry-level folks, plus a whole swath of other opportunities for back-end engineers, DevOps engineers, and product engineers. So that's definitely one. The other is: if you're at the very beginning of your career, or you're not even ready for a career yet because you're a college student or you're thinking about making a career shift, take one thing that I think I did really well when I was in your position, and just start writing and talking publicly about what you're doing. A lot of folks talk to me and say, "Oh, I don't like public speaking," or "I don't like putting myself out there; I'm kind of introverted." I think the best hack for an introvert is to actually get yourself out there, because then you don't have to spend a lot of energy going and talking to people: they'll come and talk to you. So start a blog, volunteer to speak at a meetup. Maybe it'll be a little scary the first time you do it, but it'll give you a sense of what the opportunities are, you'll build a network, you'll meet people, and I think you'll have a lot more success.

I couldn't agree more, and you'll notice how your approach, your ideas, and your conceptions change when you try to formulate them to communicate to others.

Absolutely. The old platitude is that you don't really know something until you have to teach it, and a baby step toward that is that you don't really know something until you have to present it at a meetup.

I like it, and I will put a link to the Alluvium careers page in the show notes as well.

Great, thank you.

Awesome, Drew, it's been an absolute pleasure having you on the show.

The pleasure was mine. Thanks for the great conversation.

Thanks for joining our conversation with Drew about how to build data science teams, along with the unique challenges of building data science products for industrial users. We saw Drew's vision of building data science teams as a set of individuals who collectively cover all aspects of the data science Venn diagram and can communicate across it to go from data to insights to actionable recommendations. We also got tremendous insight into Alluvium's recruitment process, which reflects the job itself as much as possible. On top of this, we saw just how much Alluvium's work of building data science products for industry requires a combination of tools from the data science stack and new methodologies to deal with streaming data that is higher in volume than that created by Twitter on a daily basis. Make sure to check out our next episode, a conversation with Mara Averick, data nerd at large and tidyverse development advocate at RStudio. Mara and I will talk about exactly what it means to be a data nerd, the role of data science in sports, data for social good, civic tech, and the role of data science paradigms such as the tidyverse in the data science ecosystem as a whole. Do not miss it. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and DataCamp at @DataCamp. You can find all our episodes and show notes at datacamp.com/community/podcast.