#225 The Full Stack Data Scientist | Savin Goyal, Co-Founder & CTO at Outerbounds

The Importance of Machine Learning Operations (MLOps) and Its Impact on Customer Experience

In today's fast-paced digital landscape, machine learning systems are becoming increasingly crucial for businesses to stay competitive. However, as machine learning models grow more complex, keeping them performing well depends on how they are maintained, scaled, and deployed. This is where Machine Learning Operations (MLOps) comes in: a set of practices that aim to improve the efficiency and effectiveness of machine learning systems.

One common failure mechanism in MLOps is a machine learning system going down and immediately degrading the customer experience. To avoid this, it's essential to design systems that fail gracefully: when a model is unavailable, the service falls back on sensible defaults or simple heuristics so that critical functionality keeps working. The customer may get a subpar experience, but the service is not broken outright.
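To make that pattern concrete, here is a minimal sketch of model inference wrapped in a fallback. Everything in it is illustrative: the model_client object, its predict signature, and the hard-coded fallback list are assumptions made for this example, not an API discussed in the episode.

```python
import logging

# Static heuristic used when the model can't answer, e.g., overall
# most-popular items (hypothetical values, for illustration only).
FALLBACK_RECOMMENDATIONS = ["top_item_1", "top_item_2", "top_item_3"]

def get_recommendations(user_id, model_client, timeout_s=0.2):
    """Return personalized recommendations, degrading gracefully to a
    static heuristic if the model service fails or times out."""
    try:
        # Primary path: live model inference with a tight timeout.
        return model_client.predict(user_id, timeout=timeout_s)
    except Exception as err:
        # Fallback path: critical functionality (showing *something*
        # sensible) survives even though personalization is lost.
        logging.warning("model unavailable for %s, falling back: %s",
                        user_id, err)
        return FALLBACK_RECOMMENDATIONS
```

The design choice is that the failure is absorbed at the service boundary: callers always get a usable answer, and the degraded path is cheap and has no dependency on the model being up.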

The importance of MLOps cannot be overstated, as it directly affects the customer experience and the overall success of a business. By investing in MLOps, organizations can ensure that their machine learning systems are running efficiently, effectively, and with minimal downtime. This leads to improved customer satisfaction, increased revenue, and a competitive edge in the market.

A key aspect of MLOps is experimentation. Machine learning models require ongoing testing, evaluation, and refinement to stay relevant and effective, and many models will never reach production at all; that is the normal course of data science. What matters is how quickly and reliably a team can move from one version of a model to the next, which in turn requires a clear plan, realistic expectations and time horizons, and the right support structure around data scientists.
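Fast iteration presupposes a record of what was tried. As a minimal sketch, assuming no particular experiment-tracking platform, even an append-only log of runs makes model versions comparable later (dedicated tools do this with far more rigor):

```python
import json
import time
import uuid
from pathlib import Path

def log_run(params: dict, metrics: dict, log_dir: str = "runs") -> str:
    """Append one experiment run (params + metrics) to a JSONL file so
    that model versions can be compared and revisited later."""
    Path(log_dir).mkdir(exist_ok=True)
    run = {
        "run_id": uuid.uuid4().hex[:8],
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "metrics": metrics,
    }
    with open(Path(log_dir) / "runs.jsonl", "a") as f:
        f.write(json.dumps(run) + "\n")
    return run["run_id"]

# Hypothetical usage with made-up hyperparameters and metrics:
# log_run({"lr": 0.01, "depth": 6}, {"auc": 0.81})
```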

To achieve success with MLOps, organizations need to invest in tooling from the outset. This can include infrastructure, software, and other resources that enable data scientists to experiment more effectively. By doing so, they can ensure that their machine learning models are continuously improving, adapting to changing customer needs, and delivering value to the business.
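One concrete example of such tooling is a workflow framework like Metaflow, the open-source project Goyal's team started at Netflix: a data scientist writes a pipeline as ordinary Python and can run the same code on a laptop or on cloud compute. The sketch below is a toy flow; the "training" step is a placeholder, not code from the episode.

```python
from metaflow import FlowSpec, step

class TrainingFlow(FlowSpec):
    """A toy Metaflow workflow; each @step runs as an isolated task
    whose artifacts (self.*) are versioned automatically."""

    @step
    def start(self):
        # Placeholder: load or generate training data.
        self.data = list(range(100))
        self.next(self.train)

    @step
    def train(self):
        # Placeholder "model". In a real flow this step could be scaled
        # out to cloud compute (e.g., via Metaflow's @resources decorator).
        self.model = sum(self.data) / len(self.data)
        self.next(self.end)

    @step
    def end(self):
        print(f"trained artifact: {self.model}")

if __name__ == "__main__":
    TrainingFlow()
```

Running `python training_flow.py run` executes the flow locally; dispatching the same flow to remote compute is a configuration change rather than a rewrite, which is what lets a single data scientist own the pipeline end to end.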

In conclusion, Machine Learning Operations (MLOps) is a critical part of any organization's machine learning strategy. By adopting best practices, investing in tooling, and embracing experimentation, organizations can ensure that their machine learning systems run efficiently, effectively, and with minimal downtime. As the conversation around MLOps continues to evolve, businesses should stay current with the latest trends, tools, and strategies to drive growth, innovation, and customer satisfaction.

Advice for Organizations Getting Started with Machine Learning in Production

If an organization wants to start putting its machine learning into production, there are several key points to keep in mind. First and foremost, it's essential to have a clear plan, including realistic expectations, time horizons, and the right support structure around data scientists. Without proper planning, organizations risk investing heavily in machine learning without seeing any tangible results.

Furthermore, it's crucial to ensure that data scientists have the necessary resources and infrastructure to experiment effectively. This can include tooling, software, and other resources that enable them to iterate quickly and continuously improve their models. By doing so, organizations can unlock the full potential of their machine learning systems and reap the benefits of their investments.

Another critical aspect of MLOps is experimentation itself. Machine learning models require ongoing testing, evaluation, and refinement to stay relevant and effective. Without proper experimentation, organizations risk falling behind their competitors and missing out on opportunities for growth and innovation.

To avoid this, organizations should prioritize experimentation in their machine learning efforts. This can involve investing in new tools and technologies, such as automated model management platforms or continuous integration/continuous deployment (CI/CD) pipelines. By doing so, data scientists can iterate more quickly, make better decisions, and deliver high-quality results that meet the needs of the business.
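As an illustration of what such a pipeline can enforce, a common pattern is a promotion gate: a check the CI/CD job runs that only promotes a candidate model when it beats the currently deployed baseline on held-out data. The sketch below is hypothetical; the sklearn-style score() interface, the metric, and the min_lift threshold are assumptions, not any specific platform's API.

```python
def evaluate(model, holdout_X, holdout_y):
    """Placeholder offline evaluation: assumes an sklearn-style model
    whose score() returns a quality metric where higher is better."""
    return model.score(holdout_X, holdout_y)

def promotion_gate(candidate, baseline, holdout_X, holdout_y, min_lift=0.01):
    """Promote only if the candidate beats the deployed baseline by a
    minimum margin, so evaluation noise alone can't trigger a rollout."""
    cand = evaluate(candidate, baseline and holdout_X or holdout_X, holdout_y)
    cand = evaluate(candidate, holdout_X, holdout_y)
    base = evaluate(baseline, holdout_X, holdout_y)
    print(f"candidate={cand:.4f} baseline={base:.4f} required_lift={min_lift}")
    return cand >= base + min_lift

# In a CI/CD job, a nonzero exit code blocks the deployment step:
# import sys
# if not promotion_gate(candidate, baseline, X_val, y_val):
#     sys.exit(1)
```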

Finally, it's essential to recognize that machine learning is not a silver bullet. It requires careful planning, execution, and ongoing maintenance to stay effective. Organizations should be aware of their own limitations and have realistic expectations about what can be achieved through machine learning. By doing so, they can avoid over-investing in solutions that may not deliver the promised results.

In summary, getting machine learning into production requires a combination of clear planning, experimentation, and the right support structure around data scientists. Organizations should prioritize these elements to unlock the full potential of their machine learning systems and drive growth, innovation, and customer satisfaction.

The Future of MLOps: Embracing Gen AI and Predictive AI

The current conversation around MLOps is shifting towards a more mature understanding of its role in driving business success. With the emergence of gen AI alongside predictive AI, organizations are recognizing that neither approach alone guarantees success; at the end of the day, you need to build machine learning systems, often combining multiple models that work together to deliver exceptional customer experiences.

This newfound maturity has also led to increased excitement around MLOps. As data scientists become more aware of the importance of experimentation, collaboration, and continuous improvement, organizations are reaping the benefits of their investments in MLOps. By understanding the value of diverse machine learning models working together, businesses can unlock new opportunities for growth, innovation, and customer satisfaction.

In conclusion, the future of MLOps looks bright, with generative and predictive AI used side by side. As these technologies continue to evolve and improve, organizations will be able to deliver more sophisticated, personalized, and effective machine learning systems that drive business success. By staying current with the latest trends, tools, and strategies, businesses can position themselves for long-term growth.

The Importance of MLOps in Ensuring a Good Customer Experience

Machine Learning Operations (MLOps) plays a critical role in ensuring a good customer experience in today's digital age. With the increasing adoption of machine learning models, organizations must prioritize robust and reliable MLOps practices to keep their systems running efficiently and effectively.

One of the primary concerns for businesses is maintaining service quality even when certain features or models are not working properly. As discussed above, this requires designing systems that fail gracefully, falling back on sensible defaults or heuristics so that critical functionality stays intact. By adopting this approach, organizations can minimize the customer-facing impact of downtime and ensure that customers continue to receive a usable, if degraded, experience.

In summary, MLOps plays a critical role in protecting the customer experience. Organizations should focus on developing robust and reliable MLOps practices, designing for graceful failure so that their systems keep running efficiently and effectively, with minimal downtime and maximum value delivered to customers.

"WEBVTTKind: captionsLanguage: enwhat I'm sort of like quite excited by these days is that I think um that maturity in conversation is coming back in the sense that I think now people are understanding that uh gen AI plus predictive AI is a thing it's not as if sort of like you know one strategy or one approach is going to completely up in at the end of the day you need to build machine Learning Systems hi svin great to have you on the show yeah thanks Richie thanks for having me excellent so uh today we're talking about data science in production but what does that actually mean uh to be in production it's in the eyes of the beholder at the end of the day um production is a spectrum it can mean many different things to different individuals uh in the same organization depending on the maturity of your project your definition of production could be quite different um I remember uh back at Netflix uh you know we had sort of like of course uh the Netflix recommendation system is perhaps one of the most famous uh machine Learning System that's out there and you can imagine I think everybody would agree that the definition of shipping a model to production means that you're able to run an AB test against that model uh against sort of like you know live user traffic and that sort of like as a good definition of production but there were a lot of other data informed decisions that the organization was also making and I'm pretty sure that's the same thing across many other companies too where the output of the model informs a business strategy so in that way uh even an offline analysis of a model where you end up generating a Google doc or end up writing a memo for consumption by let's say the business team can go a long way I'm pretty sure a lot of companies in the consumer space uh for example Spotify I believe recently raised their prices or is about to raise prices and there would have been a lot of um stat iCal analysis and very likely machine learning that might have also gone in in terms of uh figuring out what kind of pricing strategy or pricing changes across geographies would work well and that's yet another example of putting models into production uh where no microservices involved uh no high scale low latency inferencing uh is involved but still is as critical to the business as any other machine learning use case okay yeah so it's really it's not just about like making putting some data science in a consumer facing app it could just be oh well this is going to be a change prices or something like that I like that um all right so can you talk me through what are the steps that need to happen to go from this is an internal project this is going to be something customer facing or otherwise in production I mean I think at the end of the day the first question always is that is this model even worthy of being in production right I think there's sort of like a popular statistic uh I don't even know if it is accurate that says that you know there's like X number of models never make their way into production and that X number is obscenely high uh there may well be some truth to it but I think one thing that sort of like you know is most important to understand is that in data science there is an expectation that many of your models will never make their way into production because you run some tests you figure out that yes you know this model isn't really sort of like uh worthy of promoting to production or sort of like you know continuing any further development on top of it and that's 
okay that's the nature of data science um and what's most important is that are you able to move from one model to the next version of the same model very quickly and effectively or not uh so that's that's basically my viewpoint uh when it comes to figuring out sort of like you know how to promote a model or what does sort of like you know production actually sort of like mean and um now to your original question of sort of like you know what are the different levels uh that are involved uh so if you think about the life cycle of a machine learning project so that's sort of like you know starts out with data you have certain hypothesis uh you look at the data you train a bunch of models and uh then you figure out that okay like um in what capacity would you want to consume these models and uh if it's a consumer setting where you can run uh decently powered AB tests then uh quite often that's sort of of like you know is a good Next Step but um that may or may not be feasible at times uh there are times when you know you just you just cannot run AB tests you're not in a situation uh to do that um and then you can be a little bit more uh creative in terms of running quasi tests um but when that's also not feasible then we usually see sort of like you know people invest a lot more heavily around explainability of their outcomes uh so uh for example one of the more popular uh machine learning application uh Netflix um is this Suite of models called crystal ball that allows Netflix to predict uh the valuation the dollar valuation of any piece of content so that they can be a lot more disciplined in terms of how they construct their portfolio and creating a counterfactual in that particular scenario uh is going to be very tricky as well as uh running an AB test uh can be quite difficult as well and in those scenarios then you sort of like you know um rely on the explainability of the outcomes of your model and then you rely on sort of like your business Judgment at the end of the day and that sort of like you know it's yet another mechanism of figuring out whether the model is worthy of being deployed or not okay so um really a lot of it's about just making sure that you have something that's going to add value uh before you put it in in a place where customers are going to see it okay um so before we get into the details of how you go about doing all this um I'd like to hear some success stories so have you come across any cases where companies have sort of invested in putting their data code into production and then they've seen some sort of benefit yeah yeah um I think uh you know Netflix has always been sort of like the poster child for machine learning adoption and this was the like maybe a decade ago or something when they first uh spoke publicly about the impact of recommendation system on their bottom line and even at that point I think there was a paper at one of the more prominent machine learning conferences where um basically we came to the conclusion that the Netflix recommendation system was helpful to the tune of more than a billion dollars uh to the company's Top Line and that's that's really impactful and and now basically if we sort of like you know move forward a decade later where exactly sort of like the biggest impact has been made have been sort of like the long taale of machine learning applications um a lot of organizations of course you know if you're let's say n new to machine learning then you end up focusing on your tin pole use cases so for example for Netflix it 
would be their recommendation systems for Facebook or Google it would be their advertising or search systems and that's great that's where you'll derive sort of like you know the larg just uh value but as you sort of like mature up then there are a lot of internal business processes a lot of other uh consumer interfaces uh that can be made a lot better with machine learning and the aggregate effect of that usually ends up surpassing uh the tent pole use case and that that is something that we have seen quite often uh not only at Netflix but of course you know we have a popular open source project called metaflow uh which sort of like helps organizations build machine learning models and deploy that in production as well so we've been able to sort of like see uh the impact uh across other companies too so um there are like companies like 23 and me uh who um use machine learning as well as data science uh to uh predict uh you know uh traits and outcomes uh using genetic data and that has been uh quite remarkable uh in terms of um just doing First Rate uh medical research we have companies like you know in the similar space like metronics uh we been focused on surgical Ai and that's yet another area if you just look at drug Discovery or automated surgeries the amount of um value that can be generated uh and the actual real world impact uh to people's lives is is really great uh another example uh that I came across during the pandemic uh was this company called zipline uh they buil automated drones uh that are responsible for bulk of drug supplies including covid vaccines in subs Southern Africa and uh their drones are model uh like they are powered through machine learning algorithms as well and at the end of the day of course you know we look at sort of like uh the big Tech sector and how sort of like you know uh interesting consumer applications are made U that sort of like you know is always is uh getting the Limelight but when you look at sort of like the real world impact that is actually really sort of like uh shaping uh people's day-to-day um I think that's that's been sort of like you know a lot more rewarding and enriching to just see sort of like from the sidelines at times absolutely so some very different examples there I mean uh I suppose the the Netflix example where you're getting a better movie recommendation it's like it's a small benefit to a lot of people there because it's scaled so much whereas on the other extent the idea of like um a AI surgeon that's like that's helping one person a lot so you got that um the very different use cases there um all right so uh let's get into a bit of depth on uh some of these steps that are involved in um like before you deploy your code so um what sort of tests do you need to perform on your model be so that you know it's ready to go live I it it depends uh you know as with anything else um so you can imagine the bar I mean of course you know like now nowadays if let's say the Netflix recommendation starts aing crappy recommendations then uh it'll become a social mean so in many ways the amount of due diligence that somebody needs to push out uh a new machine learning model for recommendation systems for popular services like Netflix Spotify and whatnot is sort of like considerably higher but nobody is really going to lose their life right I mean it's it's not a life altering thing if you push out a bad ml model versus if you're looking at let's say the healthcare space or the fintech space then any kind of bias or any kind of issues 
with the model uh can have life-altering outcomes uh so the bar sort of like you know really sort of like goes up significantly and um fortunately or unfortunately that sort of like varies from uh industry to Industry as well as the application that you're sort of like actually using it so there isn't sort of like a one- siiz fit all uh answer but the most important thing always is that as a data scientist or as a data professional do you have full understanding full view of uh how exactly a model is actually going to be used and are you able to sort of like control and observe those outcomes and really sort of like iterate from that and we've seen that as sort of like you know one of the more primary uh failure mechanisms where you can imagine U if let's say putting a model to production or doing data science involves many many individuals in an organization on one single project project then uh many of these important concerns become nobody's problem because there's no one single individual who has complete control over the end to-end process and that's been sort of like one of the biggest uh push that we have been advocating for that how do you essentially make your data scientist a full stack data scientist or an endtoend data scientist so that they have uh full control uh around the model the kind of data that they are uh building these models on when are these models actually getting rolled out what are the impact of these models on real world outcomes and not just sort of like you know statistical measures okay so this um interesting she's saying one of the biggest problems then uh in terms of um maybe like quality control is going to be the fact that there's lots of different people doing different stages of the workflow so in order to eliminate that you have one person who is responsible like from end to end uh this is give be a full stack data scientist upun uh do you want to tell me more about this role and what it would involve yeah yeah so so you can imagine you know like uh if we let's say hearken back to the 70s or the 80s and I mean this might even be true uh today as well in many places um shipping software was really expensive uh shipping software of any kind uh so of course you know you had sort of like your development team composed of software engineers then you had a testing team composed of QA engineers then you had a different group of people who are focused on shipping software uh through sort of like you know uh these release managers and then you had your database admins and application Architects and um sres in the mix as well and that sort of like raise the cost of doing anything in software engineering of course there are plenty of areas where uh having uh specialized roles makes uh a lot more sense uh of course definitely you know when you sort of like get to a certain level of scale as well uh it's a lot more important to make sure that you have multiple eyes but if you have um five or seven of these different roles involved for every single project than doing even the most simple thing becomes a lot more expensive a lot more time consuming and uh that's been sort of like you know one of the big promises um of the modern devops movement as well as the Advent of the cloud that how can you have basically one single software engineer who can run the entire gamut I mean of course you know there isn't an expectation that overnight they're also sort of like you know significantly skilled at building uh you know front end as well as backend and sort of like you 
know being an expert around cicd but the tooling uh has evolved and matured where you can have an expectation that a small software engineering team can return outsize returns and um we're basically hoping for the same thing uh on the data science side of the house as well that how do you basically ensure that it's not only sort of like you know these 10 pole use cases that are reserved for machine learning and you can ensure that a single data scientist or a small team of data scientist uh can really sort of like deliver outsized business impact by running the entire life cycle of a data science project right so imagine sort of like all the way from how do you figure out uh what data does an organization have is that data available in the right format does that data have the right quality can I access that data in an interactive manner so that I can play around with that data and really sort of like understand um what are the possibilities that that data has encoded then at the same time do I bring in sufficient uh business skills to the table do I understand the business perspective that my organization is involved in so that I can marry my data science skills uh with this business perspective and uh start to figure out you know what are the best ways for me to optimize this specific business problem and what we have seen historically is that data scientists come from uh quantitative disciplines uh that may not be software engineering related right and that's where sort of like one of the first uh bottleneck uh presents itself that if let's say you're playing around with a lot of data or if you're in an organization where you know everything needs to happen uh within a certain specific governance boundary of either your data warehouse or within a specific cloud and how do you interface with that cloud how do you sort of like you know uh interface with all of the engineering and uh business architecture complexity uh is sort of like you know one of the big areas where uh people see declines in productivity and that then sort of like necessitate specialized roles to sort of like come in and you oftentimes end up in a problem where Things Fall look tracks or certain important things that are like okay you know like how do you figure out if the right thing is happening unfortunately becomes nobody's problem because there's no one single individual who has complete inin perspective uh so that was like one of the big reasons why we ended up creating uh metaflow back at Netflix which is sort of like uh an open source machine learning platform and it's geared towards ensuring that a data scientist can become this full stack into a data scientist and they can control more of their destiny and that should in theory then allow them to iterate uh on their machine learning models on their data science projects a lot more quickly and uh in that scenario you then also able to sort of like you know gradually move the interface between data science teams and other software engineering teams if you realize like a decade ago when I started my career um data scientist uh role used to be limited to prototyping on a laptop and then there would be teams of Engineers who would take that prototyp I scale it out and then deploy it and off you go right and the scenario there was that of course you know as you scale out your model training on a lot of data then many of the statistical guarantees that a data scientist was looking out for May no longer hold true and uh the software engineering team doesn't really 
understand the intention behind anything that a data scientist was trying to do the data scientist has no idea what got shipped into production and uh it was sort of like anybody's guess if even the right thing was happening in production and now we are sort of like getting into a point where uh a data scientist should be able to let's say expose an API to their work maybe that API is a model maybe that API is an actual rest endpoint that you can call into maybe that API is a memo that has been written for consumption by sort of like you know other teams but at least sort of like you know that provides a lot more control a lot more visibility uh to a data science team and sort of like shipping up their work lots to cover there um and we'll certainly get into metlow later but um it sounds like when you got this idea of full data science they need the data science skills they need software engineering skills they need business skills this seems like it's going to be quite difficult to hire for and I'm wondering whether there's a tradeoff between having a large number of more specialized individuals versus having this generalist who can do everything do you want to tell me how like um a data team's going to be comprised then would you I presume you wouldn't want all full St data scientist or all uh more specialized individuals you can to mix the two like how does it work yeah yeah I mean I think the answer uh is better tooling at the end of the day uh of course if the expectation is that we are able to sort of like you know find somebody who is uh really great at engineering really great at data science really great at uh business sense uh that's a proverbial unicorn uh you may be able to find a few but definitely that's not sort of like you know uh a strategy that can scale out and and uh now definitely uh two out of three uh is something that would be desirable finding somebody who is sort of like you know equally good at data science as well as understanding the business intimately I think uh that in many ways is sort of like the minimum bar on hiring an exceptional data science is uh but paired with great tooling uh you can definitely sort of like ensure that the level of abstraction that they are working on uh at least The Accidental complexities of the cloud can be taken care of uh for them and and then uh you can expect that this particular data scientist to then sort of like you know take care of the business complexities and the data science complexities uh so that's that's basically sort of like is where I see many uh data science teams to be moving towards as well uh I think you know that's sort of like one of the big reasons why uh many companies have also invested in their internal ml platform teams as well uh where the entire prerogative of these themes is to provide these set of tools internally sort of like you know a point of Leverage um so that every single data scientist is insulated from the harity of engineering uh but the interesting sort of like Dynamic there is that how do you build tools that are really good at navigating uh around this fact that there is some amount of complexity that a data scientist would want to take care of and then there is some amount of complexity that they would want the tool to take care of and how do you sort of like thread that balance it's always an interesting question okay definitely uh and yeah I definitely wanted to talk about Tools in a bit but uh just to press you on this idea of teams then so if someone says okay I'm a regular data 
I'm a regular data scientist I want to become a full stack data scientist um what needs to happen to take that extra step I think the question there is that if you're not this full stack data scientist then what kind of data scientist are you and um my colleague uh here at outer bounds uh sort of like you know his favorite term um is um laptop data scientist uh somebody who is like you know very well Adept with let's say the pythonic ecosystem or just like you know uh everything that's available on the machine learning side of the house and is uh able to understand uh the characteristics of the data and uh sort of like you know get their work done so there's sort of like that aspect and then on the other uh side of the house you need to figure out as an organization how do you actually ship value through data science right and then there's sort of like the Gul of complexity that you need to cross in between and um one question or like one way is that yes you know like you become equipped at um handling that Gulf of complexity all by yourself and example clear-cut example here would be let's say uh you want to train uh many different machine learning models uh using gpus and um you're sort of like you know constantly I trading on sort of like you know different hypothesis and uh that GPU form very likely is not going to be your laptop right like it's going to be something uh in the cloud uh let's assume that you know it's kubernetes the most popular uh computer orchestrators out there now on one side I can have an expectation that maybe my data scientist understands the integrities of kubernetes ecosystem and how to sort of like you know run these kubernetes spots reliably and manage them and monitor them and when things go wrong knows his or her way around uh debugging um these failures uh but it's it's a very complicated landscape and unfortunately if it was only kubernetes that uh people had to worry about life would still be easy then you also have to worry about that okay I have my data that needs to come from somewhere needs to go somewhere else couple that with kubernetes how do I sort of like think about that I'm constantly experimenting reproducibility can be a big problem if let's say my colleague is running into a failure I'm supposed to help them out but if I cannot reproduce or replicate that same failure what are my odds of even sort of like you know being capable of helping them out one bit so the complexity very quickly multiplies and uh there are now multiple tools in the space that sort of like help uh in this specific area so uh making themselves well apprised uh of the latest and greatest tooling would be one I think as practicing software Engineers um it falls on us as well to sort of like really understand where the world is headed uh what are the new paradigms uh around engineering and I think it's the same thing that I think most data scientists also understand that uh for them to sort of like stay relevant uh they also need to sort of like equip themselves with if not necessarily the details of every single thing out there uh at least sort of like understanding what are the layers of abstractions that are available on top of these building blocks that can help them uh get their work done okay so uh really just make sure that you're on top of like the latest tools that's going to stand you in good setad for um improving your skills okay all right so um let's talk about metaflow since uh since you're the creator of it uh so uh to begin with can you tell me what does 
metaflow do yeah uh so so metaflow is an open source ml platform uh to put it successfully it helps you train deploy uh ml models and uh Target it towards essentially building ml systems I think you know in this conversation we have spoken a lot about data science and machine learning per se but at the end of the day an organization is trying to build a system and a machine learning model is only a part of that system so so how do you basically get to building these systems which can often times be complex uh they may cross uh Team boundaries uh they may interface with uh significant engineering infrastructure how do you basically ensure that um a data scientist or a team of data scientists is capable of doing all of that is basically what metaflow uh stries to do okay so it's just going to help you uh take the sort of steps into getting your code into production um now I know there are just dozens and dozens of mlops tools so can you talk me through how metaflow fits into this larger ecosystem of tools yeah yeah so I I can walk you through um what sort of like you know even prompted us uh to start the project in the first place um now of course you know many of these uh tools uh they have taken a life of their own and sort of like you know they uh cater to uh different markets or uh different use cases uh what meta flu is targeted towards is a practicing data scientist so it is not a low code no code solution it is a solution that's targeted towards a data scientist who either understands python or R really well so and um also brings in that data science understanding to the table so we are not in the business of teaching people uh how to do data science uh metaflow is a tool that enables people to do data science well uh so that's that's sort of like you know is the big uh thing here um so we started working metaflow back in 2017 so gosh it's like now close to seven years and we were at a spot uh back at Netflix where uh Netflix was now looking into investing in machine learning across the entire life cycle of their business right so not only sort of like how do you do recommendations well but how do you construct a portfolio of content uh that is going to drive your subscription base uh higher and higher up uh how do you figure out what is the best content that's available uh how do you leverage economies of scale in either licensing or producing that content how do you take these bits and stream it to people's TVs and mobile devices so that they have an amazing uh streaming experience uh how do you fight fraud how do you take care of pricing challenges you can imagine you know like if you start thinking about all the places where you can start investing in from a machine learning standpoint um at a company like Netflix and it's really sort of like you know many ways like a kid in a Candy Land and the other sort of like interesting aspect was that while Netflix is usually lumped into sort of like you know this cohort of fang companies and there sort of like a connotation with Fang scale I think Netflix is a lot more closer uh to your average uh company that is on the public Cloud uh but it's sort of like you know just a whole bunch of different interesting problems that Netflix is trying to solve that like adds to that complexity so we were now getting to a spot where the solutions the tools that we had built for our recommendation systems had sort of like served us really well uh they were predominantly built on top of the jvm stack that was sort of like really popular uh in sort of 
like you know uh the uh early to mid uh 2010s and now we were coming to a spot where the number of people who were excellent at engineering at data science at business uh were very few and of course now if you have to start investing in many different areas of data science you have to then sort of like you know pick and choose your battle and we said that okay of course you know like we can't really sort of like skam on hiring the very best data science Talent uh that's available uh but then of course you know somebody has to come in and uh really sort of like PVE over uh that Gulf of engineering complexity and uh then the goal for us was that okay how do we basically realize the stream of making our data scientists full stack data scientists how do we basically provide them Solutions where uh they can get all of their work done on their laptop uh but we can bring the cloud to their laptop right so so you can basically sort of like imagine can I sort of like you know provide uh hundreds of thousands of CPU cores and hundreds and thousands of gpus and uh terabytes and pedabytes of ram to your laptop uh so that you don't have to become a cloud engineer uh to scale out your machine learning projects uh how do we ensure that people can take a graduated approach because you can imagine you know like not every single project uh will be a humongous scale machine learning project but then at the same time every single machine learning project will help uh uh or will will go sort of like you know much better if there is some amount of discipline picked in right I think there have been plenty of times when uh we have run into this issue where something works on my laptop but does not work on my colleague's laptop or uh I'm able to let's say you know install a version of pyarge today but uh not able to install the same version of pyarge two days later or you know a transitive dependency has changed and something is like subtly off and I'm sort of like you know trying to figure out what went wrong where so it's sort of like like a barrage of small little problems as well as you know some rather nefarious problems as well uh particularly on the computer and the data side that you have to sort of like you know start worrying about as a data scientist and uh our goal was that okay can we ensure that they don't necessarily have to worry about it and can they just like squarely Focus their efforts and energy on rangling the data science complexity that's what their expertise is in and if let's say Netflix as an organization is expecting them to spend a lot much more time wrangling of like you know nefarious issues with like Hey how do I move data from my data warehouse to my GPU instance so that my GPU Cycles are not wasted that's that's not a thing that a data scientist should be focused on very early in U data science project because you don't even know at that particular point if the approach that you are taking for your machine learning model is even worth it but if you sort of like you know take an aggregate view if you have hundreds of data scientists all running gpus suboptimally then that sort of like expense really adds up uh to a non-trivial number that as a platform engineering team I do indeed have to care about but if you can codify all the best practices and uh provide a user experience that is a lot more humancentric that works with the data scientist that a data scientist doesn't have to fight against then it becomes a lot more easier where by default all the right things happen on the 
engineering side of the house the data scientists their freedom of choice uh in terms of how they want to navigate the web of data science complexity is preserved and the organization then sort of like you know gets to benefit both from uh cost optimization because ml can become expensive at times if you're not careful about it as well as making sure that you know you're able to innovate uh quite actively and sort of like you know quite quickly one thing you mentioned is um the idea of working on a laptop and so there's been a huge Trend in the last decade or more about everything is going to the cloud so the idea of working on a laptop sometimes but also having access to like these you know large scale compute that's in the cloud um that sort of indicates some kind of hybrid Computing is is that the approach you push for or um is is there like a reversal of the incloud trend yeah yeah no I mean of course you know there are like many benefits of being in the cloud for example uh if let's say all your data is in the cloud you don't want any of that data to ever leave the cloud it's like one big reason purely from a security standpoint why everything would want to happen in the cloud but your laptop can still be the interface to that cloud right um so from that point of view you might still be accessing all your resources through the laptop but the code that you might be writing might actually uh be running entirely in the cloud and the data may never actually sort of like you know show up on your laptop and everything might happen sort of like through your ID or through your browser so that's sort of like one universe and then the other universes at times I mean you know like the problem may not be sort of like uh the problem that you're trying to solve require uh very steep computational resources or managing a lot of data right uh the data could be enough to F fit in your laptop it may not be sort of like super sensitive uh you could use something like psych learn or you know like many other sort of like popular Frameworks uh to build your machine learning models I mean the number of things that my MacBook can do these days I mean it's just like um Beyond imagination uh but then what you still sort of like need at that particular moment is uh still some discipline right uh you still want to figure out how you're going to catalog your experiments you still want to figure out what is the best mechanism uh for you to ensure reproducibility uh so that you know you're able to sort of like understand how your models are behaving or if you need to course correct then you're able to do that easily and at times let's say you know many times one definition of uh productionizing your model uh can be that okay whatever work you have done on your laptop now of course at the end of the day you're going to shut your laptop close and you're going to go back home but maybe you want to sort of like you know run that model training process every single night every single week maybe when new data shows up how do you sort of like you know push that into the cloud and I mean it could be your on-frame infrastructure as well right but basically how do you sort of like you know take something that is running on your laptop and reliably run it elsewhere uh that can be sort of like one definition of productionizing your machine learning model uh in a variety of projects and um that can be a big um activity uh where a lot of organizations May sync a lot of resources where the data scientist was able to prototype 
something on their laptop but now just this process of converting their wees into something that can be run outside their laptop can be an activity that is sort of like measured in months or quarters and uh for us that was another goal that okay can we sort of like take all the work that a data scientist is doing whether that work is in the cloud or whether it is in the laptop but it is available in a format such that it can be run anywhere else uh almost immediately so so you don't have to sort of like any at least then uh worry about this process of like okay I had something that was running but now I need to go back to my manager and sort of like can ask for another one month before so that I can come back address all of my pain points and issues and refactor my code so that it's sort of like not worthy to P in production but can we sort of like you know just flip the script such that the infrastructure basically allows you to do all the right things from the outside okay yeah certainly that's um like anything involving package management like environment envir and just having things not working from one machine to another that can be incredibly frustrating uh if you've got a deal with this manually um so the idea of reproduc reproducibility is one important uh aspect of getting things in production um another thing seems to be scalability so um once you C go into production like you maybe got um it being your model being accessed by millions of people um how do you make sure that your code is going to be um scalable I think in many cases especially sort of like you know on the consumer side of the house many times it might not be feasible to understand what kind of scalability requirements you're gunning for um in the first place right in in many cases it is indeed possible especially if let's say you know this model is a subsequent version of a previous model but if you're on a net new project um especially on the consumer side of the house um with virality loops involved uh the kind of scale that you may run into uh May quite be unpredictable um I mean at the end of the day it sort of like all boils down to uh the question of project management and just like software engineering skills that if you let's say deploying a model let's say in this case if you're talking about um think let's let's think about let's say you know um recommendation systems right I mean because that's one area that I'm very well familiar with uh if you think through let's say users tastes and preferences right so if you're let's say on Spotify or if you're on Netflix then uh there isn't sort of like a lot of brand new content that is coming in very very quickly as well as your tastes are not really changing uh very quickly either and you already know what your entire user base looks like and what their preferences are so you can pre-compute those recommendations and then just serve those recommendations from a database right so you're not sort of like doing any kind of live uh model inferencing and that has like you know amazing scalability benefits it's a very simple straightforward approach but of course it may or may not work in many use cases um so my recommendation for f is if you already understand what your scalability uh metrics are that you're trying to achieve then there's always sort of like um an architecture uh that's possible uh of course you know the amount of expenditure that you're willing to uh sort of like um incur in that project is also a big input to that uh but don't overthink it uh 
don't prematurely optimize it um and um there's there sort of like you know plenty of hacks and approaches that people can take to at least sort of like buy more time uh before you really understand sort of like you know what kind of uh next scalability benchmarks you need to be sort of like going for um I think there's sort of like a similar scalability hurdle uh that is present on the model training side as well uh at times and um that's that's sort of like you know in many ways a silent killer for many organizations where uh usually sort of like the deployed model of course you know it's sort of like generating business value so you want to really make sure that that sort of like works well and there's like a lot of um uh light that's shown on sort of like those use cases and then you have let's say some models that you're training that would be consumed directly in production and people are also sort of like you know ensuring that yes you have let's say all sorts of alerts and observability that's sort of like instrumented on so that those models are actually generated on time uh with sort of like you know reliable results but the third sort of like bucket where you have teams of data scientists who are actively experimenting uh that can be a big uh cost Vector as well and what we have seen many times is that the overall Cloud costs for experimentation can be orders of magnitude higher than your um cost uh that a deployed model is incurring at times because you may have like you know hundreds of models that you're experimenting with but only a few models or maybe tens of models that are sort of like deployed uh in production and um the UN reality is that with experimental models it's also very hard uh to sort of like you know ensure that uh from a cost efficiency standpoint uh you're sort of like you know as a data scientist you have uh complete awareness of how you want to actually optimize that model training or if you want to let's say scale out that model training uh then what kind of engineering effort would be needed as well so one uh interesting sort of like example that comes to mind here is that I was working with a data scientist and they basically uh guess you can sort of like debate whether this was a good reason or not but they were basically trying to um predict what content is going to be popular in any given geography uh at any given moment so that uh the content distribution Network the CDN infrastructure can be seeded with the right kind of content right so imagine if let's say you know your or a company like Spotify and you know that certain kind of music is going to be popular in certain geography at certain hours or if you're a company like Netflix who is releasing a brand new show and they have done a significant amount of marketing in let's say Australia then you would want to make sure that people have an amazing experience streaming that content and you don't run into sort of like rebuffering neighborhood in the world and they decided to build 60,000 models in one shot each of those model required a container with a GPU and now this is immense amount of compute that you're running right and ahead of time you don't even know what is the ROI of this entire effort going to be uh and if let's say it is just to a data scientist it can be a significant engineering challenge all by itself that how do you run this much amount of compute uh without being uh a professional Cloud engineer and even for a single professional Cloud engineer this s like you know often times 
bridge too far and um with metaflow they were able to run that compute very seamlessly and at the end of the day they sort of like you know also got a nice bill of like okay here's how much money that you have spent and of course you know like uh when you start spending so much amount of money some eyebrows are often times raised and uh so of course you know people sort of like wanted to understand uh whether uh the spend was sort of like worth it and you can imagine you know like in this case uh obviously uh the amount of capital that this company was able to save in terms of sort of like their CDN optimization was well well worth this expense of sort of like you know running sort of like 60,000 gpus fully engaged for like multiple days um many times it may not be um and just like you know having that perspective uh at times as well that okay you may want to scale but is that scale actually linked to your business outcomes or not can be a lot more important too yeah I can see how you certainly want to just speak to some other people before you um fire up 60,000 gpus and run them for a few days yeah probably get best to get that business alignment first um so um I'm getting the big theme of that is like just um don't do calculations that you don't need to do and just make sure you have metrics around like how performant you need things to be um so I think we talked a little bit about reproducibility but scalability the other aspect of things in production seems to be robustness because as soon as things such users they're going to give you stup impuls things will behave a weird ways um do you have any tips for how to make your um your data programs models whatever uh more robust yeah so I mean of course you know like we have to think through robustness through the entire sort of like layer of the ml infrastructure stack uh in many ways so of course there sort of like uh just a thing that okay does your model actually encapsulate the behavior that it is trying to predict uh have you taken into account sort of like you know things like seasonality and all of that so um I sort of like put all of those concerns on the data science side of the house that are data scientist needs to sort of like worry about in many ways versus as a tool Builder what I sort of like personally um love to focus more on is that is the underlying infrastructure robust enough uh because many times if your infrastructure gives way and you're unable to let's say generate a fresher version of the model then your model performance is going to a hit and that will have a direct sort of like you know hit to the business kpis as well and then the question sort of like becomes said okay how do you sort of like get to a point where your infrastructure is robust but then more importantly of course in the age of cloud you can't sort of like promise 100% up time for any piece of infrastructure and you can sort of like increase your robustness rates but I think as you pointed out right like dependency management can be a big issue so today you are able to install pyos tomorrow you may not be able to install the same version the exact same way and what do you do at that particular point and the big question then becomes is how do you quickly recover from these errors or how do you quickly recover from these failure mechanisms if let's say you have um so let's say you have a training pipeline uh to train a model but that training pipeline to train that model depends on yet another Upstream pipeline that is generating some embeddings 
now if that uh embeddings pipeline is failing for whatever reason of course your Downstream uh model training pipeline cannot execute and there may be sort of like you know other processes as well that are dependent on it so it becomes very imperative to figure out what is the quickest way to be able to diagnose what went wrong with your embedding Pipeline and how do you basically recover from that failure so that your subsequent pipelines can start sort of like you know executing and that is often times one of the areas that can be quite underinvested in an organization where uh doing machine learning is so difficult at times involves so many different moving pieces that the focus is on getting the happy path working and not necessarily focusing a lot on when that happy path is not quite happy when failures happen how do you sort of you recover from that and uh the complexity arises from the fact that so many different things can fail uh right I mean a lot of people sort of like um these days uh they try to find cheaper GPU resources so they end up going to let's say you know a cloud provider that might not be sort of like one of the more prominent hyperscalers and then they unfortunately realize that the machine that they are buying it was advertised to have four gpus attached to it but unfortunately only has three working GPU drivers and that's why sort of like you know certain things are slow or failing and then figur out how to recover from it right or um your data uh changed uh for whatever reason and now middle of the night you have to wake up and you have to sort of like like you know step through your work and even sort of like replicating what failed can be at times really tricky and then figure out that okay this was the actual change in data uh that sort of like uh was the cause for trouble either patch your pipeline or wake up the person who was responsible for the Upstream data pipeline so that they can sort of like fix the error and that's that BEC sort of like you know one of the bigger themes that as an organization how do you basically recover from these errors your uh MTR how do you sort of like lower that down and that was like one area that we sort of like uh focused on quite uh a lot and I think that sort of like also pairs into this notion of reproducibility as well that uh while yes you want to reproduce the good behavior of the model so that you have more trust in that model but you also need to be able to reproduce the failure patterns uh somewhat reliabil of course I mean there's a lot of stochastic ISM involved in failures as well but at least for certain class of failures if you're able to reproduce them reliably then you also stand a short and uh being able to fix those uh quite quickly andile okay so I know being on call is like a standard feature of being a software engineer waking up at like 3:00 a.m. to try and fix some data pipeline or even worse having to wake your colleagues up as well to say okay can you help you debug this at 3: a.m. 
that seems like something I wouldn't want to do on a regular basis so can you talk me through like what sort of processes should you put in place to make sure that you don't have regular failures like how can you improve that reliability yeah I mean the thing is at the end of the day you know if let's say you have like one simple uh strategy here is if you have a pipeline that is running let's say every week and if it is super critical then you may want to run that pipeline every day but only consume the output every week right so that then it Le sort of like you know if it fails in between then it's not sort of like something that you need to wake up in the middle of the night you have an entire business day or an entire week or like half a week on average to actually sort of like fix up the issue before sort of like it becomes a burning issue so playing with sort of like you know that frequency um sort of like Arbitrage uh on the training side uh is almost always sort of like useful of course uh it comes with sort of like you know uh extra cost as well and many times that cost May well be worth it uh because if it is not sort of like super urgent then why sort of like wake up in the middle of the night you can sort of like wake up during business hours and address it on the uh model inferencing side um I think this is sort of like you know many many techniques exist from the software engineering standpoint uh but the one thing always is that you know always have sensible defaults so for whatever reason your machine learning system is down you can always sort of like fall back on htics or certain sort of like rules uh so that the End customer experience isn't impacted I think that's sort of like you know one of the more common failure mechanisms that uh I end up seeing unfortunately where because a machine learning system is down uh the sort of like an immediate customer impact or something is broken many times it's unavoidable but uh definitely sort of like um the systems can be designed in such a way that the user may get a subpar experience but at least the critical functionality is not entirely impair okay so you want to kind of fail gracefully and just like maybe cut off one bit that's not working and have everything else work all right nice okay so uh before we wrap up what are you most excited about in the world of mlops at the moment yeah so I think uh you know definitely couple of years ago uh especially when chat gbd came out uh there was sort of like this uh conversation uh going on on Twitter or x that you know is this the death of data science uh well sort of like you know data scientist as a job function even exist and I think what I'm sort of like quite excited by these days is that I think um that maturity in conversation is coming back in the sense that I think now people are understanding that uh gen AI plus predictive AI is a thing it's not as if sort of like you know one strategy or one approach is going to completely up in at the end of the day you need to build machine Learning Systems uh these machine Learning Systems may be a combination of many different models which might be built using a variety of different ways so even if you're building a recommendation system your recommendations could be coming from let's say a deep learning model but somebody still needs to convince the end user that those recommendations are indeed worth their time and uh that sort of like you know compelling uh narrative uh could be derived from an llm and uh the album art could again be 
Hi Savin, great to have you on the show.

Yeah, thanks Richie, thanks for having me.

Excellent. So today we're talking about data science in production, but what does it actually mean to be in production?
It's in the eyes of the beholder, at the end of the day. Production is a spectrum: it can mean very different things to different individuals in the same organization, and depending on the maturity of your project, your definition of production could be quite different. Back at Netflix, the recommendation system is perhaps one of the most famous machine learning systems out there, and I think everybody would agree that shipping a model to production there means being able to run an A/B test with that model against live user traffic. That's a good definition of production. But there were a lot of other data-informed decisions the organization was making, and I'm pretty sure it's the same across many other companies, where the output of a model informs a business strategy. In that sense, even an offline analysis of a model, where you end up generating a Google Doc or writing a memo for consumption by, say, the business team, can go a long way. I'm pretty sure a lot of companies in the consumer space work this way. Spotify, for example, I believe recently raised their prices or is about to, and a lot of statistical analysis, and very likely machine learning, would have gone into figuring out what pricing strategy or pricing changes across geographies would work well. That's yet another example of putting models into production: no microservices are involved, no high-scale, low-latency inferencing is involved, but it's as critical to the business as any other machine learning use case.

Okay, so it's really not just about putting some data science into a consumer-facing app; it could just be something like changing prices. I like that. So can you talk me through the steps that need to happen to go from "this is an internal project" to "this is going to be something customer-facing, or otherwise in production"?

At the end of the day, the first question is always: is this model even worthy of being in production? There's a popular statistic, I don't even know if it's accurate, that says some obscenely high number of models never make their way into production. There may well be some truth to it, but the most important thing to understand is that in data science there is an expectation that many of your models will never make their way into production. You run some tests, you figure out that this model isn't really worthy of promoting to production, or of any further development on top of it, and that's okay; that's the nature of data science. What matters most is whether you're able to move from one model to the next version of the same model quickly and effectively. So that's basically my viewpoint on how to promote a model and on what production actually means. Now, to your original question about the different stages involved: if you think about the lifecycle of a machine learning project, it starts out with data.
You have a hypothesis, you look at the data, you train a bunch of models, and then you figure out in what capacity you want to consume those models. If you're in a consumer setting where you can run decently powered A/B tests, that's quite often a good next step. But that may not always be feasible; there are times when you just cannot run an A/B test, you're not in a situation to do that. Then you can be a little more creative and run quasi-experiments. And when even that isn't feasible, we usually see people invest much more heavily in the explainability of their outcomes. For example, one of the more popular machine learning applications at Netflix is a suite of models called Crystal Ball, which allows Netflix to predict the dollar valuation of any piece of content so that they can be much more disciplined in how they construct their content portfolio. Creating a counterfactual in that scenario is going to be very tricky, and running an A/B test can be quite difficult as well. In those scenarios you rely on the explainability of your model's outcomes, and at the end of the day you rely on your business judgment. That's yet another mechanism for figuring out whether a model is worthy of being deployed or not.

Okay, so really a lot of it is about making sure that you have something that's going to add value before you put it in a place where customers are going to see it. Before we get into the details of how you go about doing all this, I'd like to hear some success stories. Have you come across any cases where companies invested in putting their data code into production and then saw some sort of benefit?

Netflix has always been the poster child for machine learning adoption. This was maybe a decade ago, when they first spoke publicly about the impact of the recommendation system on their bottom line, and even at that point there was a paper at one of the more prominent machine learning conferences where we basically came to the conclusion that the Netflix recommendation system was helpful to the tune of more than a billion dollars to the company's top line. That's really impactful. And if we move forward a decade, where the biggest impact has been made is the long tail of machine learning applications. If you're new to machine learning, you end up focusing on your tentpole use cases: for Netflix it would be the recommendation systems, for Facebook or Google it would be their advertising or search systems. That's great, and that's where you'll derive the largest value, but as you mature, there are a lot of internal business processes and a lot of other consumer interfaces that can be made much better with machine learning, and the aggregate effect of that usually ends up surpassing the tentpole use case. That's something we've seen quite often, and not only at Netflix.
We have a popular open-source project called Metaflow, which helps organizations build machine learning models and deploy them to production, so we've been able to see the impact across other companies too. There are companies like 23andMe, who use machine learning and data science to predict traits and outcomes from genetic data, and that has been quite remarkable in terms of enabling first-rate medical research. There are companies in a similar space like Medtronic, focused on surgical AI, and that's yet another area: if you just look at drug discovery or automated surgery, the amount of value that can be generated, and the actual real-world impact on people's lives, is really great. Another example I came across during the pandemic was a company called Zipline. They build automated drones that are responsible for the bulk of drug supplies, including COVID vaccines, in parts of sub-Saharan Africa, and their drones are powered by machine learning algorithms as well. At the end of the day, the big tech sector and its interesting consumer applications always get the limelight, but when you look at the real-world impact that is actually shaping people's day-to-day lives, that's been a lot more rewarding and enriching to watch, even from the sidelines.

Absolutely, and those are some very different examples. With the Netflix example, a better movie recommendation is a small benefit to a lot of people, because it's scaled so much, whereas at the other extreme, an AI surgeon is helping one person a lot. Very different use cases. All right, let's get into a bit of depth on some of the steps involved before you deploy your code. What sort of tests do you need to perform on your model so that you know it's ready to go live?

It depends, as with anything else. You can imagine the bar: nowadays, if the Netflix recommender starts serving crappy recommendations, it becomes a social meme, so the amount of due diligence somebody needs to push out a new recommendation model at a popular service like Netflix or Spotify is considerably higher. But nobody is really going to lose their life; it's not a life-altering event if you push out a bad ML model there. Whereas if you're looking at the healthcare space or the fintech space, any kind of bias or any kind of issue with the model can have life-altering outcomes, so the bar goes up significantly. Fortunately or unfortunately, that varies from industry to industry, and with the application you're actually using the model for, so there isn't a one-size-fits-all answer. But the most important thing, always, is that as a data scientist or data professional you have a full understanding, a full view, of how exactly the model is actually going to be used, and that you're able to control and observe those outcomes and really iterate from there.
We've seen the opposite as one of the more primary failure mechanisms: if putting a model into production, or doing data science at all, involves many, many individuals in an organization on one single project, then many of these important concerns become nobody's problem, because there's no single individual who has complete control over the end-to-end process. That's been one of the biggest things we've been advocating for: how do you essentially make your data scientist a full-stack data scientist, or an end-to-end data scientist, so that they have full control over the model, the kind of data these models are built on, when these models actually get rolled out, and what the impact of these models is on real-world outcomes, not just on statistical measures.

Interesting. So you're saying one of the biggest problems in terms of quality control is the fact that there are lots of different people doing different stages of the workflow, so to eliminate that, you have one person responsible end to end, and this would be a full-stack data scientist. Do you want to tell me more about this role and what it would involve?

Yeah. If we hearken back to the 70s or the 80s, and this might even be true today in many places, shipping software of any kind was really expensive. You had your development team composed of software engineers, then you had a testing team composed of QA engineers, then a different group of people focused on shipping software, the release managers, and then you had your database admins and application architects and SREs in the mix as well. That raised the cost of doing anything in software engineering. Of course, there are plenty of areas where having specialized roles makes a lot of sense, and definitely when you get to a certain level of scale it's a lot more important to have multiple pairs of eyes. But if you have five or seven of these different roles involved in every single project, then doing even the simplest thing becomes a lot more expensive and a lot more time-consuming. One of the big promises of the modern DevOps movement, and of the advent of the cloud, is that you can have one single software engineer who runs the entire gamut. There isn't an expectation that overnight they become significantly skilled at building frontends as well as backends and become an expert in CI/CD, but the tooling has evolved and matured to the point where you can expect a small software engineering team to return outsize returns. And we're basically hoping for the same thing on the data science side of the house: how do you ensure that machine learning isn't reserved only for the tentpole use cases, and that a single data scientist, or a small team of data scientists, can deliver outsized business impact by running the entire lifecycle of a data science project?
Imagine everything from the beginning: how do you figure out what data the organization has, is that data available in the right format, does it have the right quality, can I access it in an interactive manner so that I can play around with it and really understand what possibilities that data has encoded? Then, at the same time: do I bring sufficient business skills to the table, do I understand the business context my organization operates in, so that I can marry my data science skills with that business perspective and start figuring out the best ways to optimize this specific business problem? What we have seen historically is that data scientists come from quantitative disciplines that may not be software engineering related, and that's where one of the first bottlenecks presents itself. If you're playing around with a lot of data, or if you're in an organization where everything needs to happen within a specific governance boundary, either your data warehouse or a specific cloud, then how do you interface with that cloud, and with all of the engineering and business architecture complexity? That's one of the big areas where people see declines in productivity, and that then necessitates specialized roles coming in, and you often end up in a situation where things fall through the cracks: certain important questions, like "how do we know the right thing is happening?", unfortunately become nobody's problem, because there's no single individual with a complete end-to-end perspective. That was one of the big reasons we ended up creating Metaflow back at Netflix, an open-source machine learning platform geared towards ensuring that a data scientist can become this full-stack, end-to-end data scientist and control more of their own destiny. In theory, that should allow them to iterate on their machine learning models and data science projects a lot more quickly. In that scenario, you're also able to gradually move the interface between data science teams and other software engineering teams. If you look back a decade, when I started my career, the data scientist role used to be limited to prototyping on a laptop, and then there would be teams of engineers who would take that prototype, scale it out, deploy it, and off you go. The problem was that as you scale out your model training on a lot of data, many of the statistical guarantees the data scientist was relying on may no longer hold true, the software engineering team doesn't really understand the intention behind anything the data scientist was trying to do, the data scientist has no idea what actually got shipped, and it was anybody's guess whether the right thing was even happening in production. Now we're getting to a point where a data scientist should be able to expose an API to their work. Maybe that API is a model, maybe it's an actual REST endpoint you can call into, maybe it's a memo written for consumption by other teams, but at least it gives a data science team a lot more control and a lot more
visibility into shipping their work.

Lots to cover there, and we'll certainly get into Metaflow later. But it sounds like, with this idea of the full-stack data scientist, they need data science skills, software engineering skills, and business skills. That seems quite difficult to hire for, and I'm wondering whether there's a tradeoff between having a larger number of more specialized individuals versus having this generalist who can do everything. How would a data team be composed, then? I presume you wouldn't want all full-stack data scientists, or all specialists; you'd want to mix the two. How does it work?

I think the answer is better tooling, at the end of the day. If the expectation is that we can find somebody who is really great at engineering, really great at data science, and really great at business sense, that's the proverbial unicorn. You may be able to find a few, but that's definitely not a strategy that scales. Now, two out of three is something desirable: finding somebody who is equally good at data science and at understanding the business intimately is, in many ways, the minimum bar for hiring an exceptional data scientist. Paired with great tooling, you can ensure that the level of abstraction they work at takes care of at least the accidental complexities of the cloud for them, and then you can expect that data scientist to take care of the business complexities and the data science complexities. That's basically where I see many data science teams moving. I think that's also one of the big reasons why many companies have invested in internal ML platform teams, whose entire prerogative is to provide these sets of tools internally as a point of leverage, so that every single data scientist is insulated from the hairier parts of engineering. The interesting dynamic there is: how do you build tools that navigate the fact that there is some amount of complexity a data scientist wants to take care of themselves, and some amount they want the tool to take care of? How you thread that balance is always an interesting question.

Definitely, and I want to talk about tools in a bit, but just to press you on this idea of teams: if someone says, okay, I'm a regular data scientist, I want to become a full-stack data scientist, what needs to happen to take that extra step?

The question there is: if you're not this full-stack data scientist, then what kind of data scientist are you? My colleague here at Outerbounds has a favorite term for it: the laptop data scientist, somebody who is very well adept with, say, the Pythonic ecosystem, or with everything available on the machine learning side of the house, and is able to understand the characteristics of the data and get
their work done on a laptop. So there's that aspect, and then on the other side of the house, you need to figure out, as an organization, how you actually ship value through data science. In between there's a gulf of complexity that you need to cross. One option is that you become equipped to handle that gulf all by yourself. A clear-cut example: say you want to train many different machine learning models using GPUs, and you're constantly iterating on different hypotheses. That GPU farm is very likely not going to be your laptop; it's going to be something in the cloud. Let's assume it's Kubernetes, the most popular compute orchestrator out there. On one hand, I can expect my data scientist to understand the intricacies of the Kubernetes ecosystem: how to run those Kubernetes pods reliably, manage them, monitor them, and, when things go wrong, know his or her way around debugging those failures. But it's a very complicated landscape, and unfortunately, if it were only Kubernetes that people had to worry about, life would still be easy. You also have to worry about your data, which needs to come from somewhere and go somewhere else; couple that with Kubernetes, and how do you think about that? You're constantly experimenting, so reproducibility can be a big problem: if my colleague runs into a failure and I'm supposed to help them out, but I cannot reproduce or replicate that same failure, what are my odds of being able to help them at all? The complexity multiplies very quickly. There are now multiple tools in this space that help in this specific area, so making yourself well apprised of the latest and greatest tooling is one step. As practicing software engineers, it falls on us to really understand where the world is headed and what the new engineering paradigms are, and I think most data scientists understand the same thing: to stay relevant, they need to equip themselves, if not with the details of every single thing out there, then at least with an understanding of the layers of abstraction available on top of these building blocks that can help them get their work done.

Okay, so really, making sure you're on top of the latest tools is going to stand you in good stead for improving your skills. All right, let's talk about Metaflow, since you're the creator of it. To begin with, can you tell me what Metaflow does?

Metaflow is an open-source ML platform. To put it succinctly, it helps you train and deploy ML models, and it's targeted towards building ML systems. In this conversation we've spoken a lot about data science and machine learning per se, but at the end of the day an organization is trying to build a system, and a machine learning model is only one part of that system. These systems can often be complex: they may cross team boundaries, and they may interface with significant engineering infrastructure. Ensuring that a data scientist, or a team of data scientists, is capable of doing all of that is basically what Metaflow strives to do.
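To make that concrete, here is a minimal sketch of what a Metaflow workflow looks like: a flow is a Python class whose `@step` methods form a DAG, and anything assigned to `self` is versioned and persisted as an artifact. The step names and the toy "training" are illustrative, not from the conversation.

```python
# Minimal Metaflow flow sketch; run with: python training_flow.py run
from metaflow import FlowSpec, step


class TrainingFlow(FlowSpec):

    @step
    def start(self):
        # Load or generate data. Artifacts assigned to self are
        # automatically versioned and persisted by Metaflow.
        self.data = [(x, 2 * x + 1) for x in range(100)]
        self.next(self.train)

    @step
    def train(self):
        # Stand-in for real model training: fit a slope by least squares.
        xs, ys = zip(*self.data)
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        self.slope = sum((x - mean_x) * (y - mean_y) for x, y in self.data) \
            / sum((x - mean_x) ** 2 for x in xs)
        self.next(self.end)

    @step
    def end(self):
        print(f"fitted slope: {self.slope:.2f}")


if __name__ == "__main__":
    TrainingFlow()
```

Every run of the flow is recorded, so past experiments, including failed ones, remain inspectable, which is the "discipline baked in" that comes up repeatedly in this conversation.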
Okay, so it helps you take those steps towards getting your code into production. Now, I know there are dozens and dozens of MLOps tools, so can you talk me through how Metaflow fits into this larger ecosystem?

I can walk you through what prompted us to start the project in the first place. Many of these tools have since taken on a life of their own and cater to different markets or different use cases. What Metaflow is targeted towards is the practicing data scientist. It is not a low-code or no-code solution; it's a solution targeted towards a data scientist who understands Python or R really well and who brings that data science understanding to the table. We are not in the business of teaching people how to do data science; Metaflow is a tool that enables people to do data science well. We started working on Metaflow back in 2017, so it's now close to seven years. We were at a spot where Netflix was looking into investing in machine learning across the entire lifecycle of its business: not only how to do recommendations well, but how to construct a portfolio of content that drives the subscriber base higher, how to figure out what the best available content is, how to leverage economies of scale in licensing or producing that content, how to take those bits and stream them to people's TVs and mobile devices so they have an amazing streaming experience, how to fight fraud, how to take care of pricing challenges. If you start thinking about all the places where a company like Netflix could invest in machine learning, it's in many ways like being a kid in a candy store. The other interesting aspect is that while Netflix is usually lumped into the cohort of FAANG companies, with the connotation of FAANG scale, Netflix is actually much closer to your average company on the public cloud; it just has a whole bunch of different interesting problems it's trying to solve, which adds to the complexity. We were getting to a spot where the solutions we had built for our recommendation systems had served us really well, predominantly built on top of the JVM stack that was popular in the early-to-mid 2010s, but the number of people who were excellent at engineering, at data science, and at business were very few. If you have to start investing in many different areas of data science, you have to pick and choose your battles. We said: of course we can't skimp on hiring the very best data science talent available, but then somebody has to come in and pave over that gulf of engineering complexity.
The goal for us became: how do we realize this dream of making our data scientists full-stack data scientists? How do we provide them solutions where they can get all of their work done on their laptop, but we bring the cloud to their laptop? Imagine being able to provide hundreds of thousands of CPU cores, hundreds of thousands of GPUs, and terabytes or petabytes of RAM to your laptop, so that you don't have to become a cloud engineer to scale out your machine learning projects. And how do we ensure people can take a graduated approach? Not every project will be a humongous-scale machine learning project, but every machine learning project goes much better with some amount of discipline baked in. There have been plenty of times when we've run into the classic issues: something works on my laptop but not on my colleague's laptop; I'm able to install a version of PyTorch today but not the same version two days later; a transitive dependency has changed and something is subtly off, and I'm trying to figure out what went wrong where. So there's a barrage of small problems, as well as some rather nefarious ones, particularly on the compute and data side, that you have to worry about as a data scientist. Our goal was to ensure they don't have to worry about any of it, and can squarely focus their efforts and energy on wrangling the data science complexity, which is where their expertise lies. If Netflix as an organization expects them to spend much of their time wrangling nefarious issues like "how do I move data from my data warehouse to my GPU instance so my GPU cycles aren't wasted," that's not what a data scientist should be focused on very early in a project, when you don't even know yet whether your modeling approach is worth it. At the same time, if you take the aggregate view, hundreds of data scientists all running GPUs suboptimally adds up to a non-trivial expense that, as a platform engineering team, I do have to care about. But if you can codify all the best practices and provide a user experience that is human-centric, one that works with the data scientist rather than one they have to fight against, then by default all the right things happen on the engineering side of the house, the data scientist's freedom of choice in navigating the web of data science complexity is preserved, and the organization benefits both from cost optimization, because ML can become expensive if you're not careful, and from being able to innovate actively and quickly.
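As a hedged sketch of what "bringing the cloud to the laptop" and pinning dependencies can look like in Metaflow: `@resources`, `@kubernetes`, and `@conda` are real Metaflow decorators, but the resource numbers, the Python and library versions, and the conda package name for PyTorch shown here are illustrative assumptions, not taken from the conversation.

```python
# Sketch: the same step definition runs locally or on a cluster.
from metaflow import FlowSpec, step, resources, kubernetes, conda


class GpuTrainingFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # Pin the environment so "install PyTorch" behaving differently next
    # month cannot silently change this run (versions are illustrative).
    @conda(python="3.10", libraries={"pytorch": "2.1.0"})
    # Ship the step to Kubernetes with declared needs, instead of asking
    # the data scientist to hand-manage pods.
    @kubernetes
    @resources(cpu=8, memory=32000, gpu=1)
    @step
    def train(self):
        import torch  # resolved from the pinned environment
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.next(self.end)

    @step
    def end(self):
        print(f"trained on: {self.device}")


if __name__ == "__main__":
    GpuTrainingFlow()
```

The code itself stays laptop-shaped; opting into the cloud can also be done at run time (for example, `python gpu_training_flow.py run --with kubernetes`, with `--environment=conda` to activate the pinned dependencies), which is the graduated approach described above.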
One thing you mentioned is the idea of working on a laptop. There's been a huge trend over the last decade or more of everything going to the cloud, so the idea of working on a laptop sometimes, but also having access to large-scale compute in the cloud, indicates some kind of hybrid computing. Is that the approach you push for, or is there a reversal of the in-cloud trend?

There are many benefits to being in the cloud. For example, if all your data is in the cloud, you don't want any of that data to ever leave the cloud; that's one big reason, purely from a security standpoint, why everything would happen there. But your laptop can still be the interface to that cloud. From that point of view, you might be accessing all your resources through the laptop, while the code you write actually runs entirely in the cloud; the data may never show up on your laptop, and everything happens through your IDE or your browser. That's one universe. In the other universe, the problem you're trying to solve may not require steep computational resources or managing a lot of data. The data could fit on your laptop and may not be super sensitive, and you could use something like scikit-learn or many other popular frameworks to build your machine learning models. The number of things my MacBook can do these days is beyond imagination. But what you still need at that point is discipline: you still want to figure out how you're going to catalog your experiments, and what the best mechanism is for ensuring reproducibility, so that you understand how your models are behaving and can course-correct easily if you need to. And at times, one definition of productionizing your model can be this: whatever work you've done on your laptop, at the end of the day you're going to shut the laptop and go home, but maybe you want that model training process to run every night, or every week, or whenever new data shows up. How do you push that into the cloud? It could be your on-prem infrastructure as well, but basically: how do you take something running on your laptop and reliably run it elsewhere? That can be one definition of productionizing a machine learning model in a variety of projects, and it can be a big activity into which a lot of organizations sink a lot of resources: the data scientist was able to prototype something on their laptop, but just the process of converting that work into something that can run outside the laptop can be measured in months or quarters. For us, that was another goal: can we take all the work a data scientist is doing, whether in the cloud or on the laptop, and have it available in a format such that it can be run anywhere else almost immediately?
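As a hedged illustration of that "close the laptop, keep the training running" step: `@schedule` is a real flow-level Metaflow decorator, though the flow below and its steps are illustrative stand-ins.

```python
# Sketch: a flow that, once deployed to an orchestrator, runs nightly
# without the laptop. Deploy with a one-off command such as
#   python nightly_training_flow.py argo-workflows create
# (or `step-functions create` on AWS).
from metaflow import FlowSpec, step, schedule


@schedule(daily=True)  # run every night once deployed
class NightlyTrainingFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @step
    def train(self):
        self.model_version = "v1"  # stand-in for real training
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    NightlyTrainingFlow()
```

Downstream consumers can then read the freshest artifacts through the client API, for example `Flow("NightlyTrainingFlow").latest_successful_run.data.model_version`, rather than depending on whatever happened to be on someone's laptop.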
Then you don't have to go back to your manager and ask for another month so you can come back, address all your pain points, and refactor your code until it's worthy of production. Can we just flip the script, so that the infrastructure lets you do all the right things from the outset?

Yeah, certainly anything involving package management and environments, and just having things not work from one machine to another, can be incredibly frustrating if you have to deal with it manually. So reproducibility is one important aspect of getting things into production; another seems to be scalability. Once you go into production, your model may be accessed by millions of people. How do you make sure your code is going to be scalable?

In many cases, especially on the consumer side of the house, it may not even be feasible to understand what kind of scalability requirements you're gunning for in the first place. In many cases it is possible, especially if the model is a subsequent version of a previous model, but on a net-new project, particularly on the consumer side with virality loops involved, the kind of scale you may run into can be quite unpredictable. At the end of the day it boils down to project management and software engineering skills. Let's think about recommendation systems, because that's an area I know very well. Consider users' tastes and preferences: if you're on Spotify or Netflix, there isn't a lot of brand-new content arriving very quickly, your tastes aren't changing very quickly either, and you already know what your entire user base looks like and what their preferences are. So you can precompute those recommendations and then just serve them from a database. You're not doing any kind of live model inferencing, and that has amazing scalability benefits. It's a very simple, straightforward approach, though of course it may not work for every use case.
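Here is a minimal sketch of that precompute-and-serve pattern. The `score_model` function and the in-memory table are illustrative stand-ins; in practice the store would be something like Redis or DynamoDB, and the batch job would run on the training-side infrastructure described earlier.

```python
# Sketch: run the expensive model offline, serve cheap lookups online.
from typing import Dict, List


def score_model(user_id: str, item_id: str) -> float:
    # Stand-in for real model inference.
    return (hash((user_id, item_id)) % 1000) / 1000.0


def precompute_recommendations(users: List[str], items: List[str],
                               top_n: int = 10) -> Dict[str, List[str]]:
    """Batch job: score every user offline and keep only the top N."""
    table = {}
    for user in users:
        ranked = sorted(items, key=lambda i: score_model(user, i),
                        reverse=True)
        table[user] = ranked[:top_n]
    return table


# Serving path: no model in the loop, just a key-value lookup.
RECS = precompute_recommendations(["u1", "u2"],
                                  [f"item{i}" for i in range(50)])


def serve(user_id: str) -> List[str]:
    return RECS.get(user_id, [])  # unknown users get a default
```

The design choice is that all the latency and GPU cost is paid offline, on a schedule you control, while the online path scales like any ordinary database read.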
So my recommendation is: if you already understand the scalability metrics you're trying to achieve, there's always an architecture that's possible; the amount of expenditure you're willing to incur on the project is also a big input to that. But don't overthink it, and don't prematurely optimize. There are plenty of hacks and approaches people can take to at least buy more time before you really understand what scalability benchmarks you need to be going for next. There's a similar scalability hurdle on the model training side, and in many ways it's a silent killer for organizations. The deployed model is generating business value, so you make sure it works well, and a lot of light is shone on those use cases. Then you have models being trained for direct consumption in production, and people make sure all sorts of alerts and observability are instrumented so those models are actually generated on time with reliable results. But the third bucket, teams of data scientists actively experimenting, can be a big cost vector as well. What we've seen many times is that the overall cloud cost of experimentation can be orders of magnitude higher than the cost a deployed model incurs, because you may be experimenting with hundreds of models but only have a few, or maybe tens of, models deployed in production. And the unfortunate reality is that with experimental models it's very hard for a data scientist to have complete awareness of how to optimize that model training from a cost-efficiency standpoint, or of what kind of engineering effort scaling it out would require. One interesting example comes to mind. I was working with a data scientist who, and you can debate whether this was a good idea or not, was trying to predict what content would be popular in any given geography at any given moment, so that the content distribution network, the CDN infrastructure, could be seeded with the right content. Imagine you're a company like Spotify and you know certain kinds of music will be popular in certain geographies at certain hours, or you're Netflix releasing a brand-new show with significant marketing in, say, Australia: you want to make sure people have an amazing experience streaming that content and don't run into rebuffering, wherever in the world they are. This data scientist decided to build 60,000 models in one shot, each requiring a container with a GPU attached. That is an immense amount of compute to run, and ahead of time you don't even know what the ROI of the entire effort will be. For a lone data scientist, running that much compute without being a professional cloud engineer can be a significant engineering challenge all by itself, and even for a professional cloud engineer it's often a bridge too far. With Metaflow, they were able to run that compute very seamlessly.
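For a sense of how a fan-out like that is expressed, here is a hedged sketch using Metaflow's `foreach` construct, which is a real feature; the geography list, the GPU request, and the training stub are illustrative, and the real job would fan out over tens of thousands of branches rather than four.

```python
# Sketch: one training task per geography, each in its own container.
from metaflow import FlowSpec, step, resources


class PerGeoTrainingFlow(FlowSpec):

    @step
    def start(self):
        # One branch per geography; Metaflow runs each branch as its
        # own task (its own container when deployed to the cloud).
        self.geos = ["US", "IN", "BR", "AU"]  # imagine 60,000 of these
        self.next(self.train, foreach="geos")

    @resources(gpu=1)  # each branch asks for its own GPU
    @step
    def train(self):
        self.geo = self.input                  # current foreach item
        self.model = f"model-for-{self.geo}"   # stand-in for training
        self.next(self.join)

    @step
    def join(self, inputs):
        # Collect the per-geography models back into one artifact.
        self.models = {inp.geo: inp.model for inp in inputs}
        self.next(self.end)

    @step
    def end(self):
        print(f"trained {len(self.models)} models")


if __name__ == "__main__":
    PerGeoTrainingFlow()
```

The point is that the fan-out is three lines of workflow code; the orchestration, retries, and per-task resource requests are the platform's problem, not the data scientist's.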
And at the end of the day, they also got a clear bill: here's how much money you've spent. Of course, when you start spending that much money, some eyebrows are raised, and people wanted to understand whether the spend was worth it. You can imagine that in this case the amount of capital the company saved through CDN optimization was well worth the expense of running 60,000 GPUs fully engaged for multiple days. Many times it may not be, and having that perspective matters: you may want to scale, but whether that scale is actually linked to your business outcomes can be a lot more important.

Yeah, I can see how you'd certainly want to speak to some other people before you fire up 60,000 GPUs and run them for a few days; probably best to get that business alignment first. The big theme I'm getting is: don't do calculations you don't need to do, and make sure you have metrics around how performant things need to be. So, we've talked a little about reproducibility and scalability; the other aspect of things in production seems to be robustness, because as soon as things reach users, they're going to give you stupid inputs and things will behave in weird ways. Do you have any tips for how to make your data programs and models more robust?

We have to think about robustness through every layer of the ML infrastructure stack. There's the question of whether your model actually encapsulates the behavior it's trying to predict: have you taken into account things like seasonality, and so on? I put all of those concerns on the data science side of the house; those are things the data scientist needs to worry about. As a tool builder, what I personally love to focus on is whether the underlying infrastructure is robust enough, because many times, if your infrastructure gives way and you're unable to, say, generate a fresher version of the model, then your model performance is going to take a hit, and that will directly hit your business KPIs as well. The question then becomes: how do you get to a point where your infrastructure is robust? More importantly, in the age of the cloud you can't promise 100% uptime for any piece of infrastructure. You can increase your robustness rates, but as you pointed out, dependency management can be a big issue: today you're able to install PyTorch, tomorrow you may not be able to install the same version in the exact same way, and what do you do at that point? The big question becomes: how do you quickly recover from these errors, from these failure mechanisms? Say you have a training pipeline to train a model, but that pipeline depends on yet another upstream pipeline that generates some embeddings. If that embeddings pipeline fails for whatever reason, your downstream model training pipeline cannot execute, and there may be other processes that depend on it as well. So it becomes imperative to figure out the quickest way to diagnose what went wrong with your embeddings pipeline, and how to recover from that failure so your subsequent pipelines can start executing again.
That is often one of the most underinvested areas in an organization: doing machine learning is so difficult, and involves so many moving pieces, that the focus goes to getting the happy path working, and not on what happens when the happy path isn't so happy, when failures happen, and how you recover from them. The complexity arises from the fact that so many different things can fail. A lot of people these days try to find cheaper GPU resources, so they go to a cloud provider that may not be one of the more prominent hyperscalers, and then unfortunately discover that the machine they're buying was advertised to have four GPUs attached but only has three working GPU drivers, and that's why certain things are slow or failing; then they have to figure out how to recover from that. Or your data changed for whatever reason, and now in the middle of the night you have to wake up and step through your work; even replicating what failed can be really tricky. Then you figure out that a specific change in the data was the cause of the trouble, and you either patch your pipeline or wake the person responsible for the upstream data pipeline so they can fix the error. That becomes one of the bigger themes for an organization: how do you recover from these errors, how do you lower your MTTR, your mean time to recovery? That was one area we focused on quite a lot, and I think it pairs with the notion of reproducibility as well. Yes, you want to reproduce the good behavior of a model so you can trust it, but you also need to be able to reproduce the failure patterns somewhat reliably. There's a lot of stochasticity involved in failures, but at least for a certain class of failures, if you can reproduce them reliably, you stand a shot at fixing them quickly.
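As a hedged sketch of codifying that recovery behavior, Metaflow's real `@retry` and `@catch` decorators can absorb transient failures and record persistent ones; the embeddings and training stubs below are illustrative.

```python
# Sketch: retries for flaky infrastructure, @catch for diagnosable state.
from metaflow import FlowSpec, step, retry, catch


class RobustTrainingFlow(FlowSpec):

    # Transient failures (a flaky node, a bad GPU driver) are retried
    # automatically before anyone gets paged.
    @retry(times=3)
    @step
    def start(self):
        self.embeddings = [0.1, 0.2, 0.3]  # stand-in for upstream job
        self.next(self.train)

    # If training still fails after retries, @catch stores the exception
    # as an artifact instead of killing the whole run.
    @catch(var="train_error")
    @retry(times=2)
    @step
    def train(self):
        self.model = sum(self.embeddings)  # stand-in for training
        self.next(self.end)

    @step
    def end(self):
        if getattr(self, "train_error", None):
            print(f"training failed; error kept for diagnosis: "
                  f"{self.train_error}")


if __name__ == "__main__":
    RobustTrainingFlow()
```

Because every run's artifacts are persisted, a failed run can also be replayed from the failed step (Metaflow's `resume` command) rather than re-executed from scratch, which is exactly the MTTR lever described above.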
Okay, so I know being on call is a standard feature of being a software engineer, but waking up at 3:00 a.m. to try and fix some data pipeline, or even worse, having to wake your colleagues up as well to ask them to help debug at 3:00 a.m., seems like something I wouldn't want to do on a regular basis. Can you talk me through what sort of processes you should put in place to make sure you don't have regular failures? How can you improve that reliability?

One simple strategy: if you have a pipeline that runs, say, every week, and it's super critical, then you may want to run that pipeline every day but only consume the output every week. If it fails in between, it's not something you need to wake up in the middle of the night for; you have an entire business day, an entire week, or half a week on average, to fix the issue before it becomes a burning one. Playing with that frequency arbitrage on the training side is almost always useful. Of course, it comes with extra cost, but many times that cost is well worth it, because if it's not super urgent, why wake up in the middle of the night when you can address it during business hours? On the model inferencing side, many techniques exist from the software engineering standpoint, but the one constant is: always have sensible defaults. If your machine learning system is down for whatever reason, you can always fall back on heuristics or certain rules, so that the end-customer experience isn't impacted. That's one of the more common failure mechanisms I unfortunately see: because a machine learning system is down, there's an immediate customer impact, or something is broken. Many times it's unavoidable, but systems can definitely be designed in such a way that the user may get a subpar experience, while the critical functionality is not entirely impaired.

Okay, so you want to fail gracefully: cut off the one bit that's not working and have everything else keep working. Nice.
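Here is a minimal sketch of that sensible-defaults pattern. The model call and the popularity list are illustrative stand-ins; the point is simply that the serving path never propagates a model outage to the customer.

```python
# Sketch: degrade to a rule-based answer when the model service is down.
POPULAR_ITEMS = ["item1", "item2", "item3"]  # e.g., refreshed hourly


def recommend_with_model(user_id: str) -> list[str]:
    # Stand-in for a real model call; simulate an outage here.
    raise TimeoutError("model service unavailable")


def recommend(user_id: str) -> list[str]:
    try:
        return recommend_with_model(user_id)
    except Exception:
        # Subpar but functional: the customer still sees something
        # sensible while the ML system is being repaired.
        return POPULAR_ITEMS


print(recommend("u42"))  # -> ['item1', 'item2', 'item3']
```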
Okay, so before we wrap up: what are you most excited about in the world of MLOps at the moment?

A couple of years ago, especially when ChatGPT came out, there was this conversation going on on Twitter, or X: is this the death of data science? Will data scientist even exist as a job function? What I'm quite excited by these days is that maturity is coming back to that conversation, in the sense that people now understand that gen AI plus predictive AI is a thing; it's not as if one strategy or one approach is going to win out completely. At the end of the day you need to build machine learning systems, and those systems may be a combination of many different models built in a variety of different ways. Even if you're building a recommendation system, your recommendations could come from, say, a deep learning model, but somebody still needs to convince the end user that those recommendations are worth their time, and that compelling narrative could be derived from an LLM. The album art could come from a generative image model that convinces you that this particular piece of content, this particular song, is definitely worth your attention. So I think people are really warming up to the idea that it's multiple ML models, all working in cohesion, that drive a strong consumer experience, or that enable an organization to optimize its internal business processes. That's always exciting from a machine learning systems point of view: how do you tackle this increased diversity and increased complexity in any system?

Okay, so really complex systems of lots and lots of different models all working in harmony. That maybe sounds like not step one if you're starting to put things in production, but a very good end goal to work towards. All right, super. Do you have any final advice for organizations wanting to start getting their machine learning into production?

Machine learning, at the end of the day, is not a silver bullet; it's experimentation, and you may end up investing a lot in data science and not really see any results for a long period of time. What really matters is ensuring that you have a plan before you start investing in ML: the right expectations, the right time horizon, and, more importantly, the right support structure around your data scientists. That could be in the form of people who understand infrastructure as well as data science really well and are able to work well with one another, or, in the absence of that, investing in great tooling from the onset, so that your data scientists are able to experiment much more effectively. The one surefire way to fail at machine learning is by not being able to experiment enough. If your data scientist is only able to ship one version of a model per week, per month, or per quarter, whatever that time horizon is, that may just not be enough. If you can ensure that their iteration loops are measured in minutes, or maybe hours, that's a good way of ensuring that the quality of your machine learning models continues to go up, eventually to the point where a model beats your predefined rules or heuristics, and that's when you start reaping the benefits of your investment in machine learning.

Okay, that sounds like great advice. Thank you very much for your time, Savin.

Yeah, thank you. Thanks for having me with you.