#5 Machine Learning Engineering for Production (MLOps) Specialization [Course 1, Week 1, Lesson 5]

Deploying a Machine Learning Model: Considerations and Best Practices

Many speech recognition systems run in the cloud, because cloud compute resources allow for more accurate recognition. Others run at the edge, such as the speech systems in many cars and mobile speech recognition that works even without an internet connection. Whether a prediction service runs in the cloud, at the edge, or in a web browser is therefore a decision to make deliberately.

Visual inspection systems in factories are almost always run at the edge, because the connection between the factory and the rest of the internet may occasionally go down, and the factory cannot afford to stop whenever it does. Modern web browsers also offer increasingly capable tools for deploying learning algorithms directly in the browser.

When building a prediction service, it also helps to know how much compute (CPU or GPU) and memory will be available to it. A neural network trained on a very powerful GPU may be too expensive to serve on equally powerful hardware, in which case the model has to be compressed or its complexity reduced. Knowing these limits up front helps in choosing the right software architecture.
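The memory side of that check is simple arithmetic, and a minimal sketch is given here. The parameter count, the memory budget, and the idea of storing weights at lower precision are assumptions for illustration; the lecture only says the model may need to be compressed or simplified, not how. The helper names are hypothetical.

```python
# Rough estimate of the memory needed just to hold a model's weights.
# All numbers and helper names below are illustrative assumptions.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_params: int, dtype: str = "float32") -> float:
    """Approximate size of the weights in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


def fits_in_budget(num_params: int, dtype: str, budget_mb: float) -> bool:
    """True if the weights alone fit within the given memory budget."""
    return weight_memory_mb(num_params, dtype) <= budget_mb


if __name__ == "__main__":
    num_params = 100_000_000  # assumed model size
    budget_mb = 256.0         # assumed memory available on the target device
    for dtype in BYTES_PER_PARAM:
        size = weight_memory_mb(num_params, dtype)
        print(f"{dtype}: {size:.0f} MB, fits: {fits_in_budget(num_params, dtype, budget_mb)}")
```

This counts weights only. Activations, runtime overhead, and any accuracy lost through lower precision would need to be measured separately.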

Latency and throughput are critical, especially for real-time applications. In speech recognition, it is not uncommon to want an answer back to the user within 500 milliseconds, and of that budget perhaps only 300 milliseconds can be allocated to the speech recognition itself, which sets the latency requirement. Throughput is the number of queries per second (QPS) the system must handle given its compute resources, for example a certain number of cloud servers. A system that must handle 1,000 queries per second should be specced with enough compute to meet that QPS requirement.
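Both requirements can be checked with a few lines of arithmetic, sketched below. The 300 millisecond allocation and the 1,000 queries per second target come from the lecture. The sample latencies, the per-server capacity, the choice of a percentile as the latency statistic, and all function names are assumptions for illustration.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_budget(latencies_ms: Sequence[float], budget_ms: float, pct: int = 95) -> bool:
    """True if the chosen percentile of measured latencies is within the budget."""
    return percentile_nearest_rank(latencies_ms, pct) <= budget_ms


def servers_needed(target_qps: float, qps_per_server: float) -> int:
    """Smallest whole number of servers whose combined capacity covers the target QPS."""
    return math.ceil(target_qps / qps_per_server)


if __name__ == "__main__":
    measured_ms = [120, 180, 250, 290, 310]  # assumed latency samples
    print(meets_latency_budget(measured_ms, budget_ms=300, pct=95))
    print(servers_needed(target_qps=1000, qps_per_server=40))  # 40 QPS per server is assumed
```

Checking a high percentile rather than the mean is a design choice here, so that the budget covers the slow requests as well as the typical one.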

Logging is the next consideration. It may be useful to log as much of the data as possible, both for analysis and review and to provide more data for retraining the learning algorithm in the future.
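One common way to keep such logs easy to analyze is to write one JSON record per prediction. The sketch below assumes that format. The field names and the `log_prediction` helper are hypothetical and not taken from the lecture.

```python
import json
import sys
import time
from typing import Any, Dict, TextIO


def log_prediction(
    stream: TextIO,
    model_version: str,
    inputs: Dict[str, Any],
    prediction: Any,
    latency_ms: float,
) -> Dict[str, Any]:
    """Append one prediction as a single JSON line and return the record that was written."""
    record = {
        "timestamp": time.time(),        # seconds since the epoch
        "model_version": model_version,  # which model produced the prediction
        "inputs": inputs,                # must be JSON-serializable
        "prediction": prediction,
        "latency_ms": latency_ms,
    }
    stream.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    # Writes to standard output here; a real service would likely use a file or a log pipeline.
    log_prediction(sys.stdout, "v1", {"audio_seconds": 2.5}, "hello world", 212.0)
```

What goes into `inputs` depends on the application's privacy requirements. For sensitive data, logging a reference or a reduced set of fields may be more appropriate than logging the raw input.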

Security and privacy are paramount considerations when designing machine learning systems. The level of security and privacy required varies significantly depending on the application, with sensitive data such as electronic health records necessitating enhanced protection measures. Regulatory requirements can also play a significant role in shaping these considerations.

In summary, deploying a machine learning model requires two broad sets of tools: software that puts the system into production, and the means to monitor its performance and keep maintaining it, especially in the face of concept drift and data drift. The practices for a very first deployment are also quite different from those for updating or maintaining a system that is already deployed.
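One thing monitoring can watch for is a shift in the input distribution, which the lecture calls data drift. The sketch below compares the mean of one input feature in recent traffic against a reference sample, using the lecture's house-size example. The statistic, the threshold of 2.0, the sample values, and the function names are all assumptions for illustration. It is one simple heuristic, not the method the lecture prescribes.

```python
import statistics
from typing import Sequence


def mean_shift_score(reference: Sequence[float], recent: Sequence[float]) -> float:
    """Shift in the mean of `recent`, measured in standard deviations of `reference`."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)  # sample standard deviation, needs at least 2 points
    shift = abs(statistics.fmean(recent) - ref_mean)
    if ref_std == 0:
        # A constant reference: any shift at all is treated as unbounded.
        return 0.0 if shift == 0 else float("inf")
    return shift / ref_std


def input_drift_suspected(
    reference: Sequence[float], recent: Sequence[float], threshold: float = 2.0
) -> bool:
    """Flag possible data drift when the mean moves by more than `threshold` reference std devs."""
    return mean_shift_score(reference, recent) > threshold


if __name__ == "__main__":
    training_sizes = [1400, 1500, 1600, 1500, 1450, 1550]  # assumed house sizes at training time
    recent_sizes = [1900, 2000, 2100]                      # assumed sizes seen in production
    print(input_drift_suspected(training_sizes, recent_sizes))
```

A check like this only looks at the inputs x. Concept drift, where the mapping from x to y changes, would not show up here and generally requires fresh labels to detect.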

Some engineers view deploying a model as reaching the finish line. In practice, the first deployment may mean being only about halfway there: afterwards there is still a lot of work to feed data back, update the model, and keep maintaining it as the data changes.

Later videos touch on the differences between a first deployment, such as shipping speech recognition in a product that never had it, and maintaining or updating a learning algorithm that has already been running for some time.

Common Design Patterns in Machine Learning Deployment

In machine learning model deployment, several common design patterns are employed across various industries. Understanding these patterns can help developers select the most suitable approach for their application. The next video will delve into some of the most prevalent deployment patterns used in machine learning applications.

"One of the most exciting moments of any machine learning project is when you get to deploy your model. But what makes deployment hard? I think there are two major categories of challenges in deploying a machine learning model: first are the machine learning or statistical issues, and second are the software engineering issues. Let's step through both of these so you can understand what you need to do to make sure that you have a successful deployment of your system.

One of the challenges of a lot of deployments is concept drift and data drift. Loosely, this means: what if your data changes after your system has already been deployed? I had previously given an example from manufacturing, where you might have trained a learning algorithm to detect scratches on smartphones under one set of lighting conditions, and then maybe the lighting in the factory changes. That's one example of the distribution of data changing.

Let's walk through a second example using speech recognition. I've trained a few speech recognition systems, and when I built speech systems, quite often I would have some purchased data. This would be purchased or licensed data, which includes both the input x, the audio, as well as the transcript y that the speech system is supposed to output. In addition to data that you might purchase from a vendor, you might also have historical user data of users speaking to your application, together with transcripts of that real user data. Such user data, of course, should be collected with very clear user opt-in permission and clear safeguards for user privacy. After you've trained your speech recognition system on a data set like this, you might then evaluate it on a test set. But because speech data does change over time, when I build speech recognition systems, sometimes I will collect a dev set, or holdout validation set, as well as a test set comprising data from just the last few months, so you can test it on fairly recent data to make sure your system works even on relatively recent data.

Then, after you push the system to deployment, the question is: will the data change? Or after you've run it for a few weeks or a few months, has the data changed yet again? Because if the data has changed, such as the language changing, or maybe people using a brand new model of smartphone which has a different microphone so the audio sounds different, then the performance of your speech recognition system can degrade. It's important for you to recognize how the data has changed and whether you need to update your learning algorithm as a result.

When data changes, sometimes it is a gradual change, such as the English language, which does change but changes very slowly, with new vocabulary introduced at a relatively slow rate. And sometimes data changes very suddenly, where there's a sudden shock to a system. For example, when COVID-19, the pandemic, hit, a lot of credit card fraud systems started to not work because the purchase patterns of individuals suddenly changed. Many people who did relatively little online shopping suddenly started to do much more online shopping, and so the way that people were using credit cards changed very suddenly, and this actually tripped up a lot of anti-fraud systems. This very sudden shift in the data distribution meant that many machine learning teams were scrambling a little bit at the start of COVID to collect new data and retrain systems in order to make them adapt to this very new data distribution.

Sometimes the terminology of how to describe these data changes is not used completely consistently, but sometimes the term data drift is used to describe the input distribution x changing, such as if a new politician or celebrity suddenly becomes well known and is mentioned much more than before. The term concept drift refers to the desired mapping from x to y changing, such as if, before COVID-19, perhaps for a given user a lot of surprising online purchases should have flagged that account for fraud, but after the start of COVID-19, maybe those same purchases would not have really been any cause for alarm in terms of flagging that the credit card may have been stolen.

Another example of concept drift: let's say that x is the size of a house and y is the price of a house, because you're trying to estimate housing prices. If, because of inflation or changes in the market, houses become more expensive over time, then the same size house will end up with a higher price, and so that would be concept drift. Maybe the sizes of houses haven't changed, but the price of a given house changes. Whereas data drift would be if, say, people start building larger houses or start building smaller houses, and thus the input distribution of the sizes of houses actually changes over time. So when you deploy a machine learning system, one of the most important tasks will often be to make sure you can detect and manage any changes, including both concept drift, which is when the definition of what is y given x changes, as well as data drift, which is when the distribution of x changes even if the mapping from x to y does not change.

In addition to managing these changes to the data, a second set of issues that you will have to manage to deploy a system successfully are software engineering issues. When you are implementing a prediction service whose job it is to take queries x and output predictions y, you have a lot of design choices as to how to implement this piece of software. Here's a checklist of questions that might help you with making the appropriate decisions for managing the software engineering issues.

One decision you have to make for your application is: do you need real-time predictions, or are batch predictions okay? For example, if you are building a speech recognition system where the user speaks and you need to get a response back in half a second, then clearly you need real-time predictions. In contrast, I have also built systems for hospitals that take patient records, take electronic health records, and run an overnight batch process to see if there's something associated with the patients that we can spot. In that type of system, it was fine if we just ran it on a batch of patient records once per night. So whether you need to write real-time software that can respond within hundreds of milliseconds, or whether you can write software that just does a lot of computation overnight, will affect how you implement your software. By the way, later this week you also get to step through an optional programming exercise where you get to implement a real-time prediction service on your own computer. You'll see that in the optional exercise at the end of this week.

The second question you need to ask is: does your prediction service run in the cloud, or does it run at the edge, or maybe even in a web browser? Today there are many speech recognition systems that run in the cloud, because having the compute resources of the cloud allows for more accurate speech recognition. But there are also some speech systems, for example a lot of speech systems within cars, that actually run at the edge, and there are also some mobile speech recognition systems that work even if your Wi-Fi is turned off. Those would be examples of speech systems that run at the edge. When I'm deploying visual inspection systems in factories, I pretty much almost always run that at the edge as well, because sometimes, unavoidably, the internet connection between the factory and the rest of the internet may go down, and you just can't afford to shut down the factory whenever this internet connection goes down, which happens very rarely but maybe sometimes does happen. And with the rise of modern web browsers, there are better and better tools for deploying learning algorithms right there within a web browser as well.

When building a prediction service, it is also useful to take into account how much compute resources you have. There have been quite a few times where I trained a neural network on a very powerful GPU, only to realize that I couldn't afford an equally powerful set of GPUs for deployment, and wound up having to do something else to compress or reduce the model complexity. So if you know how much CPU or GPU resources, and maybe also how much memory resources, you have for your prediction service, then that could help you choose the right software architecture.

Depending on your application, especially if it's a real-time application, latency and throughput, say measured in terms of QPS, queries per second, will be other software engineering metrics you may need to hit. In speech recognition, it's not uncommon to want to get an answer back to the user within half a second, or 500 milliseconds, and of this 500 millisecond budget, you may be able to allocate only, say, 300 milliseconds to your speech recognition, and so that gives a latency requirement for your system. Throughput refers to how many queries per second you need to handle given your compute resources, maybe given a certain number of cloud servers. So, for example, if you're building a system that needs to handle a thousand queries per second, it would be useful to make sure to spec out your system so that you have enough compute resources to hit the QPS requirement.

Next is logging. When building your system, it may be useful to log as much of the data as possible for analysis and review, as well as to provide more data for retraining your learning algorithm in the future.

Finally, security and privacy. I find that for different applications, the required levels of security and privacy can be very different. For example, when I was working on electronic health records, patient records, clearly the requirements for security and privacy were very high, because patient records are very highly sensitive information. Depending on your application, you might want to design in the appropriate level of security and privacy based on how sensitive that data is, and also sometimes based on regulatory requirements. So if you save this checklist somewhere, going through it when you're designing your software might help you to make the appropriate software engineering choices when implementing your prediction service.

So to summarize, deploying a system requires two broad sets of tools: there is writing the software to enable you to deploy the system in production, and there is what you need to do to monitor the system performance and to continue to maintain it, especially in the face of concept drift as well as data drift. One of the things you see when you're building machine learning systems is that the practices for the very first deployment will be quite different compared to when you are updating or maintaining a system that has already previously been deployed. I know that there are some engineers who view deploying the machine learning model as getting to the finish line. Unfortunately, I think the first deployment means you may be only about halfway there, and the second half of your work is just starting only after your first deployment, because even after you've deployed, there's a lot of work to feed the data back, and maybe to update the model, to keep on maintaining the model even in the face of changes to the data. One of the things we'll touch on in later videos is some of the differences between the first deployment, such as if your product never had a speech recognition system but you've trained the speech recognition system and you're deploying it for the first time, versus when you have already had the learning algorithm running for some time and you want to maintain or update that implementation.

To summarize, in this video you saw some of the machine learning or statistics related issues, such as concept drift and data drift, as well as some of the software engineering related issues, such as whether you need a batch or real-time prediction service and what compute and memory requirements you have to take into account. Now it turns out that
when you're deploying a machine learning model, there are a number of common design patterns, or common deployment patterns, that are used in many applications across many different industries. In the next video, you'll see some of the most common deployment patterns so that you can hopefully pick the right one for your application. Let's go on to the next video."