Production Inference Deployment with PyTorch

Installing TorchServe and the Torch Model Archiver on Mac

Since I'm installing on a Mac, I can skip specifying a CUDA version. Now, with the dependencies installed, I can either install from source or use pip or conda. I'm actually installing two programs: TorchServe and the Torch Model Archiver, which we'll get to in a minute.

If you're installing with conda, don't forget to specify the pytorch channel with -c pytorch.

Every TorchServe environment needs a model store directory. All the models served by TorchServe are stored in this folder. You can name it anything you like, but I'm going to keep it simple.

Next, we'll need a model to serve. TorchServe expects models to be packaged in a model archive, which contains your model's code and weights along with any other files needed to support your model. For example, in a natural language application you might have embeddings or vocabularies that you need to package with your model. A model archive is created by the model archiver, which was the package I installed alongside TorchServe above.

First, we'll need to download some trained model weights. Next, let's create a model archive from these weights. Taking the arguments one at a time: every model has a name, here densenet161. A model needs a version number; here we just went with 1.0. We're using a Python-based model, so we use the --model-file flag to bring in the Python file containing the model class. The --serialized-file argument specifies the file containing the model weights. If we were loading a TorchScript model, we'd skip the model file argument and just specify the serialized TorchScript file. Here we also bring in an extra support file, a JSON file containing mappings of the model's trained category IDs to human-readable labels.

Finally, every model archive needs a handler to transform and prepare incoming data for inference. I'm going to use the built-in image classifier handler, but it's also possible to write your own handler and specify that file here. Now you can see we have a .mar file. This is our model archive. It belongs in the model store, so let's put it there.
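The archiving step described above can be sketched in Python as a small command builder. This is a minimal sketch and not the exact command from the video: the flag names are those of the torch-model-archiver CLI, while every file path below (the weights file, model.py, index_to_name.json, the model_store folder) is an assumed placeholder. The --export-path flag is used here as one way of placing the archive in the model store, instead of moving the file afterwards.

```python
import shlex


def archiver_command(name, version, model_file, weights, extra_files, handler, store):
    """Return the torch-model-archiver argument list for a Python-based model."""
    return [
        "torch-model-archiver",
        "--model-name", name,
        "--version", version,
        "--model-file", model_file,      # Python file with the model class
        "--serialized-file", weights,    # trained weights
        "--extra-files", extra_files,    # e.g. category-id to label mapping
        "--handler", handler,            # built-in or custom handler
        "--export-path", store,          # write the .mar into the model store
    ]


# All paths below are hypothetical placeholders for illustration.
command = archiver_command(
    name="densenet161",
    version="1.0",
    model_file="model.py",
    weights="densenet161.pth",
    extra_files="index_to_name.json",
    handler="image_classifier",
    store="model_store",
)

# Print the equivalent shell command. To execute it, pass `command` to
# subprocess.run(command, check=True) once the tool and files are in place.
print(shlex.join(command))
```

Building the argument list first keeps each flag next to the value it describes, which makes it easier to compare against the walkthrough before anything is executed.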

Now let's start TorchServe. We'll do so with four arguments. The --start flag should be self-explanatory. By default, TorchServe stores its current configuration and loads its last config on startup; the --ncs flag suppresses this behavior. The --model-store flag lets us specify our model store folder, and optionally we can tell TorchServe to start with a model loaded. We'll specify our new model archive for densenet161.
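As a rough illustration of the four start-up arguments, the sketch below assembles the torchserve command line in Python. The folder name model_store and the archive name densenet161.mar are assumptions for the example; the helper function itself is hypothetical and only builds the argument list.

```python
import shlex


def start_command(model_store, models=(), use_snapshots=False):
    """Return the argument list that starts TorchServe."""
    command = ["torchserve", "--start"]
    if not use_snapshots:
        command.append("--ncs")  # do not save or restore config snapshots
    command += ["--model-store", model_store]
    if models:
        command += ["--models", *models]  # archives to load at startup
    return command


command = start_command("model_store", models=["densenet161.mar"])

# Print the equivalent shell command; run it with subprocess.run(command)
# on a machine where TorchServe is installed.
print(shlex.join(command))
```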

TorchServe Running

TorchServe puts out a lot of helpful information, all of which is also saved in log files. Let's have a look at the logs folder now. Note that a log directory has been created alongside our model store, and here you can see we have logs for all of TorchServe's behavior and metrics.

Now that TorchServe is running, let's do some inference. We'll grab a sample image from the source repo for our input, and then we'll use curl to send it to TorchServe. The default image classifier handler takes care of unpacking the image, converting it to a tensor, feeding it to the model, and processing the output. This shows a simple case of using the TorchServe inference API over HTTP, but you can also access it via gRPC or use the KFServing API used by Kubeflow on Kubernetes. The response lists the top five classes identified by the model.
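The walkthrough uses curl for the inference request; an equivalent request can be sketched with Python's standard library. This assumes a TorchServe instance on localhost with the default inference port 8080 and a registered model named densenet161. The image path passed to classify is a placeholder, and the Content-Type value is a choice made for this sketch so that the body is sent as raw bytes.

```python
import json
import urllib.request

INFERENCE_URL = "http://127.0.0.1:8080"  # assumed local server, default inference port


def build_prediction_request(model_name, image_bytes):
    """Build a POST to /predictions/<model_name> carrying the raw image bytes."""
    return urllib.request.Request(
        url=f"{INFERENCE_URL}/predictions/{model_name}",
        data=image_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def classify(model_name, image_path):
    """Send an image file to a running server and return the decoded JSON reply."""
    with open(image_path, "rb") as image_file:
        request = build_prediction_request(model_name, image_file.read())
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the request does not contact the server; classify() does.
request = build_prediction_request("densenet161", b"placeholder image bytes")
print(request.get_method(), request.full_url)
```

With a server running, a call such as classify("densenet161", "kitten.jpg") would return the handler's class scores; the file name here is only an example.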

Accessing the Management API

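If we want to learn about the status of the server, or manage which models we're serving or how many worker processes are devoted to each model, we can use the management API. Above, we used the prediction API on its default port of 8080; the default for the management API is port 8081. Let's use curl to see how the server reports what models it's serving. The models endpoint enumerates the models being served, which right now is just our densenet model. Let's get a little more detail on it. Here you can see it specifies things about this particular model, including how many workers are spun up.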
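The two management queries just described (list all models, then describe one) can be sketched as follows. The host and the model name densenet161 are assumptions for a local setup; the helper names are hypothetical, and get_json only works against a running server.

```python
import json
import urllib.request

MANAGEMENT_URL = "http://127.0.0.1:8081"  # assumed local server, default management port


def models_url(model_name=None):
    """URL that lists all models, or describes one model when a name is given."""
    if model_name is None:
        return f"{MANAGEMENT_URL}/models"
    return f"{MANAGEMENT_URL}/models/{model_name}"


def get_json(url):
    """GET a management endpoint and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# These are the URLs a curl call would target.
print(models_url())
print(models_url("densenet161"))
```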

We can be more specific if we have more than one version of the model by adding the version number to the URL. This shows the default configuration for a served model, with 12 workers running. You can also use the management API to alter that configuration, so let's change the number of workers. I set both the min and max workers to four, and now if I ask for the status of our model again, we should see that the number of workers has changed.
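A minimal sketch of the versioned status query and the worker change, assuming the same local management endpoint and the densenet161 model at version 1.0. The min_worker and max_worker query parameters are the ones the management API uses for scaling; the function names are made up for this example.

```python
import urllib.parse
import urllib.request

MANAGEMENT_URL = "http://127.0.0.1:8081"  # assumed local server, default management port


def describe_version_url(model_name, version):
    """URL describing one specific version of a served model."""
    return f"{MANAGEMENT_URL}/models/{model_name}/{version}"


def scale_workers_request(model_name, min_worker, max_worker):
    """Build the PUT request that sets the worker range for a model."""
    query = urllib.parse.urlencode({"min_worker": min_worker, "max_worker": max_worker})
    return urllib.request.Request(
        url=f"{MANAGEMENT_URL}/models/{model_name}?{query}",
        method="PUT",
    )


request = scale_workers_request("densenet161", 4, 4)
print(request.get_method(), request.full_url)
print(describe_version_url("densenet161", "1.0"))
```

Sending the request with urllib.request.urlopen(request) and then fetching the model's status again should show the new worker count, as in the walkthrough.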

The Management API

The management API lets you register new models from a local model archive or from a URL. It also lets you unregister models, set the default version of a model to serve, and get the status of a model or models that you're serving. Finally, we can stop TorchServe with the --stop flag.

The TorchServe GitHub repo also has walkthroughs and examples for many common tasks, including specific server management tasks, setting up HTTPS, writing a custom handler, and more. And as always, everything I've described here and more is documented fully in the documentation and tutorials at pytorch.org.
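The management operations listed above map onto a handful of HTTP requests, sketched here with the standard library. The endpoint shapes follow the TorchServe management API as I understand it; the host, the model name, the version, and the archive name are placeholders, and the helper names are hypothetical.

```python
import urllib.parse
import urllib.request

MANAGEMENT_URL = "http://127.0.0.1:8081"  # assumed local server, default management port


def register_request(archive, initial_workers=1):
    """POST /models?url=...: register an archive from the model store or a URL."""
    query = urllib.parse.urlencode({"url": archive, "initial_workers": initial_workers})
    return urllib.request.Request(f"{MANAGEMENT_URL}/models?{query}", method="POST")


def unregister_request(model_name, version):
    """DELETE /models/<name>/<version>: stop serving one version of a model."""
    return urllib.request.Request(
        f"{MANAGEMENT_URL}/models/{model_name}/{version}", method="DELETE"
    )


def set_default_request(model_name, version):
    """PUT /models/<name>/<version>/set-default: choose the default version."""
    return urllib.request.Request(
        f"{MANAGEMENT_URL}/models/{model_name}/{version}/set-default", method="PUT"
    )


# Stopping the server is a CLI call, not an HTTP request.
STOP_COMMAND = ["torchserve", "--stop"]

for request in (
    register_request("densenet161.mar"),
    set_default_request("densenet161", "1.0"),
    unregister_request("densenet161", "1.0"),
):
    print(request.get_method(), request.full_url)
```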

"Welcome to the next video in the PyTorch training series. This video will talk about deploying your PyTorch model for inference in production. In particular, this video will talk about putting your PyTorch model in evaluation mode, converting your model to TorchScript and performing inference using TorchScript with C++, and deploying your model with TorchServe, which is PyTorch's model serving solution.

No matter which deployment method you use, the first thing you should always do is put your model into evaluation mode. Evaluation mode is the opposite of training mode: it turns off training-related behaviors that you don't want during inference time. In particular, it turns off autograd. You may recall from the earlier video on autograd that PyTorch tensors, including your model's learning weights, track their computation history to aid the rapid computation of backward gradients for learning. This can be expensive in terms of both memory and compute, and is not something you want at inference time. Eval mode also changes the behavior of certain modules that contain training-specific functionality. In particular, dropout layers are only active during training time; setting your model in eval mode makes dropout a no-op. Batch norm layers track running stats on their computed mean and variance during training, but this behavior is turned off for eval mode.

Next, let's look at the procedure for putting your model in evaluation mode. First, you'll want to load your model. For a Python-based model, that will involve loading the model's state dictionary from disk and initializing your object with it. Then you call the eval method on your model, and you're done: your model has now turned off training-related behaviors for inference. It's worth noting that the eval method is actually just an alias for calling the train method with an argument of False. You may find this useful if your code already contains a flag that indicates whether you're doing training
or inference. Once you're in eval mode, you can start sending your model batches of data for inference. In the rest of this video, we're going to talk about different methods for deploying your model for inference, but for all of them, making sure your model is in evaluation mode is your first step.

So what is TorchScript? It's a statically typed subset of Python for representing PyTorch models, and it's meant to be consumed by the JIT, the PyTorch just-in-time compiler, which performs runtime optimizations to improve the performance of your model. It also allows you to save your model and weights in a single file and load them as a script module object that you can call just as you would your original model.

So how do you use TorchScript? Build, test, and train your model in Python as you normally would. When you want to export your model for production inference, you can use the torch.jit.trace or torch.jit.script calls to convert your model to TorchScript. After that, you can call the .save method on your TorchScript model to save it to a single file that contains both the computation graph and the learning weights for your model. The just-in-time compiler executes your TorchScript model, performing runtime optimizations such as operator fusion and batching matrix multiplications. You can also write your own custom extensions to TorchScript in C++. The code on the right shows what TorchScript looks like, but in the general case you won't have to edit it yourself; it's generated from your Python code.

Let's walk through the process of using TorchScript in more detail. The process starts with the model you've built in Python and trained to the point of readiness for deployment. The next step is to convert your model to TorchScript. There are two ways to do this: torch.jit.script and torch.jit.trace. It's important to note the differences between the two techniques for converting your model to TorchScript. torch.jit.script converts your model by directly inspecting your code and running
it through the TorchScript compiler. It preserves control flow, which you'll need if your forward function has conditionals or loops, and it accommodates common Python data structures. However, due to limitations of Python operator support in the TorchScript compiler, some models won't be convertible using torch.jit.script. torch.jit.trace takes a sample input and traces it through your computation graph to generate the TorchScript version of your model. This doesn't suffer the operator coverage limitations of torch.jit.script, but because it only traces a single path through your code, it won't respect conditionals or other control flow structures that might cause variable or non-deterministic runtime behavior. It's also possible to mix tracing and scripting when converting a model; see the documentation for the torch.jit module for notes on mixing the two techniques. It's worth looking at the docs to see the optional arguments for script and trace: there are extra options for checking the consistency and tolerances of your TorchScript model.

Now we'll save our TorchScript model. This saves both your computation graph and your learning weights in a single file, which means you don't have to ship the Python file with your model's class definition when you want to deploy to production. When it's time to do inference, you call torch.jit.load on your model and feed it batches of input in the same way you would the Python version of your model.

Everything I've shown you up to now has involved manipulating your model in Python code, even after you've converted it to TorchScript. There are situations and environments, though, where you may need high-throughput or real-time inference and would like to do without the overhead of the Python interpreter. It may also be the case that your production environment is already centered around C++ code and you'd like to continue using C++ as much as possible. You may recall from an earlier video in this series that the important tensor
computations in PyTorch happen in LibTorch, a compiled and optimized C++ library. PyTorch also has a C++ front end to this library. This means that you can load your TorchScript model in C++ and run it with no Python runtime dependencies. The first thing you'll need to do is to go to pytorch.org and download the latest version of LibTorch. Unzip the package and place it where your make system can find it. This slide shows a minimal CMake file for a project using LibTorch. Note that you'll need to be using C++14 or higher to make use of LibTorch.

In Python, you'd import torch, use torch.jit.load to bring your TorchScript model into memory, and then call your model with an input batch. The process is not so different in C++. First, include torch/script.h; this is your one-stop include for working with TorchScript in C++. Next, declare a TorchScript module variable, then use torch::jit::load to load it into memory. To get predictions from your model, call its forward method with an appropriate input. Here we've created a dummy input with torch::ones; you'd be bringing in your own inputs of whatever size your model requires. Once you have your output predictions as a tensor, you can manipulate them with the C++ equivalents of the tensor methods you're used to in PyTorch's Python front end. The pytorch.org tutorial section includes content walking you through setting up a C++ project, as well as multiple tutorials demonstrating aspects of the C++ front end.

Setting up a production model serving environment can be complex, especially if you're serving multiple models, working with multiple versions of models, require scalability, or want detailed logging or metrics. TorchServe is the PyTorch model serving solution that covers all these needs and more. TorchServe loads instances of your model or models in individual process spaces and distributes incoming requests to them, and has a number of features to make it useful for creating ML-based web
services. It has data handlers covering common use cases, including image classification, segmentation, object detection, and text classification. It allows you to set version identifiers for models, and you can manage and simultaneously serve multiple versions of a model. It can optionally batch input requests from multiple sources, which can sometimes improve throughput. It features robust logging and the ability to log your own metrics, and it has separate RESTful APIs for inference and model management, which may be secured with HTTPS.

I'll wrap up this video by walking through setting up and running TorchServe with one of the examples available at github.com/pytorch/serve in the examples folder. We'll set up a pre-trained image classification model for inference. First, let's install TorchServe. I'll demonstrate the process for setting it up on a Linux or Mac system, but TorchServe also works on Windows if that's your preferred server environment. First, I'm going to create a new conda environment for TorchServe. I'm going to clone the source repository because it has convenient scripts for correctly installing TorchServe dependencies. When you run the dependency install script on a machine with NVIDIA GPUs, you may need to specify what version of the CUDA drivers you have installed; details are in the install procedure described in TorchServe's README on GitHub. Since I'm installing on a Mac, I can skip that. Now, with the dependencies installed, I can either install from source or use pip or conda. I'm actually installing two programs, TorchServe and the Torch Model Archiver, which we'll get to in a minute. If you're installing with conda, don't forget to specify the pytorch channel with -c pytorch.

Every TorchServe environment needs a model store directory. All your models served by TorchServe are stored in this folder. You can name it anything you like, but I'm going to keep that simple. Next, we'll need a model to serve. TorchServe expects models to be packaged in a
model archive, which contains your model's code and weights along with any other files needed to support your model. For example, in a natural language application, you might have embeddings or vocabularies that you need to package with your model. A model archive is created by the model archiver, which was the package I installed alongside TorchServe above.

First, we'll need to download some trained model weights. Next, let's create a model archive from these weights. Taking these arguments one at a time: every model has a name, here densenet161. A model needs a version number; here we just went with 1.0. We're going to be using a Python-based model, so we use the model file flag to bring in the Python file containing the model class. The serialized file argument specifies the file containing the model weights. If we were loading a TorchScript model, we'd skip the model file argument and just specify the serialized TorchScript file. Here we're going to bring in an extra support file, a JSON file containing mappings of the model's trained category IDs to human-readable labels. Finally, every model archive needs a handler to transform and prepare incoming data for inference. I'm going to use the built-in image classifier handler, but it's also possible to write your own handler and specify that file here. Now you can see we have a .mar file. This is our model archive. It belongs in the model store, so let's put it there.

Now let's start TorchServe. We'll do so with four arguments. The start flag should be self-explanatory. By default, TorchServe stores its current configuration and loads its last config on startup, and the --ncs flag suppresses this behavior. The model store flag lets us specify our model store folder, and optionally we can tell TorchServe to start with the model loaded; we'll specify our new model archive for densenet161.
TorchServe puts out a lot of helpful information, all of which is also saved in log files. Let's have a look at the logs folder now. Note that a log directory has been created alongside our model store, and here you can see we have logs for all of TorchServe's behavior and metrics.

Now that TorchServe is running, let's do some inference. We'll grab a sample image from the source repo for our input, and then we'll use curl to send it to TorchServe. The default image classifier handler takes care of unpacking the image, converting it to a tensor, feeding it to the model, and processing the output. This shows a simple case of using the TorchServe inference API over HTTP, but you can also access it via gRPC or use the KFServing API used by Kubeflow on Kubernetes. And here we have the top five classes identified by the model.

If we want to learn about the status of the server, or manage which models we're serving or how many worker processes are devoted to each model, we can use the management API. Above, we used the prediction API on its default port of 8080. The default for the management API is port 8081.
Let's use this curl command to see how the server reports what models it's serving. The models endpoint enumerates models being served, which right now is just our densenet model. Let's get a little more detail on it, and here you can see it specifies things about this particular model, including how many workers are spun up, etc. We can be more specific if we have more than one version of the model by adding the version number to the URL. This shows the default configuration for a served model, with 12 workers running. You can also use the management API to alter that configuration, so let's change the number of workers. I set both the min and max workers to four, and now if I ask for the status of our model again, we should see the number of workers has changed.

The management API lets you register new models from a local model archive or from a URL. It lets you unregister models, set the default version of a model to serve, or get the status of a model or models that you're serving. Finally, we can stop TorchServe with the stop flag. The TorchServe GitHub repo also has walkthroughs and examples for many common tasks, including specific server management tasks, setting up HTTPS, writing a custom handler, and more. And as always, everything I've described here and more is documented fully in the documentation and tutorials at pytorch.org."
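The first half of the transcript (evaluation mode, then scripting versus tracing, then save and load) can be illustrated with a small sketch. The Gate model is a toy invented for this example, and an in-memory buffer stands in for a saved model file. One caveat worth stating: in current PyTorch, eval() changes module behavior such as dropout but does not by itself stop gradient tracking, so the sketch wraps inference in torch.no_grad().

```python
import io

import torch
from torch import nn


class Gate(nn.Module):
    """Toy model (hypothetical) whose forward pass branches on its input."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.dropout(self.linear(x))
        # Data-dependent control flow: scripting keeps it, tracing does not.
        if x.sum() > 0:
            return out
        return -out


torch.manual_seed(0)
model = Gate()
model.eval()  # equivalent to model.train(False); dropout becomes a no-op

x = torch.ones(1, 4)
with torch.no_grad():  # gradient tracking is disabled here, not by eval()
    eager_out = model(x)

# Scripting compiles the Python source, so both branches survive.
scripted = torch.jit.script(model)

# Tracing records only the path taken for this sample input
# (PyTorch emits a TracerWarning about the data-dependent branch).
traced = torch.jit.trace(model, x)

# Save to and reload from a single artifact; a file path works the same way.
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)
loaded = torch.jit.load(buffer)

with torch.no_grad():
    loaded_out = loaded(x)
    scripted_matches = torch.allclose(loaded(-x), model(-x))
    traced_matches = torch.allclose(traced(-x), model(-x))

print("scripted follows the other branch:", scripted_matches)
print("traced follows the other branch:", traced_matches)
```

The negated input exercises the branch that the sample input did not take, which is where a traced module and a scripted module are expected to diverge.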