Why Do People Say It’s So Hard To Deploy A ML Model To Production?

Jul 8, 2022 • Written By Tim Liu

You’ve heard it said, you’ve heard it written, and you’ve heard it sung from the rooftops by analysts and vendors alike: 87% of data science projects never make it into production.

Why is that?

Well, of course, it always depends, but the slightly more precise answer is that building ML services is more complex than building other types of software. Different technologies, a lack of best practices, and the cross-functional nature of the work all contribute to the challenges of deploying ML models.

In this article, I dig into some of the most common challenges that ML teams are faced with while shipping a trained model as a production-level prediction service, and offer some tips and recommendations. But first, it’s important to distinguish how ML projects differ from traditional software engineering and even data engineering projects.

ML: Not Your Average Engineering Deployment

I’ve spent a lot of my career building data applications, and I’ve found that they share many of the same complexities as ML. However, with ML, you have one additional artifact adding another shifting dimension: the model itself. 

Here’s how I’ve started to think of it:

In traditional software engineering, your main concern is the code.

In data engineering, your main concerns are the code + the data.

In ML engineering, your main concerns are the code + the data + the model.

It’s not just me, either. Software thought leader Martin Fowler explains in a recent article why ML services are so complex. Hint: it’s because of the shifting nature of these three components.

Let’s dig in.

The code, for the most part, is a fairly known quantity and has a deterministic quality about it. In other words, given the same input, it will produce the same output. Even if your code is stateful (not recommended), I’d argue that the “state” is part of the input.

The data is deterministic in a sense, but having spent years working with data, I can tell you it’s difficult to know what kinds of data you will receive and at what volumes. You will always be surprised. For this reason, I would say the data is the most variable piece of the puzzle.

I think of the model as a deterministic function, albeit the most black-box of the three. Imagine the worst coder in the world writing spaghetti code that even the most experienced coder couldn’t figure out. A trained model is analogous to this code in that it is composed of complex vectors which by themselves are very difficult to understand, but given a particular input, the output should always be the same for a given version of the model.

Challenge 1: Reproducing The Model In Production

After a model has been trained, the next step is moving it to a centralized location where it can be used to build or update a prediction service. Models are trained in a variety of environments, from Jupyter notebooks to distributed experiment systems, and the methods for saving them can vary depending on the library that you’re using.

Model Persistence: Saving Your Model In The Right Format

ML frameworks have different methods of saving a model to ensure the reproducibility of its predictions. Depending on how you save the model, you could be serializing the entire Python object or just saving the model’s weights. Either method can fail to reproduce the model, depending on the context of the training environment, and if the model is not saved correctly, it could produce erratic predictions in production.
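
For illustration, here is a minimal PyTorch sketch of the two approaches; the SimpleNet class and file names are hypothetical stand-ins for your own model and paths:

```python
import torch
import torch.nn as nn

# A hypothetical model class; your real architecture will differ.
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

model = SimpleNet()

# Option 1: save only the weights (state_dict). This is what the PyTorch docs
# recommend, but the model class definition must be available at load time.
torch.save(model.state_dict(), "model_weights.pt")

reloaded = SimpleNet()
reloaded.load_state_dict(torch.load("model_weights.pt"))
reloaded.eval()

# Option 2: pickle the entire Python object. Convenient, but it is tied to the
# exact module paths and library versions of the training environment, so it
# can break when the serving environment differs.
torch.save(model, "model_full.pt")
reloaded_full = torch.load("model_full.pt")
reloaded_full.eval()
```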

Model Mobility: How To Move Your Model From Training To Production

Once the model has been saved, you’ll want to move it along with its associated files. Saving a model often generates multiple files of associated metadata that are required to load the model again.

A centralized location for model storage is required not only so that a developer can create the prediction service, but also so that the CI/CD process is efficient: when the service is deployed to production, the deployment pipeline can download the model and package it into the production deployment.

Model Versioning: Avoid Losing A High-Performing Model

Persisting various versions of the model is extremely important in order to roll back bad updates or debug models that are already running. You may even want to run multiple model versions in production (known as A/B testing) in order to gauge which model performs best. More sophisticated deployments can run model version experiments like “multi-armed bandit” that score the models in real time and direct more traffic to the best one.
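
As a rough sketch of what a shared, versioned model store can look like, the snippet below uses boto3 to upload and download model files under versioned keys. The bucket name, model name, and key scheme are hypothetical, and a dedicated model registry would handle this bookkeeping for you:

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-model-store"          # hypothetical bucket name
MODEL_NAME = "churn-classifier"    # hypothetical model name


def upload_model(local_path: str, version: str) -> str:
    """Upload a saved model file under a versioned key so older versions
    remain available for rollback or A/B testing."""
    key = f"{MODEL_NAME}/{version}/model_weights.pt"
    s3.upload_file(local_path, BUCKET, key)
    return key


def download_model(version: str, local_path: str) -> None:
    """Fetch a specific model version, e.g. inside a CI/CD deployment job."""
    key = f"{MODEL_NAME}/{version}/model_weights.pt"
    s3.download_file(BUCKET, key, local_path)


# Example usage:
# upload_model("model_weights.pt", version="2022-07-08-1")
# download_model("2022-07-08-1", "model_weights.pt")
```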

Tips

• Do your research when saving your model and make sure you follow the recommended approach. For example, PyTorch has documentation that explains the various methods of saving a model depending on your context.

• Object stores like S3 or Google Cloud Storage can be an elegant solution for a shared model repository that Data Scientists and developers can use to exchange models.

• Try to find a standard that you can use to coordinate model training with the model service. Tools like MLFlow, BentoML, TFServing and Cortex help move the model into a model registry and ultimately to its deployment destination. For example, use MLFlow to manage your experimentation and training and BentoML for model serving and deployment; a minimal sketch of this handoff follows below.
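
Here is a minimal sketch of that handoff, assuming a scikit-learn model, MLflow for experiment tracking, and BentoML 1.x for the model store; the model name and metric are illustrative only:

```python
import mlflow
import mlflow.sklearn
import bentoml
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Track the training run with MLflow so experiments are reproducible.
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")

# Save the trained model into BentoML's local model store for serving.
# "iris_clf" is an illustrative model name.
bento_model = bentoml.sklearn.save_model("iris_clf", model)
print(f"Saved model: {bento_model.tag}")
```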

Challenge 2: Coding The Prediction Service

The prediction service is the glue that exposes an API, transforms the data to extract the appropriate features, runs inference with the model, and incorporates business logic. The code for the service is usually written by a trained software developer, but on smaller teams, the Data Scientist may be responsible for creating the service.

Data Transformation And Feature Extraction

The feature extraction code is commonly written by the Data Scientist, so developers will need to migrate the code from the training environment to the prediction service. Additionally, they may have to translate the input to the service (perhaps a REST request) into usable data. Finally, they’ll have to write the code that takes the output of the inference and translates it into something that the calling service can understand.
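
To make the shape of this glue code concrete, here is a hedged sketch of a prediction service using FastAPI and a hypothetical churn model persisted with joblib; the request fields, feature extraction, and business rule are all made up for illustration:

```python
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical persisted binary classifier


class PredictionRequest(BaseModel):
    # Raw fields from the caller, before feature extraction.
    age: int
    monthly_spend: float
    plan: str


class PredictionResponse(BaseModel):
    churn_probability: float
    should_offer_discount: bool


def extract_features(req: PredictionRequest) -> np.ndarray:
    """Mirror the feature engineering done in the training environment."""
    plan_is_premium = 1.0 if req.plan == "premium" else 0.0
    return np.array([[req.age, req.monthly_spend, plan_is_premium]])


@app.post("/predict", response_model=PredictionResponse)
def predict(req: PredictionRequest) -> PredictionResponse:
    features = extract_features(req)
    proba = float(model.predict_proba(features)[0][1])
    # Business logic layered on top of the raw model output.
    return PredictionResponse(
        churn_probability=proba,
        should_offer_discount=proba > 0.5,
    )
```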

Dependency Management

Dependencies in a prediction service are very important: if they are not precisely configured, not only could the code behave differently than expected, but the model itself could make different predictions than it did in the training environment. The right versions of the ML library, the runtime, and the dependencies all need to be used for the service to run correctly.
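
One lightweight safeguard, sketched below, is to record the versions used in the training environment and verify them when the service starts; the specific packages, versions, and Python release shown are purely illustrative:

```python
import importlib.metadata
import sys

# Versions recorded from the training environment (e.g. via "pip freeze").
# These specific pins are examples only.
EXPECTED_VERSIONS = {
    "scikit-learn": "1.1.1",
    "numpy": "1.22.4",
}
EXPECTED_PYTHON = (3, 9)


def check_environment() -> None:
    """Fail fast at startup if the serving environment drifts from training."""
    if sys.version_info[:2] != EXPECTED_PYTHON:
        raise RuntimeError(
            f"Expected Python {EXPECTED_PYTHON}, got {sys.version_info[:2]}"
        )
    for package, expected in EXPECTED_VERSIONS.items():
        installed = importlib.metadata.version(package)
        if installed != expected:
            raise RuntimeError(
                f"{package}: expected {expected}, found {installed}"
            )


check_environment()
```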

Environment Provisioning

A scalable prediction service needs the proper environment in order to run well; for example, particular ML libraries can only run on particular OS distributions. Docker is the standard for configuring a reproducible environment, but configuring Docker for an ML use case can be extremely challenging. Configured correctly, it gives you a clean artifact that you can deploy to a number of different services, as well as the ability to run the service locally so that you can debug it if there is an issue.

Tips

• Unless you absolutely need otherwise, create the model serving service in Python. Most Data Scientists use Python, and it makes the deployment process far more streamlined; in my experience, teams go from weeks or months to days when redeploying a new model. Some teams need extremely high performance and prefer to convert all the code to Rust/Go. Most do not. By the way, AWS SageMaker is written in Python.

• Make sure you use the same version of Python and the ML library that you used to train the model. You can use “pip freeze” to determine which versions you’re using.

• For GPU support, use NVIDIA’s Docker images; they provide a good base for GPU workloads out of the box. Believe me, you do not want to end up in “CUDA hell”.

• Make sure the Docker image (or associated runtime) exposes health, documentation and monitoring endpoints for your service (see the sketch after this list).

• Select the deployment environment that makes the most sense for your team. Choose something you’re already familiar with and already deploy other applications into; this reduces the learning curve of maintaining the new pipeline.
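
As a small illustration of the endpoint tip above, here is a hedged FastAPI sketch that adds a health probe and a Prometheus-style metrics endpoint (FastAPI serves interactive documentation at /docs by default); the endpoint paths and metric name are just examples:

```python
from fastapi import FastAPI, Response
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = FastAPI()  # interactive API docs are served at /docs by default

# Illustrative counter; a real service would call PREDICTION_COUNTER.inc()
# inside its prediction endpoint.
PREDICTION_COUNTER = Counter("prediction_requests_total",
                             "Total prediction requests served")


@app.get("/healthz")
def healthz() -> dict:
    """Liveness/readiness probe for the orchestrator or load balancer."""
    return {"status": "ok"}


@app.get("/metrics")
def metrics() -> Response:
    """Expose Prometheus-format metrics for scraping."""
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```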

Challenge 3: Dealing With Ever-Changing Data In Production

Finally, once your service is in production, there is no telling how the data will change over time, whether that means suddenly receiving bad data or simply shifting data as users change their behavior. Change in the data is something that I’ve never been able to predict in my years building data applications, but when the data changes unexpectedly, you can still manage it with the strategies below.

Data Drift And Performance Monitoring

Changes in the data over time can result in a variety of issues, from the model making less effective predictions to the model not working at all. Data drift, or variation in production input data, can happen as time passes or seasons change. It’s important to know when it’s happening so that you can retrain your model if necessary.
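
As one simple illustration of a drift check, you can compare the distribution of a single feature in recent production traffic against the training data with a statistical test; the arrays and threshold below are placeholders:

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder arrays: one feature's values from the training data and from
# recent production requests.
training_feature = np.random.normal(loc=0.0, scale=1.0, size=5000)
production_feature = np.random.normal(loc=0.3, scale=1.1, size=1000)

# Kolmogorov-Smirnov test: a small p-value suggests the distributions differ.
statistic, p_value = ks_2samp(training_feature, production_feature)

if p_value < 0.01:  # illustrative threshold
    print(f"Possible data drift detected (KS statistic={statistic:.3f})")
else:
    print("No significant drift detected for this feature")
```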

Model performance can also change as time passes, but it can be much more difficult to monitor because it requires you to join predictions with downstream outcomes, for example connecting the dots between a product recommendation and the eventual sale of that product.

Retraining And Redeployment

Retraining a model is often the solution to both data drift and performance issues. Retraining involves not only making sure you have fresh data but also that your system is agile enough to ship the new model in an efficient and automated manner. Just as in traditional software deployments, the CI/CD pipeline needs to be integrated with your model repository so that it can ship new models quickly once you have trained one.

Validating New Models With Production Data

As data changes in production, you may not necessarily want to risk deploying a new model. For this reason, it can be helpful to have ways to test production data with models that you’re still validating. Especially in cases where performance is suffering, you want to make sure that a newly-trained model will actually help performance rather than make things worse.  
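
Here is a hedged sketch of that idea: score a candidate model offline on logged production inputs and compare it to the live model before promoting it. The file names, feature columns, and label are hypothetical:

```python
import joblib
import pandas as pd

# Hypothetical artifacts: the model currently serving traffic, a candidate
# model still being validated, and a log of recent production inputs/labels.
live_model = joblib.load("live_model.joblib")
candidate_model = joblib.load("candidate_model.joblib")
logged = pd.read_parquet("production_requests.parquet")

features = logged[["age", "monthly_spend", "plan_is_premium"]].to_numpy()
outcomes = logged["churned"].to_numpy()  # downstream labels, if available

live_accuracy = (live_model.predict(features) == outcomes).mean()
candidate_accuracy = (candidate_model.predict(features) == outcomes).mean()

print(f"Live model accuracy on recent traffic:      {live_accuracy:.3f}")
print(f"Candidate model accuracy on recent traffic: {candidate_accuracy:.3f}")

# Only consider promoting the candidate if it actually improves on the live model.
if candidate_accuracy > live_accuracy:
    print("Candidate looks promising; consider promoting or A/B testing it.")
```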

Tips

• Deploying often and at speed will give you more confidence in your pipeline. Not only that, it has added benefits, such as data drift being less problematic.

• Monitoring performance often includes joining with downstream metrics. Make sure you’re sending your inference results to the same place (likely a data warehouse) as the metrics you want to join with; a minimal sketch of logging predictions for that join follows after this list.

• ML-specific monitoring tools like WhyLabs, Evidently.ai or Aporia can help detect data drift and monitor performance.

• Using a service mesh like Istio or AWS App Mesh can help fan out production traffic and validate new models using shadow pipelines.
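
To illustrate the tip about joining inference results with downstream metrics, here is a minimal sketch that records each prediction with an identifier so it can later be joined to outcomes; the record layout and JSON-lines sink are stand-ins for whatever your warehouse ingestion looks like:

```python
import datetime
import json
import uuid


def log_prediction(features: dict, prediction: float,
                   sink_path: str = "predictions.jsonl") -> str:
    """Append one prediction record; a real service would write to the same
    warehouse that holds the downstream business metrics."""
    record = {
        "prediction_id": str(uuid.uuid4()),
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "features": features,
        "prediction": prediction,
        "model_version": "2022-07-08-1",  # illustrative version tag
    }
    with open(sink_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["prediction_id"]


# Later, outcomes (e.g. whether the recommended product was purchased) can be
# joined to these records on prediction_id in the warehouse.
```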

Conclusion

The complexity introduced by having three different shifting dimensions is not trivial, especially given that the development process often involves multiple stakeholders. Best practices are still in their early days because operationalizing ML projects is such a new field. With more and more tools and practitioners entering this market, I feel confident that best practices will emerge just like they did in traditional software, DevOps, and even data engineering.

As Andrew Ng puts it, the point of these MLOps tools and processes is to make “AI an efficient and systematic process.” The more we can streamline these processes, the more companies can begin to benefit from ML.