BentoML Or Triton Inference Server? Choose Both!

Mar 22, 2023 • Written By Aaron Pham

⚠️ Outdated content

Note: The content in this blog post may no longer be applicable, and the BentoML team is working on a new implementation. For more information, see the BentoML documentation.

Recent months have seen a surge of development around large language models (LLMs) and their applications, such as ChatGPT, Stable Diffusion, and Copilot.

However, deploying and serving LLMs at scale is a challenging task that requires specific domain expertise and inference infrastructure. A rough estimate of running ChatGPT shows that serving efficiency is critical to making such models work at scale. These operations are often known as Large Language Model Operations (LLMOps). LLMOps is generally considered a subset of MLOps, a set of practices combining software engineering, DevOps, and data science to automate and scale the end-to-end lifecycle of ML models.

Teams can encounter several problems when running inference on large models, including:

  • Resource utilisation: Large models require a significant amount of computational power to run, which can be challenging for teams with limited resources. A serving framework should utilise all available resources to be cost-effective.
  • Model optimisation: Large models often contain many redundant layers and parameters that can be pruned to reduce model size and speed up inference. A serving framework should ideally support model optimisation libraries to aid this process.
  • Serving latency: Large language models often require complex batching strategies to enable real-time inference. A serving framework should be equipped with batching strategies that optimise for low-latency serving.

[Image: BentoML + Triton Inference Server (triton_bento.png)]

In this blog post, we will demonstrate how the capabilities of BentoML and Triton Inference Server can help you solve these problems.

What Is Triton Inference Server?

Triton Inference Server is a high-performance, open-source inference server for serving deep learning models. It is designed to serve models from a variety of deep learning frameworks, such as ONNX, TensorFlow, and TensorRT. It also includes optimisations that maximise hardware utilisation through concurrent model execution and efficient batching strategies.

Triton Inference Server is great for serving large language models, where you want a high-performance inference server that can utilise all available resources with complex batching strategies.

What Is BentoML?

BentoML is an open-source platform designed to facilitate the development, shipping, and scaling of AI applications. It empowers teams to rapidly develop AI applications that involve multiple models and custom logic using Python. Once developed, BentoML allows these applications to be seamlessly shipped to production on any cloud platform with engineering best practices already integrated. Additionally, BentoML makes it easy to scale these applications efficiently based on usage, ensuring that they can handle any level of demand.

What Motivated The Triton Integration?

Starting with BentoML v1.0.16, Triton Inference Server can be used seamlessly as a Runner. Runners are abstractions of logic that can execute on either CPU or GPU and scale independently. Prior to the Triton integration, one drawback of Python Runners was the Global Interpreter Lock (GIL), which allows only one thread to execute Python code at a time. While model inference can still run on the GPU or across multiple CPU threads, the IO logic is still subject to the limitations of the GIL, which limits utilisation of the underlying hardware (CPU and GPU). Triton's C++ runtime is optimised for high-throughput model serving. By using Triton as a Runner, users can take full advantage of Triton's high-performance inference while continuing to enjoy all the features that BentoML offers.

Too Much Talking? OK, Let's Dive In!

In the following tutorial, we will use a PyTorch YOLOv5 object detection example. The source code can be found in the Triton PyTorch YOLOv5 example project. You can also find the TensorFlow and ONNX examples under the same directory.

The TL;DR is that BentoML lets users run Triton Inference Server as a Runner via bentoml.triton.Runner:

triton_runner = bentoml.triton.Runner("triton-runner", model_repository="s3://org/model_repository")

In order to use the bentoml.triton API, users are required to have the Triton Inference Server container image available locally.

docker pull nvcr.io/nvidia/tritonserver:23.01-py3

Install BentoML with the Triton extension:

pip install -U "bentoml[triton]"

The following section assumes a basic understanding of the BentoML architecture. If you are new to BentoML, we recommend reading our Getting Started guide first.

Preparing Your Model

To prepare your model repository under your BentoML project, you will need to put your model in the following file structure:

» tree model_repository
model_repository
└── torchscript_yolov5s
    ├── 1
    │   └── model.pt
    └── config.pbtxt

Where 1 is the version of the model, and model.pt is the TorchScript model.

Note that the model weight file must be named model.<extension> for all Triton models.

The config.pbtxt file is the model configuration that tells Triton how to serve this model.

An example config.pbtxt for the YOLOv5 model is as follows:

platform: "pytorch_libtorch"
input {
  name: "INPUT__0"
  data_type: TYPE_FP32
  dims: -1
  dims: 3
  dims: 640
  dims: 640
}
output {
  name: "OUTPUT__0"
  data_type: TYPE_FP32
  dims: -1
  dims: 25200
  dims: 85
}

Note that for PyTorch models, you will need to export your model to TorchScript first. Refer to PyTorch's guide to learn more about how to convert your model to TorchScript.
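As a rough sketch of what that export can look like, here is a minimal example using torch.jit.trace. The module below is a stand-in for the real YOLOv5 network (the official YOLOv5 export script handles its model-specific details); only the input shape and output path come from the layout above.

import os

import torch
import torch.nn as nn

# Stand-in module used purely to illustrate the export flow; substitute your
# actual YOLOv5 model here.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval()

# Trace with a dummy input matching the 3 x 640 x 640 shape from config.pbtxt.
example_input = torch.rand(1, 3, 640, 640)
traced = torch.jit.trace(model, example_input)

# Save under the versioned directory expected by the Triton model repository.
os.makedirs("model_repository/torchscript_yolov5s/1", exist_ok=True)
traced.save("model_repository/torchscript_yolov5s/1/model.pt")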

Create A Triton Runner

Now that we have our model repository ready, we can create a Triton Runner to use alongside other BentoML Runners.

triton_runner = bentoml.triton.Runner("triton-runner", model_repository="./model_repository")

You can also use S3 or GCS as your model repository by passing the path to your S3 or GCS bucket as the model_repository argument. If an S3 or GCS bucket is used, the model will not be packaged into the Bento image, but downloaded at runtime before serving.

triton_runner = bentoml.triton.Runner("triton-runner", model_repository="gcs://org/model_repository")

Each model in the model repository can be accessed as an attribute of this triton_runner object.

For example, the model torchscript_yolov5s can be accessed via triton_runner.torchscript_yolov5s, and you can invoke inference on it with the run or async_run method. This is similar to how BentoML's other built-in Runners work.

@svc.api(input=bentoml.io.Image(), output=bentoml.io.NumpyNdarray())
async def infer(im: Image) -> NDArray[t.Any]:
    inputs = preprocess(im)
    InferResult = await triton_runner.torchscript_yolov5s.async_run(inputs)
    return InferResult.as_numpy("OUTPUT__0")

Let's unpack this code snippet. First, we define an async API that takes in an image and returns a NumPy array. We then preprocess the input image and pass it to the torchscript_yolov5s model via triton_runner.torchscript_yolov5s.async_run.
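The preprocess helper above is not part of the integration itself. A minimal sketch of one possible implementation, assuming a PIL image input and the 3 x 640 x 640 FP32 shape declared in config.pbtxt (details such as letterboxing are omitted):

import numpy as np
from PIL import Image

def preprocess(im: Image.Image) -> np.ndarray:
    """Resize to 640x640, scale to [0, 1], and lay out as NCHW float32."""
    im = im.convert("RGB").resize((640, 640))
    arr = np.asarray(im, dtype=np.float32) / 255.0  # HWC in [0, 1]
    arr = arr.transpose(2, 0, 1)                    # HWC -> CHW
    return np.expand_dims(arr, 0)                   # add the batch dimension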

The async_run and run methods behave as follows:

async_run and run accept either all positional arguments or all keyword arguments, never a mix of the two. The arguments must match the input signature of the model specified in the config.pbtxt file.

From the aforementioned config.pbtxt, we can see that the model has a single input, INPUT__0: a TYPE_FP32 tensor of shape 3 x 640 x 640 with a variable batch dimension. This means the async_run/run method can take either a single positional argument or a single keyword argument named INPUT__0.

# valid
triton_runner.torchscript_yolov5s.async_run(inputs)
triton_runner.torchscript_yolov5s.async_run(INPUT__0=inputs)

run/async_run returns an InferResult object, which is a wrapper around the response from Triton Inference Server. Refer to the internal docstring for more details.

Additionally, the Triton runner also exposes all tritonclient model management APIs so that users can fully utilize all features provided by Triton Inference Server.

For example, one can dynamically load and unload models via the /load_model and /unload_model endpoints respectively:

@svc.api(
    input=bentoml.io.Text.from_sample("torchscript_yolov5s"),
    output=bentoml.io.JSON()
)
async def unload_model(input_model: str):
    await triton_runner.unload_model(input_model)
    return {"unloaded": input_model}


@svc.api(
    input=bentoml.io.Text.from_sample("torchscript_yolov5s"),
    output=bentoml.io.JSON()
)
async def load_model(input_model: str):
    await triton_runner.load_model(input_model)
    return {"loaded": input_model}

Packaging BentoService With Triton Inference Server

To package your BentoService with Triton Inference Server, you can add the following to your existing bentofile.yaml:

include:
  - /model_repository
docker:
  base_image: nvcr.io/nvidia/tritonserver:22.12-py3

Note that the base_image is the Triton Inference Server docker image from NVIDIA's container catalog.

If the model repository is stored in S3 or GCS, there is no need to add the include section.

That's it! Build the BentoService and containerize it with bentoml build and bentoml containerize respectively:

bentoml build
bentoml containerize triton-integration:latest

Conclusion

Congratulations! You can now fully utilize the power of Triton Inference Server with BentoML through bentoml.triton. You can read more about this integration in our documentation. If you enjoyed this article, feel free to support us by starring our GitHub repository and joining our community Slack channel!