3 Reasons Why Data Scientists Should Care About GRPC For Serving Models

Oct 18, 2022 • Written By Sean Sheng

Introduction

With the end of Moore’s Law, the future of computing is distributed. Consider an application that requires scoring an ensemble of models, processing input data, and performing feature transformation. These workloads will likely run on separate machines to work around resource limits. Distributing the workloads also improves runtime efficiency: each type of workload can run on the hardware profile best suited to it (CPU- or GPU-intensive) and scale independently based on usage.

A common practice is to implement the aforementioned workloads as distributed services using gRPC. gRPC is a powerful framework that has emerged and matured over the last decade. It helps distributed services define well-specified interfaces and protocols, which become the language these services use to communicate with each other. However, building distributed services is hard, and the skillset is not core to most data science teams. Building ML services from scratch using gRPC is costly both in engineering effort and time to production. Most teams simply cannot afford it.

Thanks to the advancement of modern model serving frameworks, building ML services has become significantly simpler for data science teams. At the same time, most of these frameworks support gRPC as a service protocol while abstracting away its complexities. With the recent release of the gRPC preview in BentoML, this article uses practical examples to discuss 3 reasons why data scientists should care about gRPC for model serving.

An image generated from the prompt “a cartoon bento box with delicious food items” by a Stable Diffusion model served with BentoML over gRPC.

Multi-Language Standardization

The value of a model is not realized until the ML service can be consumed by upstream applications. However, the applications calling the ML service are not necessarily implemented in the same language as the service itself. They require specialized client and encoding libraries to communicate with the ML service. For example, a web application served with Node.js is implemented in JavaScript, while a Kafka stream processor is typically implemented in Java or Scala. Application developers have to learn the wire formats and send requests with compatible encodings using a client implemented in their native language. For example, with REST and JSON, callers of the ML service must send the required headers (e.g. content-type) and encode the payload in the corresponding format (e.g. multipart, application/json). There are no explicit rules governing what formats the headers and payload must take, so callers often resort to trial and error, making development and collaboration slow and error-prone, as the hypothetical REST call below illustrates.
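As an illustration, here is a hypothetical REST caller in Python. The endpoint path, headers, and payload shape are assumptions for the sake of the example; nothing in a plain REST/JSON contract enforces or documents them for the caller.

# A hypothetical REST call to an ML service. The endpoint path, the
# content-type header, and the payload shape are conventions the caller
# must discover and get right; the protocol itself does not enforce them.
import requests

response = requests.post(
    "http://localhost:3000/classify",              # assumed endpoint path
    headers={"Content-Type": "application/json"},  # must match server expectations
    json=[[3.5, 2.4, 7.8, 5.1]],                   # assumed payload shape
)
response.raise_for_status()
print(response.json())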

On the other hand, Python is the language of choice for most data science teams due to its large library ecosystem and community support (e.g. numpy, pandas, matplotlib). Its popularity will likely keep growing as Python becomes even easier to use and an increasingly common skillset. Since we can expect ML services to continue to be written in Python, solving the communication barrier across different languages is critical to building ML-powered applications.

gRPC is ideal for breaking down programming language barriers thanks to Protocol Buffers (Protobuf). Protobuf is a language-neutral serialization framework for service interfaces and structured data. A Protobuf schema communicates the expected interface and input format without ambiguity. On top of that, gRPC can generate client stubs from the Protobuf schema, so applications can interact with the ML service using a client in the language they are written in, removing the development and communication overhead incurred during collaboration.

Protobuf schemas are defined in .proto files. In BentoML, all services follow the simple interface definition below. The RPC named Call is the most basic form of an API, exchanging a single request for a single response, also known as a unary RPC.

service BentoService {
  // Call handles methodcaller of given API entrypoint.
  rpc Call(Request) returns (Response) {}
}
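Given a schema like this, the client stubs mentioned earlier can be generated with the grpcio-tools package. A minimal sketch, assuming the schema is saved as service.proto (depending on your setup, you may additionally need include paths for the well-known Google types):

# Generate Python message classes and gRPC stubs from service.proto.
from grpc_tools import protoc

protoc.main([
    "protoc",
    "-I.",                  # search path for .proto imports
    "--python_out=.",       # emits service_pb2.py (messages)
    "--grpc_python_out=.",  # emits service_pb2_grpc.py (client/server stubs)
    "service.proto",        # assumed file name
])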

The Request Protobuf is composed of two parts: the API name and the content. A service in BentoML can expose multiple APIs; the api_name field specifies which API on the service the request should call. The content field carries the input data in the form of tensors, tabular data, JSON, or files (audio, video, image). The Response Protobuf mirrors the Request, carrying the output data in its content field, except that it does not include an api_name field.

message Request {
  // api_name defines the API entrypoint to call.
  // api_name is the name of the function defined in bentoml.Service.
  // Example:
  //
  //     @svc.api(input=NumpyNdarray(), output=File())
  //     def predict(input: NDArray[float]) -> bytes:
  //         ...
  //
  // api_name is "predict" in this case.
  string api_name = 1;

  oneof content {
    // NDArray represents an n-dimensional array of arbitrary type.
    NDArray ndarray = 3;

    // DataFrame represents any tabular data type. We are using
    // DataFrame as a trivial representation for tabular type.
    DataFrame dataframe = 5;

    // Series portrays a series of values. This can be used for
    // representing Series types in tabular data.
    Series series = 6;

    // File represents any arbitrary file type. This can be
    // plaintext, image, video, audio, etc.
    File file = 7;

    // Text represents string inputs.
    google.protobuf.StringValue text = 8;

    // JSON is represented by using google.protobuf.Value.
    // see https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/struct.proto
    google.protobuf.Value json = 9;

    // Multipart represents a multipart message.
    // It comprises a mapping from a given type name to a subset of the aforementioned types.
    Multipart multipart = 10;

    // serialized_bytes is for data serialized in BentoML's internal serialization format.
    bytes serialized_bytes = 2;
  }

  // Tensor is similar to ndarray but with a name.
  // We are reserving it for now for future use.
  // repeated Tensor tensors = 4;
  reserved 4, 11 to 13;
}
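To make the api_name mapping concrete, below is a minimal BentoML service sketch in the style of the BentoML quickstart. The model tag iris_clf and the classify signature are illustrative assumptions, not part of the schema above.

# A minimal BentoML service sketch. The function name "classify" is what
# a gRPC caller puts in the api_name field of the Request message.
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

# Assumes a model was previously saved under the tag "iris_clf".
iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    # Delegate inference to the runner; api_name is "classify" here.
    return iris_clf_runner.predict.run(input_series)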

With the above Protobuf schemas, developers can easily generate language bindings to interact with the ML service in the native languages of their applications. BentoML has detailed documentation on client generation for Python, Go, C++, Java, Node.js, and more, which you can share with any team wanting to interact with your service. The following example illustrates the JavaScript code a Node.js application can write to interact with your ML service.

function main() {
  const target = "localhost:3000";
  const client = new services.BentoServiceClient(
    target,
    grpc.credentials.createInsecure(),
  );

  var ndarray = new pb.NDArray();
  ndarray
    .setDtype(pb.NDArray.DType.DTYPE_FLOAT)
    .setShapeList([1, 4])
    .setFloatValuesList([3.5, 2.4, 7.8, 5.1]);

  var req = new pb.Request();
  req.setApiName("classify").setNdarray(ndarray);

  client.call(req, function (err, resp) {
    if (err) {
      console.log(err.message);
      if (err.code === grpc.status.INVALID_ARGUMENT) {
        console.log("Invalid argument", resp);
      }
    } else {
      if (resp.getContentCase() != pb.Response.ContentCase.NDARRAY) {
        console.error("Only support NDArray response.");
      }
      console.log("result: ", resp.getNdarray().toObject());
    }
  });
}
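For comparison, a Python caller looks similar. This is a minimal sketch that assumes the stubs were generated from the schema above (the service_pb2 / service_pb2_grpc module names depend on the .proto file name):

# A minimal Python gRPC client sketch; module names are assumptions based
# on stubs generated from the BentoService schema with grpcio-tools.
import grpc

import service_pb2 as pb
import service_pb2_grpc as services

def main():
    # Plaintext channel for local development; use TLS credentials in production.
    with grpc.insecure_channel("localhost:3000") as channel:
        stub = services.BentoServiceStub(channel)

        ndarray = pb.NDArray(
            dtype=pb.NDArray.DType.DTYPE_FLOAT,
            shape=[1, 4],
            float_values=[3.5, 2.4, 7.8, 5.1],
        )
        request = pb.Request(api_name="classify", ndarray=ndarray)

        response = stub.Call(request)
        if response.WhichOneof("content") != "ndarray":
            raise ValueError("Only NDArray responses are supported.")
        print("result:", list(response.ndarray.float_values))

if __name__ == "__main__":
    main()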

Binary Encoding

Data encoding accuracy, such as floating-point precision, is critical to ensuring that a model behaves consistently between development and production; inconsistencies here contribute to what is commonly known as “train-serve skew”. Data scientists frequently encounter different inference results when the model is served in memory versus behind an ML service. While many other factors can be at play, a common contributor is input and output encoding. In text-based encoding schemes like JSON, structured data such as tensors and tabular data are encoded as strings. However, the string format is usually suboptimal for representing numeric data types, because the conversion to and from text can be lossy depending on the encoder.

gRPC guarantees consistent encoding precision by using Protobuf’s binary format to ensure lossless conversions over the wire. BentoML allows all numeric types supported by Protobuf to be defined in the input and output data. As illustrated below, the data type of a tensor represented by an N-dimensional array can be precisely defined in the dtype field and reliably serialized and deserialized by the Protobuf library. A gRPC BentoML service sends and receives data with certainty, ensuring that both server and client, regardless of implementation, see the same data without surprises.

message NDArray {
  // Represents data type of a given array.
  enum DType {
    // Represents a None type.
    DTYPE_UNSPECIFIED = 0;
    // Represents a float type.
    DTYPE_FLOAT = 1;
    // Represents a double type.
    DTYPE_DOUBLE = 2;
    // Represents a bool type.
    DTYPE_BOOL = 3;
    // Represents an int32 type.
    DTYPE_INT32 = 4;
    // Represents an int64 type.
    DTYPE_INT64 = 5;
    // Represents a uint32 type.
    DTYPE_UINT32 = 6;
    // Represents a uint64 type.
    DTYPE_UINT64 = 7;
    // Represents a string type.
    DTYPE_STRING = 8;
  }

  // dtype is the data type of the given array.
  DType dtype = 1;

  // shape is the shape of the given array.
  repeated int32 shape = 2;

  // represents string parameter values.
  repeated string string_values = 5;
  // represents float parameter values.
  repeated float float_values = 3 [packed = true];
  // represents double parameter values.
  repeated double double_values = 4 [packed = true];
  // represents bool parameter values.
  repeated bool bool_values = 6 [packed = true];
  // represents int32 parameter values.
  repeated int32 int32_values = 7 [packed = true];
  // represents int64 parameter values.
  repeated int64 int64_values = 8 [packed = true];
  // represents uint32 parameter values.
  repeated uint32 uint32_values = 9 [packed = true];
  // represents uint64 parameter values.
  repeated uint64 uint64_values = 10 [packed = true];
}
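As a quick sketch of the lossless round trip, the snippet below serializes doubles through the NDArray message and back, assuming Python stubs generated from this schema (the service_pb2 module name is an assumption):

# A minimal sketch of a lossless binary round trip through NDArray.
import service_pb2 as pb

original = [0.1, 0.2, 0.1 + 0.2]  # doubles with no exact short decimal form

message = pb.NDArray(
    dtype=pb.NDArray.DType.DTYPE_DOUBLE,
    shape=[3],
    double_values=original,
)
wire = message.SerializeToString()     # compact binary wire format
decoded = pb.NDArray.FromString(wire)

# Binary encoding preserves every bit of the IEEE 754 doubles.
assert list(decoded.double_values) == original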

Performance

It goes without saying that inference latency is one of the most important requirements of an ML service. No matter how good the inference results are, much of their value is lost if they are not returned in a timely manner. Consider a product recommendation service on an e-commerce website or an auto-completion (type-ahead) service in an email application. In both cases, inference results must be returned within a short window to deliver value to users. Amazon research famously found that every 100 milliseconds of added latency reduced sales by 1%. In a world where milliseconds matter, a good ML serving framework will optimize for even the smallest performance gains.

gRPC leverages the HTTP/2 protocol, which offers a set of desirable performance improvements such as binary encoding, header compression, and stream multiplexing. Most importantly, gRPC has a large, active community behind it and is battle-tested in the most performance-critical use cases. The Python implementation of gRPC is still subject to the Global Interpreter Lock (GIL), which allows only one thread to run at any given time. BentoML works around the GIL in its architecture by scheduling multiple instances of the gRPC server (matching the number of CPU cores) based on the underlying hardware to achieve maximum parallelism. Benchmark studies suggest that gRPC can be multiple times faster than the REST equivalent. With binary encoding, structured payloads like tensors and tabular data can be 30% smaller in gRPC compared to REST using JSON. It is worth noting that the latency reduction is most pronounced with lighter models, where the I/O overhead is more significant relative to the compute time.
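As a rough, back-of-the-envelope comparison (not a rigorous benchmark), the sketch below contrasts the wire size of a packed-double NDArray with the same values serialized as JSON text; the service_pb2 module name is again an assumption:

# A rough payload-size comparison between Protobuf binary and JSON text
# encodings of the same tensor; not a rigorous benchmark.
import json
import random

import service_pb2 as pb  # assumed generated stubs

values = [random.random() for _ in range(1000)]

json_payload = json.dumps(values).encode("utf-8")

message = pb.NDArray(
    dtype=pb.NDArray.DType.DTYPE_DOUBLE,
    shape=[1000],
    double_values=values,  # [packed = true]: 8 bytes per double on the wire
)
pb_payload = message.SerializeToString()

# Packed doubles cost 8 bytes each, while their JSON text form is typically
# 17-19 characters, so the binary payload is a fraction of the JSON size.
print(f"JSON: {len(json_payload)} bytes, Protobuf: {len(pb_payload)} bytes")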

gRPC trades ease-of-use for flexibility. While gRPC is flexible enough to build practically any service, developing ML services with gRPC from scratch is hard and requires the help of other infrastructure like Docker for containerization and Kubernetes for deployment. This approach can be costly both in engineering effort and time-to-market. Many model serving frameworks make it easy to enable gRPC by abstracting away infrastructure details from developers. For example, in BentoML, services can be served over either HTTP or gRPC without changing a single line of code.

# Serving bentos over HTTP
bentoml serve-http iris_classifier:latest

# Serving bentos over gRPC
bentoml serve-grpc iris_classifier:latest

Check out the 10-minute tutorial on how to serve models over gRPC in BentoML.

Conclusions

gRPC is a powerful framework that comes with a list of out-of-the-box benefits valuable to data science teams at all stages.

1. Multi-language support allows data scientists to work with the languages and libraries they are most familiar with, while leveraging standardized Protobuf schemas to generate the service interface and clients in the native languages of upstream applications.

2. Binary encoding guarantees data precision and ensures lossless data conversion over the wire, offering peace of mind.

3. Performance optimizations help achieve low latency, especially for lighter and faster models.

Developing ML services with gRPC from scratch is hard. The required skillset is not core to data science teams and comes with a steep learning curve. It can be costly both in engineering effort and time-to-market. Consider using an ML serving framework that supports gRPC and provides a good abstraction over the underlying infrastructure.

If you enjoyed this article, please show your support by ⭐ the BentoML Project on GitHub and join the Slack Community to meet more ML practitioners around the world.