BentoML - v1.0.0

@ssheng released this 13 Jul 09:05 · 58aa69b

🍱 The wait is over. BentoML has officially released v1.0.0. We are excited to share the notable features and improvements with you.

  • Introduced BentoML Runner, an abstraction for parallel model inference. It allows the compute-intensive model inference step to scale separately from the transformation and business logic. A Runner is easy to instantiate and invoke, but behind the scenes, BentoML optimizes for micro-batching and fans out inference if needed. A simple example of instantiating a Runner appears after this list. Learn more about using runners.
  • Redesigned how models are saved, moved, and loaded with BentoML. We introduced new primitives which allow users to call a save_model() method that saves the model in the most optimal way based on the recommended practices of the ML framework. The model is then stored in a flexible local repository where users can use “import” and “export” functionality to push and pull “finalized” models from remote locations like S3 (see the save/export sketch after this list). Bentos can be built locally or remotely with these models. Once built, Yatai or bentoctl can easily deploy to the cloud service of your choice. Learn more about preparing models and building bentos.
  • Enhanced micro-batching capability. With the new runner abstraction, batching is even more powerful: when incoming data is spread across different transformation processes, the runner fans inference requests back in, so multiple inputs are batched into a single inference call. Most ML frameworks implement some form of vectorization that improves performance on multiple inputs at once. Our adaptive batching not only batches inputs as they are received, but also regresses over the timing of the last several groups of inputs to optimize the batch size and latency window (see the batching sketch after this list).
  • Improved reproducibility of the model by recording and locking the versions of dependent libraries. We use these versions to package the correct dependencies so that the environment in which the model runs in production is identical to the environment it was trained in. All direct and transitive dependencies are recorded and deployed with the model when running in production. In v1.0 we now support Conda as well as several different ways to customize your pip packages when building your bento (see the bentofile.yaml sketch after this list). Learn more about building bentos.
  • Simplified Docker image creation during containerization to generate the right image for you depending on the features you’ve decided to implement in your service. For example, if your runner specifies that it can run on a GPU, we will automatically choose the right NVIDIA Docker base image when containerizing your service. If needed, we also provide the flexibility to customize your Docker image (see the docker configuration sketch after this list). Learn more about containerization.
  • Improved input and output validation with native type validation rules. NumPy arrays and Pandas DataFrames can specify a static shape or even infer a schema dynamically from sample data. The Pydantic schema produced for each endpoint also integrates with our Swagger UI, so each endpoint is better documented for sharing (see the endpoint sketch after this list). Learn more about service APIs and IO Descriptors.
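
Below are short sketches of the features above. First, a minimal sketch of instantiating and invoking a Runner, assuming a scikit-learn model was previously saved under the illustrative tag iris_clf:latest:

```python
import bentoml

# Retrieve the saved model and wrap it in a Runner.
# "iris_clf:latest" is an illustrative model tag.
iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

# Runners are normally managed by a Service; init_local() runs the
# runner in-process for debugging and testing.
iris_clf_runner.init_local()
print(iris_clf_runner.predict.run([[5.9, 3.0, 5.1, 1.8]]))
```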
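
A sketch of the redesigned save/export workflow, using a toy scikit-learn classifier; the model name and S3 path are illustrative:

```python
import bentoml
from sklearn import datasets, svm

# Train a toy model.
iris = datasets.load_iris()
clf = svm.SVC(gamma="scale")
clf.fit(iris.data, iris.target)

# save_model() persists the model to the local model store using the
# framework's recommended serialization.
saved_model = bentoml.sklearn.save_model("iris_clf", clf)
print(saved_model.tag)

# Push a finalized model to a remote location and pull it back elsewhere.
bentoml.models.export_model("iris_clf:latest", "s3://my_bucket/iris_clf.bentomodel")
bentoml.models.import_model("s3://my_bucket/iris_clf.bentomodel")
```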
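
Batching behavior is declared in the model's signatures at save time; a sketch with a stand-in PyTorch module:

```python
import bentoml
import torch

model = torch.nn.Linear(4, 3)  # stand-in for a trained model

# Mark __call__ as batchable so the runner can group concurrent requests
# into a single inference call; batch_dim is the dimension inputs are
# batched along.
bentoml.pytorch.save_model(
    "demo_model",  # illustrative model name
    model,
    signatures={"__call__": {"batchable": True, "batch_dim": 0}},
)
```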
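
Dependency locking and Conda support are configured when building a bento; a bentofile.yaml sketch with illustrative package names:

```yaml
service: "service:svc"
include:
  - "*.py"
python:
  packages:           # direct dependencies; transitive versions are locked too
    - scikit-learn
    - pandas
  lock_packages: true # record exact versions for a reproducible environment
conda:
  channels:
    - conda-forge
  dependencies:
    - cudatoolkit
```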
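
Base image selection and customization live in the docker section of the same bentofile.yaml; the values below are illustrative:

```yaml
docker:
  distro: debian
  python_version: "3.9"
  cuda_version: "11.6.2"     # selects a CUDA-enabled base image for GPU runners
  system_packages:
    - git
  setup_script: "./setup.sh" # illustrative hook for further customization
```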
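
Finally, a sketch of typed IO validation on a service endpoint; the service name and model tag are illustrative:

```python
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

runner = bentoml.sklearn.get("iris_clf:latest").to_runner()  # illustrative tag
svc = bentoml.Service("iris_classifier", runners=[runner])

# Enforce a static shape and dtype on the input; alternatively,
# NumpyNdarray.from_sample(sample_array) infers the schema from sample data.
@svc.api(
    input=NumpyNdarray(shape=(-1, 4), dtype=np.float32, enforce_shape=True),
    output=NumpyNdarray(),
)
def classify(input_array: np.ndarray) -> np.ndarray:
    return runner.predict.run(input_array)
```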

⚠️ BentoML v1.0.0 is backward incompatible with v0.13.1. If you wish to stay on the v0.13.1 LTS version, please lock the dependency with bentoml==0.13.1. We have also prepared a migration guide from v0.13.1 to v1.0.0 to help with your project migration. We are committed to supporting the v0.13-LTS versions with critical bug fixes and security patches.

🎉 After years of seeing hundreds of model serving use cases, we are proud to present the official release of BentoML 1.0. We could not have done it without the growth and support of our community.