When it comes to deploying machine learning models into production, there’s no shortage of tools available. I’ve been exploring the landscape of ML inference frameworks, trying to understand the trade-offs and strengths of different options. I spent a bit of time investigating BentoML a while back, and really liked its user-friendly design and focus on model serving.
How Widely Used Is BentoML?
For fun—and a bit of insight—I compared three well-known ML serving tools using Google Trends: BentoML, NVIDIA’s Triton Inference Server, and KServe. Unsurprisingly, Triton Server had the highest peak of search interest, hitting its maximum in November 2024. At that point, BentoML garnered only 32% of the search volume Triton received. As of September 2025, Triton is still leading at 83% of its peak, followed by KServe at 73%, while BentoML trails at 36%.
These numbers are all relative to the most searched point in time, but they give an interesting look into community interest. It’s also worth noting that all three are far behind MLflow in terms of search volume. This suggests that experiment tracking remains a bigger concern for many practitioners than model serving. It probably also helps that the ecosystem has homed in on MLflow for experiment tracking, whereas there are still many competing solutions for model serving.
Applying BentoML in Practice: Time Series Forecasting with Chronos
To put BentoML to the test, I chose to work on a time series forecasting problem as it’s an area I am familiar with. For this, I used Chronos, a pre-trained model developed by Amazon that adapts the architecture of large language models (LLMs) to time series data.
The premise behind Chronos is compelling: both LLMs and time series models are fundamentally about predicting the next item in a sequence. Leveraging this similarity, the researchers at Amazon trained Chronos on 42 open-source time series datasets, and in several cases, it outperformed models trained specifically on those datasets. You can read more about the methodology and benchmarks in the paper here.
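The next-token framing can be sketched in a few lines of plain Python: Chronos mean-scales a series and maps each value into one of a fixed set of bins, so forecasting becomes predicting the next bin id, just like predicting the next word. The bin count and clipping limit below are illustrative choices, not the paper’s exact values.

```python
def tokenize(series, num_bins=10, limit=3.0):
    """Mean-scale a series, then map each value to a uniform bin id (a 'token')."""
    # Scale by the mean absolute value so different series share one vocabulary.
    scale = sum(abs(x) for x in series) / len(series) or 1.0
    width = 2 * limit / num_bins  # uniform bins over [-limit, limit]
    tokens = []
    for x in series:
        x = x / scale
        x = max(-limit, min(limit, x))  # clamp outliers into range
        tokens.append(min(int((x + limit) / width), num_bins - 1))
    return tokens, scale

# A roughly flat series maps to neighbouring bin ids around the middle.
tokens, scale = tokenize([10.0, 12.0, 9.0, 11.0])
```

A sequence model trained on such tokens can then be sampled autoregressively, and the sampled bin ids rescaled back into real values.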
I tested Chronos with the Traffic Hourly dataset, which was one of the out-of-domain datasets used by the authors of Chronos. This means the model hadn’t seen this data during training, making it a good test of generalisation.
Getting Chronos Running with BentoML
I pulled the chronos-t5-small model from Hugging Face and began investigating it through some hands-on experimentation (check out the notebook directory in the repo if you’re interested in the details). Deploying it with BentoML was mostly smooth, but I did run into a few hiccups.
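For context, querying the model with the chronos-forecasting package looks roughly like this. This is a sketch of exploratory usage rather than the exact notebook code, and the context values, prediction length, and sample count are arbitrary choices for illustration.

```python
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

# Load the small T5-based checkpoint from Hugging Face.
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Chronos takes the historical context as a 1-D tensor and samples forecasts.
context = torch.tensor([10.0, 12.0, 9.0, 11.0, 13.0, 10.0])
forecast = pipeline.predict(context, prediction_length=8, num_samples=20)

# forecast has shape [num_series, num_samples, prediction_length];
# a point forecast is typically the per-step median of the samples.
median = forecast[0].quantile(0.5, dim=0)
```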
One issue involved serialisation errors with Pydantic, which took me a while to work out. It turns out the key was to ensure that the input and output types were correctly defined. Another oddity was that the context parameter was not present in the input specification, seemingly due to naming conventions or how the schema was auto-generated.
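To illustrate the fix, explicitly typed Pydantic models for the request and response are what resolved the serialisation errors for me. The field names below are illustrative, not the exact ones in the repo.

```python
from pydantic import BaseModel

class ForecastInput(BaseModel):
    context: list[float]        # historical observations
    prediction_length: int = 8  # steps ahead to forecast

class ForecastOutput(BaseModel):
    median: list[float]
    low: list[float]
    high: list[float]

# With concrete types declared, JSON round-trips validate cleanly.
payload = ForecastInput.model_validate_json('{"context": [1.0, 2.0, 3.0]}')
```

With schemas like these attached to the service endpoint, BentoML can generate a correct input specification instead of guessing from untyped arguments.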
Despite these minor setup challenges, it was straightforward to get the model containerised and running inference quickly. With just a bentoml build followed by a bentoml containerize, I had a working Docker container ready to go. You can even inspect the generated Bento bundle by running bentoml get -o path, which reveals its internal structure, including the Dockerfile and other metadata. The corresponding code is on GitHub here.
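The workflow boils down to three commands. The bento tag below (chronos_forecast:latest) is a made-up placeholder — use whatever tag bentoml build reports, and note that build assumes a bentofile.yaml is already present in the project.

```shell
bentoml build                                 # package service code + model into a Bento
bentoml containerize chronos_forecast:latest  # build a Docker image from the Bento
bentoml get chronos_forecast:latest -o path   # print the bundle's location on disk
```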
Final Thoughts
Exploring model inference tools through the lens of a real-world problem like time series forecasting was both fun and instructive. While BentoML isn’t the most searched-for solution today, it certainly holds its own in terms of usability and developer experience. Meanwhile, tools like Triton Server and KServe continue to attract more attention, each serving different needs and preferences.
The landscape of ML model deployment is still evolving, and it will be interesting to see if a single tool emerges as dominant, or if we continue to see a diversity of approaches. For now, BentoML offers a compelling balance of simplicity and power—especially for quickly moving from prototype to containerised model.