OpenLLM: Self-Host Any Open-Source LLM as an OpenAI-Compatible API

By Prahlad Menon

You’ve got Ollama for local experimentation. You’ve got vLLM for raw inference speed. But what if you want both — local simplicity with production-ready APIs?

That’s where OpenLLM comes in.

What OpenLLM Actually Does

OpenLLM, built by the BentoML team, lets you serve any open-source LLM as an OpenAI-compatible API with a single command:

pip install openllm
openllm serve llama3.2:1b

That’s it. You now have a server at localhost:3000 that speaks the OpenAI API protocol. Your existing code using the OpenAI Python client? It just works:

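from openai import OpenAI

# Point the standard OpenAI client at the local OpenLLM server.
# api_key is a placeholder; the client requires a non-empty value.
client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Explain RAG in one paragraph"}],
    stream=True,
)

# With stream=True the call returns an iterator of chunks; print text as it arrives.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)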

No adapter libraries, no rewritten call sites. Just swap the base URL and the model name.
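
To see what "OpenAI-compatible" means on the wire, here is a minimal sketch that builds the same chat request using only the Python standard library. It assumes the server from above is listening on localhost:3000 and exposes the standard /v1/chat/completions route; build_chat_request and ask are illustrative helper names, not part of OpenLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000/v1"  # assumed: the local server started above


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST request in the OpenAI chat-completions format."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key, mirroring api_key='na' in the client example.
            "Authorization": "Bearer na",
        },
        method="POST",
    )


def ask(prompt, model="meta-llama/Llama-3.2-1B-Instruct"):
    """Send the request and return the assistant's reply (needs a running server)."""
    request = build_chat_request(BASE_URL, model, prompt)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

Calling ask("Explain RAG in one paragraph") against a running server should return the reply text. Any HTTP client that can send this JSON body can talk to the endpoint, which is why existing OpenAI tooling carries over.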

Why Not Just Use Ollama?

Ollama is fantastic for local development. I use it daily. But it has limitations:

  • Its own native API format, with only partial OpenAI compatibility
  • Limited production features — no built-in autoscaling, observability, or enterprise deployment
  • Single-user focus — designed for developers, not teams

OpenLLM fills the gap between “running models on my laptop” and “serving models in production.” It uses vLLM as the inference backend (the same engine behind many production LLM deployments) and adds:

  • Full OpenAI API compatibility — streaming, function calling, the works
  • Built-in Chat UI — hit /chat and you’ve got a web interface
  • Docker/Kubernetes ready — one command to containerize
  • BentoCloud integration — managed deployment with autoscaling
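
The streaming item in the list above follows the OpenAI convention of server-sent events: each chunk arrives as a "data: {...}" line and the stream ends with "data: [DONE]". Below is a minimal sketch of decoding such lines. collect_stream_text is an illustrative helper, and the sample lines are hand-written for the example, not captured from OpenLLM.

```python
import json


def collect_stream_text(lines):
    """Join the text deltas from OpenAI-style streaming lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # sentinel that closes the stream
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


# Hand-written sample lines in the streaming format (illustrative only).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))
```

The OpenAI Python client does this decoding for you; the sketch is only meant to show what travels over the connection.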

Supported Models

The model support is impressive. The catalog covers most of the popular open-weight families, including DeepSeek R1, Llama 4, Qwen 2.5, Mistral, Phi-4, and Gemma 3. A few examples:

Model            Command
Llama 3.3 70B    openllm serve llama3.3:70b
DeepSeek R1      openllm serve deepseek:r1-671b
Qwen 2.5 7B      openllm serve qwen2.5:7b
Mistral 8B       openllm serve mistral:8b-2410
Phi-4 14B        openllm serve phi4:14b

For gated models (like Llama), you’ll need a Hugging Face token:

export HF_TOKEN=your_token_here
openllm serve llama3.2:1b
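
If you launch the server from a script, it can help to fail early when the token is missing. The sketch below wraps the command above with Python's subprocess module. serve_command and launch are hypothetical helper names, and the token check is a convenience of this example, not something OpenLLM itself enforces.

```python
import os
import subprocess


def serve_command(model_tag):
    """Return the argv list for `openllm serve <model_tag>`."""
    return ["openllm", "serve", model_tag]


def launch(model_tag, env=None):
    """Start the server, refusing to run if HF_TOKEN is not set."""
    env = dict(os.environ if env is None else env)
    if not env.get("HF_TOKEN"):
        raise RuntimeError("HF_TOKEN is not set; gated models may fail to download")
    # Blocks until the server process exits.
    return subprocess.run(serve_command(model_tag), env=env, check=True)
```

Calling launch("llama3.2:1b") then behaves like the two shell lines above, with a clearer error when the variable was never exported.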

The Production Path

Here’s where OpenLLM shines. You’ve prototyped locally. Now you need to deploy.

Option 1: Docker

OpenLLM models are packaged as “Bentos” — BentoML’s deployment units. Export to Docker and you’re container-ready.

Option 2: BentoCloud

One command deployment with autoscaling:

openllm deploy llama3.2:1b --env HF_TOKEN

You get managed infrastructure, automatic scaling, and a dashboard. It’s the “Vercel for LLMs” experience — deploy in seconds, scale automatically, pay for what you use.

Custom Models

Running a fine-tuned model? OpenLLM supports custom model repositories. Build your model as a Bento, push to your repo, and serve it the same way:

openllm repo add my-models https://github.com/myorg/my-models
openllm serve my-custom-llama:v1

This is crucial for enterprises. You can maintain a private catalog of approved, fine-tuned models and deploy them with the same tooling.

When to Use What

Here’s my mental model:

  • Ollama — Local development, quick experiments, personal use
  • vLLM — Maximum performance, custom inference pipelines, research
  • OpenLLM — Production APIs, team deployment, OpenAI-compatible endpoints

If you’re building an application that calls OpenAI today but want the option to switch to open-source models tomorrow, OpenLLM makes that migration trivial. Same API, different backend.
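
One way to keep that option open is to choose the backend from configuration instead of hard-coding it. The following is a minimal sketch: the LLM_BACKEND variable, the BACKENDS table, and both helper names are made up for illustration, and the local URL assumes the server started earlier in this post.

```python
import os

# Hypothetical settings table; the local URL assumes the server from earlier.
BACKENDS = {
    "openai": {"base_url": "https://api.openai.com/v1", "key_var": "OPENAI_API_KEY"},
    "openllm": {"base_url": "http://localhost:3000/v1", "key_var": None},
}


def client_settings(env):
    """Pick base_url and api_key from an environment mapping."""
    name = env.get("LLM_BACKEND", "openai")  # LLM_BACKEND is a made-up variable
    backend = BACKENDS[name]
    key_var = backend["key_var"]
    # The local server gets the same placeholder key used earlier.
    api_key = env.get(key_var, "") if key_var else "na"
    return {"base_url": backend["base_url"], "api_key": api_key}


def make_client(env=os.environ):
    """Create an OpenAI client for the selected backend (needs `pip install openai`)."""
    from openai import OpenAI  # imported lazily so the settings logic has no dependencies

    return OpenAI(**client_settings(env))
```

With LLM_BACKEND=openllm set, make_client() returns a client pointed at the local server. The model name passed to chat.completions.create still has to match whatever the selected backend serves.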

The Bottom Line

The LLM serving landscape is maturing fast. We’ve moved past “can we run these models” to “how do we run them well in production.”

OpenLLM represents the next step: production-grade inference with zero API changes. Your code stays the same. Your infrastructure scales. Your costs can drop, depending on your workload and hardware.

If you’re still paying OpenAI prices for tasks that Llama 3.3 handles fine, OpenLLM is your exit ramp. One command to start, one command to deploy.

That’s the pitch. Check out the repo and give it a spin.