OpenLLM is a tool that lets you serve open-source LLMs as an OpenAI-compatible API with one command. Running 'pip install openllm && openllm serve llama3.2:1b' gives you an API at localhost:3000 that works with any OpenAI client library.

What models does OpenLLM support?

DeepSeek R1 (671B), Llama 4 (17B), Llama 3.3 (70B), Qwen 2.5 (7B), Phi 4 (14B), Gemma 3 (3B), and many more. The full list is at bentoml/openllm-models. GPU requirements range from a single 12G GPU to sixteen 80G GPUs (80G x16).

What is llmfit and why use it with OpenLLM?

llmfit probes your system and tells you which models will run on your hardware. It accounts for MoE offloading, dynamic quantization, speed estimation, and multi-GPU setups. Workflow: run llmfit first to pick a model, then openllm serve to run it.

How do I deploy OpenLLM to production?

'openllm deploy llama3.2:1b --env HF_TOKEN' deploys to BentoCloud with autoscaling, observability, and managed infrastructure. OpenLLM also supports Docker and Kubernetes deployment.

When should I use OpenLLM vs Ollama vs vLLM?

Ollama: local development, quick experiments. vLLM: maximum performance, custom inference pipelines. OpenLLM: production APIs, team deployment, OpenAI-compatible endpoints. OpenLLM bridges 'running on laptop' to 'serving in production.'

OpenLLM: Run Any Open-Source LLM as an API with One Command

How do I use OpenLLM with existing OpenAI code?

Just change base_url to 'http://localhost:3000/v1' and set api_key='na'. Works with LlamaIndex, LangChain, and any framework supporting OpenAI-compatible APIs. No other code changes needed.

By Prahlad Menon · Published 2026-02-24 · 3 min read

Want to run DeepSeek R1, Llama 3.3, or Qwen2.5 as your own private API? OpenLLM makes it trivially easy — one command, and you have an OpenAI-compatible endpoint ready to go.

Why This Matters

The gap between “I want to use an open-source LLM” and “I have a production-ready API endpoint” has traditionally involved a lot of infrastructure work. OpenLLM collapses that gap to a single command.

Quick Start

pip install openllm
openllm hello  # Interactive demo

To serve a model:

openllm serve llama3.2:1b

That’s it. You now have an API at http://localhost:3000 that’s compatible with OpenAI’s client libraries.
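
Model weights can take a while to load, so the endpoint may not answer the moment the command starts. Here is a minimal sketch of a readiness check using only the Python standard library. It assumes the server answers GET /v1/models (the standard OpenAI model-listing route) once it is ready; the function names, retry count, and delay are illustrative choices, not part of OpenLLM.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url="http://localhost:3000", timeout=2.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200.

    Assumption: the server exposes the OpenAI-style /v1/models route.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(probe=server_is_up, attempts=30, delay=2.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_ready() from a script after starting openllm serve in another terminal returns True once the endpoint responds, or False if it never does within the retry budget.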

The Model Library

OpenLLM supports many current open-source models, including:

Model	Parameters	GPU Memory Required
DeepSeek R1	671B	80G x16
Llama 4	17B	80G x8
Llama 3.3	70B	80G x2
Qwen 2.5	7B	24G
Phi 4	14B	80G
Gemma 3	3B	12G

The full list lives in bentoml/openllm-models.
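
To make the GPU column concrete, here is a small sketch that parses the specs from the table and checks them against a machine. It assumes "80G x16" means sixteen GPUs with 80 GB of memory each; the helper names are hypothetical, and the check is a plain comparison against the table, not how OpenLLM itself decides what can run.

```python
import re

# GPU requirements copied from the table above.
GPU_REQUIREMENTS = {
    "DeepSeek R1": "80G x16",
    "Llama 4": "80G x8",
    "Llama 3.3": "80G x2",
    "Qwen 2.5": "24G",
    "Phi 4": "80G",
    "Gemma 3": "12G",
}


def parse_gpu_spec(spec):
    """Turn '80G x16' into (80, 16) and '24G' into (24, 1)."""
    match = re.fullmatch(r"(\d+)G(?:\s*x(\d+))?", spec.strip())
    if match is None:
        raise ValueError(f"unrecognized GPU spec: {spec!r}")
    return int(match.group(1)), int(match.group(2) or 1)


def total_vram_gb(spec):
    """Total GPU memory implied by a spec, in GB."""
    per_gpu_gb, count = parse_gpu_spec(spec)
    return per_gpu_gb * count


def models_that_fit(per_gpu_gb, gpu_count):
    """List models whose per-GPU memory and GPU count are both satisfied."""
    fitting = []
    for model, spec in GPU_REQUIREMENTS.items():
        need_gb, need_count = parse_gpu_spec(spec)
        if per_gpu_gb >= need_gb and gpu_count >= need_count:
            fitting.append(model)
    return fitting


print(total_vram_gb("80G x16"))  # 1280
print(models_that_fit(24, 1))    # ['Qwen 2.5', 'Gemma 3']
```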

Use It Like OpenAI

Since OpenLLM exposes an OpenAI-compatible API, you can use it with any existing OpenAI tooling:

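from openai import OpenAI

# Point the standard OpenAI client at the local OpenLLM server.
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="na",  # Placeholder; not required for a local server
)

# stream=True returns an iterator of chunks instead of one full response.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

# Print each piece of the reply as it arrives.
for chunk in response:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)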

Works with LlamaIndex, LangChain, and any framework that supports OpenAI-compatible APIs.
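
For a sense of what "OpenAI-compatible" means on the wire, here is a minimal sketch that builds the same kind of chat request with only the Python standard library. The route and JSON shape follow the public OpenAI chat completions format; the function names are hypothetical, and the defaults assume the local server from the quick start is serving llama3.2:1b.

```python
import json
import urllib.request


def build_chat_request(
    prompt,
    model="meta-llama/Llama-3.2-1B-Instruct",
    base_url="http://localhost:3000/v1",
    api_key="na",
):
    """Build (but do not send) a POST request to /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON response
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, send_chat_request(build_chat_request("Explain quantum computing")) should return the reply as a string. Any tool that can produce this request shape can talk to the endpoint, which is why existing OpenAI integrations only need a new base URL.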

Built-in Chat UI

OpenLLM includes a web chat interface at the /chat path (http://localhost:3000/chat for a local server), so you don’t need to build your own frontend for testing.

Cloud Deployment

For production, OpenLLM integrates with BentoCloud:

openllm deploy llama3.2:1b --env HF_TOKEN

This gives you autoscaling, observability, and managed infrastructure.
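
Because local and deployed servers speak the same API, client code can stay identical and only the endpoint settings need to change. The sketch below shows one way to do that. The environment variable names LLM_BASE_URL and LLM_API_KEY are hypothetical, and the real URL and token for a BentoCloud deployment come from that deployment, which this article does not cover.

```python
import os
from dataclasses import dataclass

LOCAL_BASE_URL = "http://localhost:3000/v1"


@dataclass(frozen=True)
class Endpoint:
    base_url: str
    api_key: str


def resolve_endpoint(env=None):
    """Pick endpoint settings from the environment, defaulting to the local server.

    LLM_BASE_URL and LLM_API_KEY are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    base_url = env.get("LLM_BASE_URL", LOCAL_BASE_URL).rstrip("/")
    api_key = env.get("LLM_API_KEY", "na")
    return Endpoint(base_url=base_url, api_key=api_key)
```

The result can be passed straight into the OpenAI client, for example OpenAI(base_url=endpoint.base_url, api_key=endpoint.api_key), so the same script runs against a laptop or a deployment.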

Custom Models

You can add your own models by creating a custom repository following the BentoML format. This makes OpenLLM extensible beyond the default model catalog.

Bonus: Know What Will Actually Run

Before you openllm serve, you should know what your hardware can handle. That’s where llmfit comes in — a companion tool that probes your system and tells you exactly which models will run.

curl -fsSL https://llmfit.axjns.dev/install.sh | sh
llmfit

llmfit detects your CPU, RAM, and GPU (NVIDIA, AMD, Intel Arc, Apple Silicon), then scores 157+ models across quality, speed, fit, and context dimensions. It handles:

MoE expert offloading — Mixtral 8x7B needs 6.6GB with offloading, not 23.9GB
Dynamic quantization — Picks the best quantization (Q8→Q2_K) that fits your RAM
Speed estimation — Tokens/sec before you pull the weights
Multi-GPU setups — Aggregates VRAM across all detected GPUs
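
The Mixtral figures above can be reproduced with back-of-the-envelope arithmetic. This is a sketch, not llmfit's actual formula: it assumes weight memory is parameter count times bytes per parameter, uses Mixtral 8x7B's published sizes (about 46.7B parameters in total, about 12.9B active per token), and uses 0.512 bytes per parameter, a value back-derived from the article's own numbers that corresponds to roughly 4-bit quantization.

```python
# Published Mixtral 8x7B sizes, in billions of parameters.
MIXTRAL_TOTAL_PARAMS_B = 46.7   # all eight experts plus shared layers
MIXTRAL_ACTIVE_PARAMS_B = 12.9  # parameters actually used per token

# Assumption: about 4 bits per weight, back-derived from the figures above.
BYTES_PER_PARAM = 0.512


def weights_gb(params_billion, bytes_per_param=BYTES_PER_PARAM):
    """Approximate weight memory in GB: billions of parameters x bytes per parameter."""
    return params_billion * bytes_per_param


print(f"All experts in memory: {weights_gb(MIXTRAL_TOTAL_PARAMS_B):.1f} GB")   # 23.9 GB
print(f"Active experts only:   {weights_gb(MIXTRAL_ACTIVE_PARAMS_B):.1f} GB")  # 6.6 GB
```

The gap between the two numbers is the point of expert offloading: only the parameters used for a given token need to sit in fast memory, while the rest can live elsewhere.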

The TUI shows a ranked table of what will run well on your machine. Filter by fit level (Perfect, Good, Marginal) or use case (Coding, Reasoning, Chat).

Workflow: llmfit first to pick your model → openllm serve to run it.

GitHub: AlexsJones/llmfit

When to Use OpenLLM

Use OpenLLM when:

You want OpenAI API compatibility with open-source models
You need to self-host for privacy/compliance
You want quick prototyping with different models
You need cloud deployment with autoscaling

Consider alternatives when:

You just need local inference (Ollama might be simpler)
You need maximum performance tuning (vLLM directly)
You’re already deep in another serving framework

The Bottom Line

OpenLLM is the fastest path from “I want to try this open-source LLM” to “I have a production API.” The BentoML team has done the infrastructure work so you don’t have to.

GitHub: bentoml/OpenLLM

What’s your go-to setup for serving LLMs? Share your stack in the comments.