OpenLLM: Run Any Open-Source LLM as an API with One Command

By Prahlad Menon

Want to run DeepSeek R1, Llama 3.3, or Qwen2.5 as your own private API? OpenLLM makes it trivially easy — one command, and you have an OpenAI-compatible endpoint ready to go.

Why This Matters

The gap between “I want to use an open-source LLM” and “I have a production-ready API endpoint” has traditionally involved a lot of infrastructure work. OpenLLM collapses that gap to a single command.

Quick Start

pip install openllm
openllm hello  # Interactive demo

To serve a model:

openllm serve llama3.2:1b

That’s it. You now have an API at http://localhost:3000 that’s compatible with OpenAI’s client libraries.
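You can sanity-check the endpoint straight from the shell. This assumes the server started above is still running and uses OpenAI's standard chat-completions request shape; the model name must match the model you served:

```shell
# Query the local OpenLLM server via the OpenAI-compatible REST API.
curl -s http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```

The response comes back as a standard chat-completion JSON object, so anything that can parse OpenAI responses can parse this.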

The Model Library

OpenLLM supports the latest open-source models:

| Model       | Parameters | GPU Required |
|-------------|------------|--------------|
| DeepSeek R1 | 671B       | 80G x16      |
| Llama 4     | 17B        | 80G x8       |
| Llama 3.3   | 70B        | 80G x2       |
| Qwen 2.5    | 7B         | 24G          |
| Phi 4       | 14B        | 80G          |
| Gemma 3     | 3B         | 12G          |

The full list lives in bentoml/openllm-models.

Use It Like OpenAI

Since OpenLLM exposes an OpenAI-compatible API, you can use it with any existing OpenAI tooling:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="na",  # any non-empty string works; auth isn't enforced locally
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

# With stream=True the call returns an iterator of chunks;
# print tokens as they arrive.
for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Works with LlamaIndex, LangChain, and any framework that supports OpenAI-compatible APIs.

Built-in Chat UI

OpenLLM includes a web chat interface at /chat. No need to build your own frontend for testing.

Cloud Deployment

For production, OpenLLM integrates with BentoCloud:

openllm deploy llama3.2:1b --env HF_TOKEN

This gives you autoscaling, observability, and managed infrastructure.

Custom Models

You can add your own models by creating a custom repository following the BentoML format. This makes OpenLLM extensible beyond the default model catalog.

Bonus: Know What Will Actually Run

Before you openllm serve, you should know what your hardware can handle. That’s where llmfit comes in — a companion tool that probes your system and tells you exactly which models will run.

curl -fsSL https://llmfit.axjns.dev/install.sh | sh
llmfit

llmfit detects your CPU, RAM, and GPU (NVIDIA, AMD, Intel Arc, Apple Silicon), then scores 157+ models across quality, speed, fit, and context dimensions. It handles:

  • MoE expert offloading — Mixtral 8x7B needs 6.6GB with offloading, not 23.9GB
  • Dynamic quantization — Picks the best quantization (Q8→Q2_K) that fits your RAM
  • Speed estimation — Tokens/sec before you pull the weights
  • Multi-GPU setups — Aggregates VRAM across all detected GPUs

The TUI shows a ranked table of what will run well on your machine. Filter by fit level (Perfect, Good, Marginal) or use case (Coding, Reasoning, Chat).
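The quantization-selection idea is easy to illustrate with a toy sketch. Everything below (the function name, the bytes-per-weight figures, the fixed overhead) is an illustrative assumption, not llmfit's actual scoring internals: estimate a model's footprint as parameters times bytes per weight for each quantization level, then pick the highest-quality level that fits the memory budget.

```python
# Toy sketch of a quantization-fit check. All names and numbers here are
# illustrative assumptions, not llmfit's real internals.

# Approximate bytes per weight for common GGUF quantization levels,
# ordered from highest quality to smallest footprint.
QUANT_BYTES_PER_WEIGHT = [
    ("Q8_0", 1.0),   # ~8 bits/weight
    ("Q5_K", 0.69),  # ~5.5 bits/weight
    ("Q4_K", 0.56),  # ~4.5 bits/weight
    ("Q2_K", 0.33),  # ~2.6 bits/weight
]

def best_fitting_quant(params_billions: float, budget_gb: float,
                       overhead_gb: float = 1.5):
    """Return the highest-quality quantization whose estimated footprint
    (weights + a fixed runtime/KV-cache overhead) fits the budget, or
    None if nothing fits."""
    for name, bytes_per_weight in QUANT_BYTES_PER_WEIGHT:
        est_gb = params_billions * bytes_per_weight + overhead_gb
        if est_gb <= budget_gb:
            return name
    return None

# A 7B model in 24 GB of VRAM fits comfortably at Q8_0 ...
print(best_fitting_quant(7, 24))   # Q8_0
# ... while a 70B model won't squeeze into 24 GB even at Q2_K.
print(best_fitting_quant(70, 24))  # None
```

The real tool layers speed estimates, MoE offloading, and multi-GPU VRAM aggregation on top of this kind of check, but the core question is the same: what is the best quality level that fits?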

Workflow: llmfit first to pick your model → openllm serve to run it.

GitHub: AlexsJones/llmfit

When to Use OpenLLM

Use OpenLLM when:

  • You want OpenAI API compatibility with open-source models
  • You need to self-host for privacy/compliance
  • You want quick prototyping with different models
  • You need cloud deployment with autoscaling

Consider alternatives when:

  • You just need local inference (Ollama might be simpler)
  • You need maximum performance tuning (vLLM directly)
  • You’re already deep in another serving framework

The Bottom Line

OpenLLM is the fastest path from “I want to try this open-source LLM” to “I have a production API.” The BentoML team has done the infrastructure work so you don’t have to.

GitHub: bentoml/OpenLLM


What’s your go-to setup for serving LLMs? Share your stack in the comments.