Shimmy: A 5MB Rust Binary That Makes Ollama Look Bloated

By Prahlad Menon

I’ve been running Ollama for local inference for over a year. It works. It’s fine. But “fine” is a low bar when you’re running models on hardware you own, and every megabyte of overhead is a megabyte you could give to your model’s context window.

Enter Shimmy — a local LLM inference server written in pure Rust that ships as a single 5MB binary. No Python runtime. No Docker container. No configuration files. Just download and run.

The Numbers That Matter

Let me put this in perspective:

Metric          Shimmy    Ollama
Binary size     ~5MB      ~200MB+
Startup time    ~100ms    Several seconds
Idle RAM        ~50MB     ~300MB+
Dependencies    Zero      Go runtime
Config files    None      Optional but common

These aren’t benchmarks from a lab. These are the differences you feel — especially on a Mac Mini or a small VPS where every resource counts.
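To make the "every megabyte counts" point concrete, here is a rough back-of-the-envelope sketch of how much extra context the idle-RAM gap could buy. The model shape (32 layers, 8 KV heads, head dimension 128, fp16 cache) is a hypothetical 8B-class configuration I picked for illustration, not a measurement of any particular model, and the function names are mine.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of key/value cache one token occupies (keys + values, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def extra_context_tokens(freed_bytes: int, per_token: int) -> int:
    """How many more tokens of context the freed memory can hold."""
    return freed_bytes // per_token


# Hypothetical model shape; substitute the values for the model you actually run.
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
freed = (300 - 50) * 1024 * 1024  # idle-RAM gap from the table, treated as MiB
print(per_token, extra_context_tokens(freed, per_token))  # 131072 2000
```

Under those assumptions, 250MB of reclaimed RAM is roughly 2,000 more tokens of context. Real numbers shift with the model architecture and cache precision.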

Why It Works as a Drop-in Replacement

Shimmy speaks the OpenAI API: not a “compatible-ish” variant, but the same /v1/chat/completions endpoint that every tool already uses. Point your OPENAI_BASE_URL at http://localhost:11435/v1 (or whatever address the server reports at startup), set the API key to literally anything (Shimmy ignores it), and your existing code just works:

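from openai import OpenAI

# Point the standard OpenAI client at the local Shimmy server.
client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")
resp = client.chat.completions.create(
    model="your-model-name",  # a name shown by `shimmy list`
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)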

This means Cursor, Continue.dev, and any VS Code extension or OpenAI SDK that lets you set a custom base URL all work without code changes. That’s the kind of compatibility that actually matters.
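For tools that don’t use the OpenAI SDK, the same call is just an HTTP POST. Below is a minimal standard-library sketch of what goes over the wire, assuming the server listens on 127.0.0.1:11435 as in the example above. The helper names build_chat_request and send are my own, and the model name is a placeholder.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 32) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key; the article notes the server ignores its value.
            "Authorization": "Bearer sk-local",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, send(build_chat_request("http://127.0.0.1:11435/v1", "your-model-name", "Hello")) should return the reply text. Keeping the builder separate from the sender makes the payload easy to inspect without a live server.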

Zero Configuration Isn’t Marketing Speak

Shimmy auto-discovers GGUF models from your Hugging Face cache, existing Ollama installations, and local directories. Run shimmy list and it shows you what’s available. Run shimmy serve and it picks a port, loads what it finds, and starts serving.

No YAML files. No model manifests. No Modelfile syntax to learn.
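To give a feel for what auto-discovery involves, here is a simplified sketch that walks a few directories looking for .gguf files. This is not Shimmy’s actual logic. The ~/models directory and the MODEL_DIRS environment variable are hypothetical names I made up, and the sketch does not cover Ollama’s store, which keeps models as hash-named blobs without a .gguf extension.

```python
import os
from pathlib import Path


def default_search_roots() -> list[Path]:
    """Candidate directories; these locations are assumptions for illustration."""
    home = Path.home()
    roots = [
        home / ".cache" / "huggingface" / "hub",  # default Hugging Face cache
        home / "models",                          # hypothetical local directory
    ]
    extra = os.environ.get("MODEL_DIRS", "")      # hypothetical env variable
    roots += [Path(p) for p in extra.split(os.pathsep) if p]
    return roots


def discover_gguf(roots: list[Path]) -> dict[str, Path]:
    """Map a model name (the file stem) to the first .gguf file found for it."""
    found: dict[str, Path] = {}
    for root in roots:
        if not root.is_dir():
            continue  # silently skip locations that don't exist
        for path in sorted(root.rglob("*.gguf")):
            found.setdefault(path.stem, path)
    return found
```

Calling discover_gguf(default_search_roots()) returns a name-to-path mapping, which is roughly the information a list command needs to print.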

Hot model swapping works too — request a different model in your API call and Shimmy loads it on the fly. On a machine with limited RAM, this is genuinely useful. You’re not pre-loading three models you might need; you’re loading what you need when you need it.
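The load-on-demand idea can be sketched as a cache with a single slot: the model named in the request is loaded only if it isn’t already resident, and the previous one is dropped first. This is my own illustration of the pattern, not Shimmy’s code. SingleSlotModelCache and the loader callback are hypothetical names.

```python
from typing import Callable, Generic, Optional, TypeVar

M = TypeVar("M")


class SingleSlotModelCache(Generic[M]):
    """Keep at most one model resident; swap when a request names another."""

    def __init__(self, loader: Callable[[str], M]) -> None:
        self._loader = loader  # e.g. a function that loads a GGUF file by name
        self._name: Optional[str] = None
        self._model: Optional[M] = None

    def get(self, name: str) -> M:
        if name != self._name or self._model is None:
            # Release the old model before loading, so peak memory stays
            # close to one model rather than two.
            self._name, self._model = None, None
            self._model = self._loader(name)
            self._name = name
        return self._model
```

Repeated requests for the same model hit the resident copy. A request for a different name pays the load cost once, which is the trade-off you accept on a RAM-constrained machine.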

The v1.9.0 GPU Story

The latest release bundles all GPU backends into the single binary: CUDA, Vulkan, and OpenCL on Linux and Windows; MLX on Apple Silicon. No separate downloads, no feature flag confusion, no “did I compile with the right backend?” debugging sessions.

Your GPU gets detected at runtime. It just works or it falls back to CPU. This is how it should have always been.
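The "detect at runtime, fall back to CPU" pattern is simple to sketch: try each backend’s probe in priority order and treat any failure as "not available". The probe functions here are stand-ins. How Shimmy actually detects devices is not something I’ve verified, so read this as the general shape of the idea.

```python
from typing import Callable, Sequence, Tuple

Probe = Tuple[str, Callable[[], bool]]


def pick_backend(probes: Sequence[Probe]) -> str:
    """Return the first backend whose probe succeeds, else "cpu"."""
    for name, is_available in probes:
        try:
            if is_available():
                return name
        except Exception:
            # A crashing probe (missing driver, broken library) counts as
            # "not available" instead of taking the whole server down.
            continue
    return "cpu"
```

For example, pick_backend([("cuda", has_cuda), ("vulkan", has_vulkan)]) with hypothetical has_cuda and has_vulkan probes returns "cpu" when neither probe reports a usable device.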

For larger models, there’s MoE (Mixture of Experts) support that splits a model between CPU and GPU. You can run 70B+ parameter models on consumer hardware by letting Shimmy decide the placement, and the --cpu-moe flag gives you control when you want it.
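One way to picture CPU/GPU splitting for MoE models: the large expert weights stay in system RAM, and everything else goes to the GPU until a VRAM budget runs out. The sketch below is a greedy toy version of that idea under my own assumptions. It is not Shimmy’s placement algorithm, and matching expert tensors by an "_exps" substring in the name is an assumed naming convention.

```python
def place_tensors(tensor_sizes: dict[str, int], vram_budget: int,
                  cpu_moe: bool = True) -> dict[str, str]:
    """Assign each tensor to "gpu" or "cpu".

    tensor_sizes maps tensor name -> size in bytes, in model order.
    """
    placement: dict[str, str] = {}
    used = 0
    for name, size in tensor_sizes.items():
        is_expert = "_exps" in name  # assumed naming pattern for expert weights
        if (cpu_moe and is_expert) or used + size > vram_budget:
            placement[name] = "cpu"
        else:
            placement[name] = "gpu"
            used += size
    return placement
```

The intuition is that only a few experts are active per token, so keeping them on the CPU costs less than keeping attention weights there. A real implementation would weigh bandwidth and per-layer cost instead of a simple greedy budget.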

What’s the Catch?

Shimmy is newer and smaller in community than Ollama. The model ecosystem around Ollama — the web UIs, the management tools, the integrations — is more mature. If you need Open WebUI or similar frontends, check compatibility first.

But here’s what I’ve found: most of my local LLM usage is API-driven. IDE completions, agent toolchains, scripts that call chat endpoints. For that workflow, Shimmy is strictly better. Lighter, faster, simpler.

The project is open source under a permissive license (I originally wrote MIT; the repo badges say Apache-2.0, so check the LICENSE file for the current terms), and the maintainer has made an explicit “free forever” commitment: no asterisks, no pivot-to-paid. At 5,000+ stars and growing, it’s past the “weekend project” stage.

Try It in 30 Seconds

# macOS Apple Silicon
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
./shimmy serve &
./shimmy list
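Because serve is backgrounded, a script that calls the API straight away can race the server’s startup. A small readiness check avoids that. This sketch assumes you know the host and port the server reported, and wait_for_port is a helper name of my own.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True once a TCP connection succeeds, False after the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.25):
                return True
        except OSError:
            time.sleep(0.05)  # server not accepting connections yet; retry
    return False
```

Call wait_for_port("127.0.0.1", port) with the reported port before sending the first request. A True result only means the socket is accepting connections, not that a model has finished loading.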

If you’re running local models and you haven’t tried Shimmy, you’re leaving performance on the table. It’s one of those tools where the engineering speaks for itself — small binary, fast startup, zero config, full compatibility. That’s the Rust promise delivered.