Claude Code, Fully Local: 122B Model, $0/Month, 65 Tokens/Second on a MacBook

By Prahlad Menon

The key insight everyone else missed: most local Claude Code setups use a proxy.

The proxy receives Claude Code’s API call, translates it into whatever format the local model server understands, forwards it, gets a response, translates it back, returns it to Claude Code. That translation layer is the bottleneck. It adds latency, complexity, and failure modes.

claude-code-local removes the proxy entirely. A ~200-line Python server speaks the Anthropic Messages API natively. Claude Code points at it and thinks it’s talking to Anthropic. There’s nothing between Claude Code and the model.
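
To make that concrete, here is a minimal sketch of the idea — not the repo's actual ~200-line server — showing a FastAPI handler that accepts the Anthropic Messages request shape at /v1/messages and answers in the Messages response shape, with generation delegated to mlx-lm. The naive turn-flattening and zeroed usage counts are simplifications; the real server also handles system prompts, streaming, and tool use.

# Minimal sketch — not the repo's server. Assumes: pip install fastapi uvicorn mlx-lm
# Run with: uvicorn server:app --port 8080
import os
import uuid

from fastapi import FastAPI
from mlx_lm import load, generate

app = FastAPI()

# Model is selected the same way the repo does it: via the MLX_MODEL variable.
model, tokenizer = load(os.environ.get("MLX_MODEL", "Qwen/Qwen3.5-122B-A10B-Instruct"))

def block_text(content):
    # Anthropic message content is either a plain string or a list of content blocks.
    if isinstance(content, str):
        return content
    return " ".join(b.get("text", "") for b in content if b.get("type") == "text")

@app.post("/v1/messages")
def messages(body: dict):
    # Naive prompt assembly for illustration only.
    prompt = "\n".join(
        f"{m['role']}: {block_text(m['content'])}" for m in body.get("messages", [])
    )
    text = generate(model, tokenizer, prompt=prompt,
                    max_tokens=body.get("max_tokens", 1024))
    # Reply in the Messages response shape Claude Code expects.
    return {
        "id": f"msg_{uuid.uuid4().hex}",
        "type": "message",
        "role": "assistant",
        "model": body.get("model", "local"),
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
        "usage": {"input_tokens": 0, "output_tokens": 0},
    }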

The result: 17.6 seconds per task. Not 133 seconds. Zero API fees. Zero cloud. Everything on your Mac.


The Model Roster

Three models, one setup; swap between them with an environment variable:

Model                                 | Speed     | Params    | Active | RAM    | Best for
🟢 Gemma 4 31B (abliterated)          | ~15 tok/s | 31B dense | 31B    | ~18 GB | Daily coding, fits a 32 GB Mac
🟠 Llama 3.3 70B (abliterated, 8-bit) | ~7 tok/s  | 71B dense | 71B    | ~75 GB | Hardest reasoning, full precision
šŸ”µ Qwen 3.5 122B (MoE)                | 65 tok/s  | 122B MoE  | 10B    | ~75 GB | Max throughput

The Qwen number deserves explanation. 122 billion parameters sounds like it should be impossibly slow. It isn’t, because it’s a Mixture-of-Experts model — only ~10 billion parameters activate per token. The GPU computes 10B worth of work, not 122B. You get the capability of a 122B model at the inference cost of a 10B model.

65 tokens per second on Apple Silicon.
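
A quick back-of-envelope check makes that figure plausible. Decoding on Apple Silicon is mostly memory-bandwidth bound: each token requires one pass over the active weights. The bandwidth and quantization numbers below are assumptions for a high-end M-series chip, not measurements from the repo:

# Rough decode ceiling for a MoE model — every number here is an assumption.
active_params = 10e9                     # ~10B parameters active per token
bytes_per_token = active_params * 4 / 8  # 4-bit quantized weights ≈ 5 GB read per token
bandwidth = 400e9                        # ~400 GB/s, M3 Max-class chip (assumed)

print(bandwidth / bytes_per_token)       # ≈ 80 tok/s ceiling; 65 tok/s fits under it

Re-run the same arithmetic for a dense 70B model at 8-bit (~70 GB read per token) and the ceiling drops into the single digits — consistent with the ~7 tok/s Llama figure in the roster above.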


The Architecture That Makes It Work

Claude Code
    │
    │ ANTHROPIC_BASE_URL=http://localhost:8080
    │ ANTHROPIC_API_KEY=local
    ▼
[~200-line Python server] ←── speaks Anthropic Messages API natively
    │
    │ MLX inference call (no translation)
    ▼
[Apple Silicon MLX runtime]
    │
    ā”œā”€ā”€ Gemma 4 31B   (~18 GB unified memory)
    ā”œā”€ā”€ Llama 3.3 70B (~75 GB unified memory)
    └── Qwen 3.5 122B (~75 GB unified memory)

The server handles the Anthropic API spec — system messages, user turns, assistant turns, streaming, tool use — and passes requests directly to MLX. No OpenAI-compatibility shim. No middleware. One direct path.
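
You can exercise that path without Claude Code at all by sending a raw Messages-shaped request to the local port. The model name and prompt here are just placeholders:

curl http://localhost:8080/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: local" \
  -d '{
        "model": "local",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Say hello from local inference."}]
      }'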


Quick Start

Requirements: Mac with 32–128 GB unified memory, Python 3.12+, Claude Code installed

# Clone
git clone https://github.com/nicedreamzapp/claude-code-local.git
cd claude-code-local

# Install MLX
pip install mlx-lm

# Start the server (Qwen 3.5 122B)
MLX_MODEL=Qwen/Qwen3.5-122B-A10B-Instruct \
  bash scripts/start-mlx-server.sh

# Point Claude Code at it
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=local

# Claude Code now runs locally
claude

For Llama 3.3 70B (the repo’s custom 8-bit abliterated build):

MLX_MODEL=divinetribe/Llama-3.3-70B-Instruct-abliterated-8bit-mlx \
  bash scripts/start-mlx-server.sh

Why MLX?

Apple’s MLX framework treats the entire unified memory pool as one addressable space. There’s no VRAM vs system RAM distinction — the model weights, activations, and KV cache all live in the same memory the CPU uses.

For 70B+ models, this is the only reason they’re runnable on a MacBook at all. A discrete GPU with 24 GB VRAM can’t hold 75 GB of weights. Apple Silicon with 128 GB unified memory can.

MLX also uses Metal shaders optimized specifically for Apple Silicon’s memory bandwidth characteristics — not a generic CUDA port, but purpose-built inference kernels.
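
The inference call itself is just the mlx-lm Python API. A minimal standalone use — model name taken from the Quick Start above, chat template applied the way instruct models expect — looks roughly like this:

# Standalone mlx-lm generation: weights, activations, and KV cache all live
# in the same unified memory pool as the rest of the system.
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3.5-122B-A10B-Instruct")
messages = [{"role": "user", "content": "Explain unified memory in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(generate(model, tokenizer, prompt=prompt, max_tokens=64))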


The Air-Gap Case

The repo was originally built for a specific use case: confidential document analysis that cannot touch cloud infrastructure.

The demo video shows a Llama 3.3 70B instance analyzing a real NDA with Wi-Fi physically off and lsof running live on screen to show that no network connections are open. The model weights, context, and outputs stay on the device.
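
You can reproduce that check on any Mac: turn Wi-Fi off, start a session, and list open sockets. This is plain lsof, not a script from the repo; an air-gapped run should show no entries for the server or Claude Code:

# -i: internet sockets only; -P/-n: numeric ports and addresses (no DNS lookups)
lsof -i -P -n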

The target users: lawyers, doctors, accountants, therapists, contractors — anyone handling other people’s private data who wants AI capability without cloud exposure.


The Complete Stack

Three repos combine for a full local-first AI workstation:

  • šŸ¤– claude-code-local (this repo) — the brain. Local model + Claude Code, zero cloud
  • šŸŽ¤ NarrateClaude — voice input and output, both on-device via Whisper + voice cloning
  • 🌐 browser-agent — drives a real Brave browser via CDP, handles iframes and Shadow DOM

The iMessage integration works across all three: send a task from your phone, the Mac runs it, response comes back to your phone. All traffic stays on your local network.


Benchmarks

These are measured numbers from the repo, not estimates:

Task                    | Cloud API | Local (Qwen 122B) | Local (Llama 70B)
SWE-bench Lite task     | ~17.6s    | 17.6s             | ~45s
Proxy-based local setup | —         | 133s              | 133s+
Token throughput        | ~60 tok/s | 65 tok/s          | 7 tok/s

The 17.6s-vs-133s gap comes entirely from removing the proxy: inference time is the same, and the translation overhead is gone.


Resources