Claude Code, Fully Local: 122B Model, $0/Month, 65 Tokens/Second on a MacBook
The key insight everyone else missed: most local Claude Code setups use a proxy.
The proxy receives Claude Code's API call, translates it into whatever format the local model server understands, forwards it, gets a response, translates it back, returns it to Claude Code. That translation layer is the bottleneck. It adds latency, complexity, and failure modes.
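For context, the sketch below shows roughly the kind of reshaping such a proxy performs on every request: an Anthropic Messages payload in, an OpenAI-style chat payload out, with the mirror translation again on the response. It is a simplified illustration of the pattern, not any particular proxy's code, and it ignores streaming events and tool use.

```python
# Roughly what a proxy's translation layer does on every request:
# reshape an Anthropic Messages call into an OpenAI-style chat call
# (and then translate the response back the other way).
def anthropic_to_openai(request: dict) -> dict:
    messages = []
    if request.get("system"):
        messages.append({"role": "system", "content": request["system"]})
    for m in request["messages"]:
        content = m["content"]
        if isinstance(content, list):  # Anthropic content blocks
            content = "".join(b.get("text", "") for b in content)
        messages.append({"role": m["role"], "content": content})
    return {
        "model": request.get("model", "local"),
        "messages": messages,
        "max_tokens": request.get("max_tokens", 1024),
        "stream": request.get("stream", False),
    }

if __name__ == "__main__":
    demo = {
        "system": "You are a coding assistant.",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Refactor this function."}],
    }
    print(anthropic_to_openai(demo))
```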
claude-code-local removes the proxy entirely. A ~200-line Python server speaks the Anthropic Messages API natively. Claude Code points at it and thinks it's talking to Anthropic. There's nothing between Claude Code and the model.
The result: 17.6 seconds per task. Not 133 seconds. Zero API fees. Zero cloud. Everything on your Mac.
The Model Roster
Three models, one setup, swap with an environment variable:
| Model | Speed | Params | Active | RAM | Best for |
|---|---|---|---|---|---|
| 🟢 Gemma 4 31B (abliterated) | ~15 tok/s | 31B dense | 31B | ~18 GB | Daily coding, fits a 32 GB Mac |
| 🟠 Llama 3.3 70B (abliterated, 8-bit) | ~7 tok/s | 71B dense | 71B | ~75 GB | Hardest reasoning, highest precision of the three |
| 🔵 Qwen 3.5 122B (MoE) | 65 tok/s | 122B MoE | 10B active | ~75 GB | Max throughput |
The Qwen number deserves explanation. 122 billion parameters sounds like it should be impossibly slow. It isn't, because it's a Mixture-of-Experts model: only ~10 billion parameters activate per token. The GPU computes 10B worth of work, not 122B. You get the capability of a 122B model at the inference cost of a 10B model.
65 tokens per second on Apple Silicon.
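A back-of-envelope check makes that figure plausible. Decode speed on Apple Silicon is mostly memory-bandwidth-bound: every generated token has to stream the active weights through the GPU. The sketch below assumes 4-bit weights (~0.5 bytes per parameter) and ~400 GB/s of unified-memory bandwidth, roughly an M3 Max; both numbers are assumptions for illustration, not figures from the repo.

```python
# Back-of-envelope decode-speed ceiling for a memory-bandwidth-bound model.
# Assumptions (not from the repo): 4-bit weights, ~400 GB/s memory bandwidth.
BYTES_PER_PARAM = 0.5          # 4-bit quantization
BANDWIDTH_GB_S = 400           # assumed unified-memory bandwidth

def max_tokens_per_second(active_params_billion: float) -> float:
    """Upper bound: every token must stream all active weights once."""
    bytes_per_token = active_params_billion * 1e9 * BYTES_PER_PARAM
    return BANDWIDTH_GB_S * 1e9 / bytes_per_token

print(f"122B dense : ~{max_tokens_per_second(122):.0f} tok/s ceiling")  # ~7
print(f"10B active : ~{max_tokens_per_second(10):.0f} tok/s ceiling")   # ~80
```

With only ~10B parameters streamed per token, the ceiling sits near 80 tok/s, so a measured 65 tok/s is believable; a dense 122B model on the same hardware would be capped around 7 tok/s.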
The Architecture That Makes It Work
Claude Code
    │
    │  ANTHROPIC_BASE_URL=http://localhost:8080
    │  ANTHROPIC_API_KEY=local
    ▼
[~200-line Python server] ◄── speaks Anthropic Messages API natively
    │
    │  MLX inference call (no translation)
    ▼
[Apple Silicon MLX runtime]
    │
    ├── Gemma 4 31B   (~18 GB unified memory)
    ├── Llama 3.3 70B (~75 GB unified memory)
    └── Qwen 3.5 122B (~75 GB unified memory)
The server handles the Anthropic API spec (system messages, user turns, assistant turns, streaming, tool use) and passes requests directly to MLX. No OpenAI-compatibility shim. No middleware. One direct path.
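To make the shape of that server concrete, here is a minimal stdlib-only sketch of the idea: accept Anthropic Messages-format requests on /v1/messages and run them straight through mlx_lm. This is an illustration of the architecture, not the repo's actual ~200-line implementation; error handling, streaming, and tool use are omitted, and the default model ID is just the one from the quick start below.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

from mlx_lm import load, generate  # pip install mlx-lm

# Model is selected with the same environment variable the quick start uses.
MODEL_ID = os.environ.get("MLX_MODEL", "Qwen/Qwen3.5-122B-A10B-Instruct")
model, tokenizer = load(MODEL_ID)  # weights load straight into unified memory


class MessagesHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/messages":
            self.send_error(404)
            return

        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))

        # Flatten the Anthropic-style request (system + messages) into a chat
        # list for the tokenizer's chat template. Message content may be a
        # string or a list of content blocks; keep only text blocks here.
        chat = []
        if body.get("system"):
            chat.append({"role": "system", "content": str(body["system"])})
        for m in body.get("messages", []):
            content = m["content"]
            if isinstance(content, list):
                content = "".join(b.get("text", "") for b in content)
            chat.append({"role": m["role"], "content": content})

        prompt = tokenizer.apply_chat_template(
            chat, tokenize=False, add_generation_prompt=True
        )
        text = generate(model, tokenizer,
                        prompt=prompt,
                        max_tokens=body.get("max_tokens", 1024))

        # Shape the reply like an Anthropic Messages response.
        reply = {
            "id": "msg_local",
            "type": "message",
            "role": "assistant",
            "model": MODEL_ID,
            "content": [{"type": "text", "text": text}],
            "stop_reason": "end_turn",
        }
        payload = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), MessagesHandler).serve_forever()
```

With ANTHROPIC_BASE_URL pointed at a process like this, Claude Code's requests never leave the machine.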
Quick Start
Requirements: Mac with 32–128 GB unified memory, Python 3.12+, Claude Code installed
# Clone
git clone https://github.com/nicedreamzapp/claude-code-local.git
cd claude-code-local
# Install MLX
pip install mlx-lm
# Start the server (Qwen 3.5 122B)
MLX_MODEL=Qwen/Qwen3.5-122B-A10B-Instruct \
bash scripts/start-mlx-server.sh
# Point Claude Code at it
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=local
# Claude Code now runs locally
claude
For Llama 3.3 70B (the repo's custom 8-bit abliterated build):
MLX_MODEL=divinetribe/Llama-3.3-70B-Instruct-abliterated-8bit-mlx \
bash scripts/start-mlx-server.sh
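To confirm the server is answering before (or without) Claude Code in the loop, a quick smoke test like the following can help. It is a hypothetical script, not part of the repo: it POSTs a minimal Anthropic-style request to the local endpoint using only the standard library and prints the reply text.

```python
import json
import urllib.request

# Minimal Anthropic Messages-style request against the local server.
req = urllib.request.Request(
    "http://localhost:8080/v1/messages",
    data=json.dumps({
        "model": "local",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    }).encode(),
    headers={"Content-Type": "application/json", "x-api-key": "local"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

# The Anthropic Messages format puts the generated text in content[0].text.
print(reply["content"][0]["text"])
```

If that prints a sentence, the plumbing works, and claude will be talking to the same endpoint.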
Why MLX?
Apple's MLX framework treats the entire unified memory pool as one addressable space. There's no VRAM vs system RAM distinction: the model weights, activations, and KV cache all live in the same memory the CPU uses.
For 70B+ models, this is the only reason they're runnable on a MacBook at all. A discrete GPU with 24 GB VRAM can't hold 75 GB of weights. Apple Silicon with 128 GB unified memory can.
MLX also uses Metal shaders optimized specifically for Apple Silicon's memory bandwidth characteristics: not a generic CUDA port, but purpose-built inference kernels.
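A quick footprint estimate shows why unified memory is the deciding factor. The quantization levels below (4-bit for the Gemma and Qwen builds, 8-bit for the Llama build) are assumptions for illustration only; the point is the order of magnitude of the weights, not the exact RAM totals, which also include KV cache and runtime overhead.

```python
# Approximate weight footprint = parameter count x bytes per parameter.
# Quantization levels are assumptions for illustration, not repo facts.
models = {
    "Gemma 4 31B   (4-bit)": (31e9,  0.5),
    "Llama 3.3 70B (8-bit)": (71e9,  1.0),
    "Qwen 3.5 122B (4-bit)": (122e9, 0.5),
}

DISCRETE_GPU_VRAM_GB = 24
UNIFIED_MEMORY_GB = 128

for name, (params, bytes_per_param) in models.items():
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.0f} GB weights | "
          f"fits 24 GB GPU: {gb <= DISCRETE_GPU_VRAM_GB} | "
          f"fits 128 GB Mac: {gb <= UNIFIED_MEMORY_GB}")
```

The 70B and 122B builds land in the 60-75 GB range: far past any consumer GPU's VRAM, but comfortably inside a 128 GB unified-memory Mac.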
The Air-Gap Case
The repo was originally built for a specific use case: confidential document analysis that cannot touch cloud infrastructure.
The demo video shows a Llama 3.3 70B instance analyzing a real NDA with Wi-Fi physically off and lsof running live on screen to show that no outside network connections are open. The model weights, context, and outputs stay on the device.
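For anyone who wants to rerun that check themselves, a small script along these lines (not from the repo) asks lsof whether a given process has any internet sockets open. The only line you should see for an air-gapped setup is the server's own listener on 127.0.0.1:8080, with no established connections to outside hosts.

```python
import subprocess
import sys

# Usage: python check_airgap.py <PID of the local model server>
# `lsof -i -a -p PID -n` lists internet sockets (-i) AND (-a) owned by that
# PID (-p), with numeric addresses (-n). Expect only the loopback listener.
pid = sys.argv[1]
result = subprocess.run(
    ["lsof", "-i", "-a", "-p", pid, "-n"],
    capture_output=True, text=True,
)
print(result.stdout or f"No internet sockets at all for PID {pid}.")
```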
The target users: lawyers, doctors, accountants, therapists, contractors, and anyone else handling other people's private data who wants AI capability without cloud exposure.
The Complete Stack
Three repos combine for a full local-first AI workstation:
- claude-code-local (this repo): the brain. Local model + Claude Code, zero cloud
- NarrateClaude: voice input and output, both on-device via Whisper + voice cloning
- browser-agent: drives a real Brave browser via CDP, handles iframes and Shadow DOM
An iMessage integration ties all three together: send a task from your phone, the Mac runs it, and the response comes back to your phone. All traffic stays on your local network.
Benchmarks
These are measured numbers from the repo, not estimates:
| Metric | Cloud API | Local (Qwen 122B) | Local (Llama 70B) |
|---|---|---|---|
| SWE-bench Lite task (direct server) | ~17.6s | 17.6s | ~45s |
| Same task via a proxy-based local setup | n/a | 133s | 133s+ |
| Token throughput | ~60 tok/s | 65 tok/s | 7 tok/s |
The 17.6s vs 133s gap comes entirely from removing the proxy: inference time is identical; only the translation overhead disappears.
Resources
- GitHub: github.com/nicedreamzapp/claude-code-local
- Llama 3.3 70B abliterated (custom 8-bit): huggingface.co/divinetribe/Llama-3.3-70B-Instruct-abliterated-8bit-mlx
- Reddit thread: r/ClaudeAI, "Running Claude Code fully offline"
- NarrateClaude: github.com/nicedreamzapp/NarrateClaude
- Browser agent: github.com/nicedreamzapp/browser-agent
- mac-code (related; more detailed Flash Streaming + Agent research): github.com/menonpg/mac-code