Claude Code, Fully Local: 122B Model, $0/Month, 65 Tokens/Second on a MacBook

By Prahlad Menon

The key insight everyone else missed: most local Claude Code setups use a proxy.

The proxy receives Claude Code’s API call, translates it into whatever format the local model server understands, forwards it, gets a response, translates it back, returns it to Claude Code. That translation layer is the bottleneck. It adds latency, complexity, and failure modes.

claude-code-local removes the proxy entirely. A ~200-line Python server speaks the Anthropic Messages API natively. Claude Code points at it and thinks it’s talking to Anthropic. There’s nothing between Claude Code and the model.
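
To make that concrete, here is a minimal sketch of the idea — not the repo's actual ~200-line server — showing a FastAPI handler that accepts the Anthropic Messages request shape at /v1/messages and answers in the Messages response shape, with generation delegated to mlx-lm. The naive turn-flattening and zeroed usage counts are simplifications; the real server also handles system prompts, streaming, and tool use.

# Minimal sketch — not the repo's server. Assumes: pip install fastapi uvicorn mlx-lm
# Run with: uvicorn server:app --port 8080
import os
import uuid

from fastapi import FastAPI
from mlx_lm import load, generate

app = FastAPI()

# Model is selected the same way the repo does it: via the MLX_MODEL variable.
model, tokenizer = load(os.environ.get("MLX_MODEL", "Qwen/Qwen3.5-122B-A10B-Instruct"))

def block_text(content):
    # Anthropic message content is either a plain string or a list of content blocks.
    if isinstance(content, str):
        return content
    return " ".join(b.get("text", "") for b in content if b.get("type") == "text")

@app.post("/v1/messages")
def messages(body: dict):
    # Naive prompt assembly for illustration only.
    prompt = "\n".join(
        f"{m['role']}: {block_text(m['content'])}" for m in body.get("messages", [])
    )
    text = generate(model, tokenizer, prompt=prompt,
                    max_tokens=body.get("max_tokens", 1024))
    # Reply in the Messages response shape Claude Code expects.
    return {
        "id": f"msg_{uuid.uuid4().hex}",
        "type": "message",
        "role": "assistant",
        "model": body.get("model", "local"),
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
        "usage": {"input_tokens": 0, "output_tokens": 0},
    }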

The result: 17.6 seconds per task. Not 133 seconds. Zero API fees. Zero cloud. Everything on your Mac.


The Model Roster

Three models, one setup; swap between them with an environment variable:

Model                                 | Speed     | Params    | Active | RAM    | Best for
🟢 Gemma 4 31B (abliterated)          | ~15 tok/s | 31B dense | 31B    | ~18 GB | Daily coding, fits a 32 GB Mac
🟠 Llama 3.3 70B (abliterated, 8-bit) | ~7 tok/s  | 71B dense | 71B    | ~75 GB | Hardest reasoning, full precision
šŸ”µ Qwen 3.5 122B (MoE)                | 65 tok/s  | 122B MoE  | 10B    | ~75 GB | Max throughput

The Qwen number deserves explanation. 122 billion parameters sounds like it should be impossibly slow. It isn’t, because it’s a Mixture-of-Experts model — only ~10 billion parameters activate per token. The GPU computes 10B worth of work, not 122B. You get the capability of a 122B model at the inference cost of a 10B model.

65 tokens per second on Apple Silicon.
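
A quick back-of-envelope check makes that figure plausible. Decoding on Apple Silicon is mostly memory-bandwidth bound: each token requires one pass over the active weights. The bandwidth and quantization numbers below are assumptions for a high-end M-series chip, not measurements from the repo:

# Rough decode ceiling for a MoE model — every number here is an assumption.
active_params = 10e9                     # ~10B parameters active per token
bytes_per_token = active_params * 4 / 8  # 4-bit quantized weights ≈ 5 GB read per token
bandwidth = 400e9                        # ~400 GB/s, M3 Max-class chip (assumed)

print(bandwidth / bytes_per_token)       # ≈ 80 tok/s ceiling; 65 tok/s fits under it

Re-run the same arithmetic for a dense 70B model at 8-bit (~70 GB read per token) and the ceiling drops into the single digits — consistent with the ~7 tok/s Llama figure in the roster above.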


The Architecture That Makes It Work

Claude Code
    │
    │ ANTHROPIC_BASE_URL=http://localhost:8080
    │ ANTHROPIC_API_KEY=local
    ▼
[~200-line Python server] ←── speaks Anthropic Messages API natively
    │
    │ MLX inference call (no translation)
    ▼
[Apple Silicon MLX runtime]
    │
    ā”œā”€ā”€ Gemma 4 31B   (~18 GB unified memory)
    ā”œā”€ā”€ Llama 3.3 70B (~75 GB unified memory)
    └── Qwen 3.5 122B (~75 GB unified memory)

The server handles the Anthropic API spec — system messages, user turns, assistant turns, streaming, tool use — and passes requests directly to MLX. No OpenAI-compatibility shim. No middleware. One direct path.
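
You can exercise that path without Claude Code at all by sending a raw Messages-shaped request to the local port. The model name and prompt here are just placeholders:

curl http://localhost:8080/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: local" \
  -d '{
        "model": "local",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Say hello from local inference."}]
      }'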


Quick Start

Requirements: Mac with 32–128 GB unified memory, Python 3.12+, Claude Code installed

# Clone
git clone https://github.com/nicedreamzapp/claude-code-local.git
cd claude-code-local

# Install MLX
pip install mlx-lm

# Start the server (Qwen 3.5 122B)
MLX_MODEL=Qwen/Qwen3.5-122B-A10B-Instruct \
  bash scripts/start-mlx-server.sh

# Point Claude Code at it
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=local

# Claude Code now runs locally
claude

For Llama 3.3 70B (the repo’s custom 8-bit abliterated build):

MLX_MODEL=divinetribe/Llama-3.3-70B-Instruct-abliterated-8bit-mlx \
  bash scripts/start-mlx-server.sh

Why MLX?

Apple’s MLX framework treats the entire unified memory pool as one addressable space. There’s no VRAM vs system RAM distinction — the model weights, activations, and KV cache all live in the same memory the CPU uses.

For 70B+ models, this is the only reason they’re runnable on a MacBook at all. A discrete GPU with 24 GB VRAM can’t hold 75 GB of weights. Apple Silicon with 128 GB unified memory can.

MLX also uses Metal shaders optimized specifically for Apple Silicon’s memory bandwidth characteristics — not a generic CUDA port, but purpose-built inference kernels.
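
The inference call itself is just the mlx-lm Python API. A minimal standalone use — model name taken from the Quick Start above, chat template applied the way instruct models expect — looks roughly like this:

# Standalone mlx-lm generation: weights, activations, and KV cache all live
# in the same unified memory pool as the rest of the system.
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3.5-122B-A10B-Instruct")
messages = [{"role": "user", "content": "Explain unified memory in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(generate(model, tokenizer, prompt=prompt, max_tokens=64))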


The Air-Gap Case

The repo was originally built for a specific use case: confidential document analysis that cannot touch cloud infrastructure.

The demo video shows a Llama 3.3 70B instance analyzing a real NDA with Wi-Fi physically off and lsof running live on screen to show that no network connections are open. The model weights, context, and outputs stay on the device.
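
You can reproduce that check on any Mac: turn Wi-Fi off, start a session, and list open sockets. This is plain lsof, not a script from the repo; an air-gapped run should show no entries for the server or Claude Code:

# -i: internet sockets only; -P/-n: numeric ports and addresses (no DNS lookups)
lsof -i -P -n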

The target users: lawyers, doctors, accountants, therapists, contractors — anyone handling other people’s private data who wants AI capability without cloud exposure.


The Complete Stack

Three repos combine for a full local-first AI workstation:

  • šŸ¤– claude-code-local (this repo) — the brain. Local model + Claude Code, zero cloud
  • šŸŽ¤ NarrateClaude — voice input and output, both on-device via Whisper + voice cloning
  • 🌐 browser-agent — drives a real Brave browser via CDP, handles iframes and Shadow DOM

The iMessage integration works across all three: send a task from your phone, the Mac runs it, response comes back to your phone. All traffic stays on your local network.


Benchmarks

These are measured numbers from the repo, not estimates:

Task                    | Cloud API | Local (Qwen 122B) | Local (Llama 70B)
SWE-bench Lite task     | ~17.6s    | 17.6s             | ~45s
Proxy-based local setup | —         | 133s              | 133s+
Token throughput        | ~60 tok/s | 65 tok/s          | 7 tok/s

The 17.6s-vs-133s gap comes entirely from removing the proxy: inference time is the same, and the translation overhead is gone.


Resources