Running a 1 Trillion Parameter Model on a MacBook Pro: The Kimi-K2 Story

By Prahlad Menon

The first token took 414 seconds. The throughput was 0.005 tokens per second.

By the end of a single 24-hour session, the same model was running at 1.7 tok/s on a MacBook Pro. 336x faster. Still answering “Paris.”

The model: Kimi-K2 — 1.029 trillion parameters. The hardware: an M4 Max MacBook Pro with 128GB unified memory. The engine: rustane, a Rust-native hybrid inference engine for Apple Neural Engine + Metal.

This is not a toy demo. This is what happens when someone actually debugs a trillion-parameter model on consumer hardware.

The Architecture That Makes It Possible

Kimi-K2 is a Mixture-of-Experts model. It has 384 experts per layer, but only 8 fire per token.

That means 97.9% of the expert weights sit idle every single token. They live on the SSD until the router calls them. Only the backbone — 23GB — stays in RAM.

Total weights: 1TB raw, 524GB converted to working format. Total active per token: about 2.1% of the model.

This is why a trillion-parameter model can run on a laptop. Not because the laptop is big enough to hold it — it isn’t. But because the model is designed to only use a tiny slice of itself at a time.
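The arithmetic behind that claim is simple enough to check. A minimal sketch, using the expert count and router top-k quoted above (the function name is illustrative, not rustane's API):

```rust
/// Fraction of expert weights that sit idle for a single token,
/// given the expert count per layer and the router's top-k.
fn idle_expert_fraction(experts_per_layer: u32, active_per_token: u32) -> f64 {
    1.0 - active_per_token as f64 / experts_per_layer as f64
}

fn main() {
    // Kimi-K2: 384 experts per layer, 8 routed per token.
    let idle = idle_expert_fraction(384, 8);
    println!("idle experts:   {:.1}%", idle * 100.0);
    println!("active experts: {:.1}%", (1.0 - idle) * 100.0);
}
```

Running this prints 97.9% idle and 2.1% active, matching the figures above (the true active fraction is slightly higher once the always-on backbone is counted in).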

Three Bugs, 336x Speedup

Getting from 0.005 tok/s to 1.7 tok/s required finding and fixing three separate bugs:

Bug 1: mmap’d 540GB instead of streaming. A "> 64" vs ">= 64" comparison mismatch caused the engine to memory-map the entire converted weights file instead of streaming experts on demand. The OS tried to hold 540GB in virtual memory. Everything ground to a halt.

Bug 2: Variable shadow discarded shared expert output. A variable shadowing issue silently threw away the shared expert computation — a structural component of the MoE forward pass. The model was running without part of its own architecture. It still answered questions (the routing experts carried it), but accuracy and coherence were degraded.

Bug 3: Metal shader read garbage for 43% of inputs. A Metal GPU shader was reading from the wrong memory offsets for nearly half of all inputs. Wrong data in, wrong activations out — and yet the model still produced plausible outputs. This is what makes LLM debugging surreal: the model can be broken in fundamental ways and still say “Paris.”

All three bugs fixed. 0.005 → 1.7 tok/s.
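Bug 2 deserves a closer look, because Rust makes it so easy to write: shadowing is legal and idiomatic, so rebinding a name silently discards the earlier value. A simplified, hypothetical reconstruction (the real forward pass combines tensors, not scalars):

```rust
/// Buggy MoE combine: the second `let out` shadows the first,
/// silently dropping the shared expert's contribution.
#[allow(unused_variables)] // silences the very lint that would have caught this
fn combine_buggy(routed: f32, shared: f32) -> f32 {
    let out = routed + shared; // shared expert added here...
    let out = routed;          // ...then discarded by this shadow
    out
}

/// Fixed combine: routed experts plus the always-on shared expert.
fn combine_fixed(routed: f32, shared: f32) -> f32 {
    routed + shared
}

fn main() {
    let (routed, shared) = (0.8_f32, 0.4_f32);
    assert_eq!(combine_buggy(routed, shared), routed); // shared output lost
    assert_eq!(combine_fixed(routed, shared), routed + shared);
}
```

Note that the buggy version still returns a plausible number. That is exactly why the model kept answering questions with part of its architecture missing.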

The Single Biggest Win: One Syscall

After the bugs were fixed, the largest single performance gain came from one flag:

fcntl(F_NOCACHE)

Every token, the engine reads 8 experts from SSD per layer — about 187MB of pread calls. Without F_NOCACHE, the OS page cache was absorbing all of that, evicting the shared backbone weights that need to stay hot in memory.

Setting F_NOCACHE routes expert reads directly to the SSD controller, bypassing the page cache entirely. The SSD handles locality. The backbone stays warm.

Result: +46% throughput. From 1.15 to 1.68 tok/s. One flag.
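The flag is set per file descriptor. A hedged sketch of how an engine might apply it (the `libc` crate exposes `F_NOCACHE` on Apple platforms only; the wrapper name and its no-op fallback are illustrative, not rustane's actual API):

```rust
use std::fs::File;
use std::io;

/// Ask the OS not to run reads on this file through the page cache.
/// F_NOCACHE exists only on Apple platforms; elsewhere this is a no-op.
fn set_no_cache(file: &File) -> io::Result<()> {
    #[cfg(target_os = "macos")]
    {
        use std::os::unix::io::AsRawFd;
        // fcntl(fd, F_NOCACHE, 1) returns -1 on failure.
        if unsafe { libc::fcntl(file.as_raw_fd(), libc::F_NOCACHE, 1) } == -1 {
            return Err(io::Error::last_os_error());
        }
    }
    #[cfg(not(target_os = "macos"))]
    let _ = file; // no direct portable equivalent; O_DIRECT is the Linux cousin
    Ok(())
}

fn main() -> io::Result<()> {
    let weights = File::open("/dev/null")?; // stand-in for the expert weights file
    set_no_cache(&weights)?; // subsequent pread calls bypass the page cache (macOS)
    Ok(())
}
```

After this call, every pread of expert weights goes straight to the SSD controller instead of competing with the backbone for page-cache residency.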

K2 vs V3: Why Fewer Heads Wins on Memory-Bound Hardware

The counterintuitive finding: Kimi-K2 runs faster than DeepSeek-V3 on the same hardware, despite having 1.49x more total parameters.

                    Kimi-K2      DeepSeek-V3
Attention heads     64           128
Backbone size       23GB         34GB
Free RAM (128GB)    91GB         50GB
Speed (M4 Max)      1.7 tok/s    slower

On memory-bound hardware, what matters is not how many parameters exist — it’s how many activate per token, how much stays in RAM, and how much bandwidth the active path consumes.

K2’s smaller backbone means more RAM headroom for the expert cache. Fewer attention heads means less KV cache pressure. The SSD-streaming architecture is the same, but K2’s router sends less data down the critical path per token.

Architecture matters more than parameter count. On memory-bound hardware, that’s the whole game.
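The KV-cache half of that argument is easy to quantify: per generated token, the cache grows by one key and one value vector per head per layer, so halving the head count (at the same head dimension) halves the per-token cache growth. A sketch with placeholder dimensions (head_dim = 128, 61 layers, fp16 — assumptions for illustration, not the published configs; both models actually use latent-attention compression, so real numbers are smaller, but the ratio is the point):

```rust
/// Bytes of KV cache appended per generated token, assuming plain
/// multi-head attention: one K and one V vector per head per layer.
fn kv_bytes_per_token(heads: u64, head_dim: u64, layers: u64, bytes_per_elem: u64) -> u64 {
    2 * heads * head_dim * layers * bytes_per_elem // 2 = key + value
}

fn main() {
    let (head_dim, layers, fp16) = (128, 61, 2); // hypothetical dims
    let k2 = kv_bytes_per_token(64, head_dim, layers, fp16);
    let v3 = kv_bytes_per_token(128, head_dim, layers, fp16);
    println!("K2: {} KiB/token, V3: {} KiB/token", k2 / 1024, v3 / 1024);
    assert_eq!(v3, 2 * k2); // half the heads, half the cache growth per token
}
```

On a machine where every spare gigabyte is either expert cache or KV cache, that factor of two compounds with the 11GB-smaller backbone.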

What rustane Actually Is

rustane is a Rust-native training and inference engine for Apple Neural Engine + Metal GPU. It uses reverse-engineered private ANE APIs (_ANEClient, _ANEInMemoryModel) to compile and evaluate MIL kernels directly on ANE hardware — no CoreML, no black-box scheduler.

Training pipeline validated from 48M to 5B parameters. Forward pass confirmed to 30B. All on M4 Max 128GB.

The K2 inference work builds on this foundation — the same direct-to-metal approach, applied to streaming trillion-parameter MoE inference from SSD.

The Takeaway

Trillion-parameter frontier models now run on hardware that fits in a backpack. Not at data center speeds. Not comfortably. But they run, they answer correctly, and with the right debugging and one-line optimizations, they run at usable speeds.

The SSD-streaming MoE pattern isn’t a hack — it’s the architecture working as designed. K2 was built to activate 2.1% of itself per token. Someone just had to make the hardware respect that contract.

GitHub: ncdrone/rustane


Frequently Asked Questions

Can you really run a 1 trillion parameter model on a MacBook? Yes — with caveats. Kimi-K2’s Mixture-of-Experts architecture means only 2.1% of parameters activate per token. The backbone (23GB) stays in RAM; experts (hundreds of GB) stream from SSD on demand. On an M4 Max with 128GB unified memory, this achieves 1.7 tokens per second — slow by server standards, but functional on consumer hardware.

What is Kimi-K2? Kimi-K2 is a 1.029 trillion parameter Mixture-of-Experts language model. It uses 384 experts per layer with 8 activated per token, meaning 97.9% of expert weights sit idle at any given moment. This sparse activation pattern is what makes local inference on consumer hardware feasible.

What is rustane? Rustane is an open-source Rust-native training and inference engine for Apple Silicon. It uses reverse-engineered private Apple Neural Engine APIs to run model computations directly on ANE hardware without CoreML. It supports training up to 5B parameters and forward pass inference up to 30B+ on M4 Max 128GB.

Why does fcntl(F_NOCACHE) improve LLM inference speed on Mac? When streaming expert weights from SSD during MoE inference, the OS page cache absorbs the reads and can evict frequently-needed backbone weights to make room. F_NOCACHE forces direct SSD reads that bypass the page cache, keeping the backbone weights in RAM while the SSD controller handles expert locality. In this case it produced a 46% throughput improvement.

Is Kimi-K2 faster than DeepSeek-V3 on Apple Silicon? In this benchmark, yes — despite having 1.49x more total parameters. K2 has a smaller backbone (23GB vs 34GB), fewer attention heads (64 vs 128), and leaves more RAM free for expert caching. On memory-bound hardware, the active parameter path and memory pressure matter more than raw parameter count.

How do you debug a trillion parameter model? Very carefully, and with patience for slow feedback loops. The K2 debugging session found three critical bugs — wrong memory mapping, variable shadowing, and Metal shader offset errors — all of which produced wrong or degraded results while the model still generated plausible text. LLMs are surprisingly robust to internal corruption, which makes silent bugs especially hard to catch.