RotorQuant: 10x KV Cache Compression That Beats Google's TurboQuant
If you’ve been following the KV cache compression space, you know Google’s TurboQuant (ICLR 2026) set the bar: 10x memory reduction with acceptable quality loss. It became the benchmark everyone measured against.
RotorQuant just beat it on every axis.
Same 10x compression. Better quality. 28% faster decode. 5x faster prefill. And 44x fewer parameters in the compression transform itself. It’s a drop-in for llama.cpp, and it works on both CUDA and Apple Silicon today.
Here’s what’s actually happening under the hood — and why the math behind it is more interesting than the benchmarks.
What Is the KV Cache and Why Does It Eat Your VRAM?
When an LLM processes a long context, it stores the key (K) and value (V) tensors from every attention head for every token it’s seen. This is the KV cache — it’s what lets the model not recompute attention from scratch on every new token.
The problem: it’s enormous. At FP16, a 128K context on an 8B model uses roughly 4.6 GB of VRAM — just for the cache, before the weights. Scale up to 32B parameters or 200K context and you’re looking at multi-GPU territory for what should be a single-card job.
KV cache quantization compresses those stored tensors from 16-bit floats down to 3 or 4 bits. The challenge is doing it without destroying the attention quality that the model depends on.
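For intuition, here’s a minimal sketch of what a symmetric 3-bit round trip looks like. It is not the fork’s actual kernel (llama.cpp quantizes in small groups with its own block formats), and the helper names are mine; it just shows the basic operation the rest of this post builds on.

```python
import numpy as np

def quantize_3bit(x):
    """Symmetric 3-bit quantization of one vector: floats mapped to integers in [-4, 3]."""
    scale = np.max(np.abs(x)) / 4.0                      # one scale per vector (real kernels use small groups)
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(128).astype(np.float32)              # one 128-dim key vector
q, s = quantize_3bit(x)
print(f"mean abs round-trip error: {np.abs(dequantize(q, s) - x).mean():.4f}")
```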
How TurboQuant Did It (And What It Got Wrong)
TurboQuant’s approach: before quantizing the K and V vectors, apply a Walsh-Hadamard Transform (WHT) — a butterfly network that scrambles the vector across all 128 dimensions. This decorrelates the coordinates, making them easier to quantize uniformly.
The problem with WHT: it’s a dense, global transform. Every element touches every other element. In dense d×d form that’s 16,384 floating-point multiply-adds for a 128-dimensional vector, and even the O(d log d) butterfly factorization still makes seven full passes over all 128 elements. The transform itself becomes a bottleneck, especially during prefill, where you’re processing thousands of tokens.
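For reference, here’s the textbook in-place fast Walsh-Hadamard transform. It’s a sketch of the kind of global butterfly involved, not TurboQuant’s actual kernel: each of the log2(128) = 7 passes touches all 128 elements.

```python
import numpy as np

def fwht(x):
    """In-place fast Walsh-Hadamard transform (unnormalized), O(d log d)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b   # butterfly: every pass mixes the whole vector
        h *= 2
    return x

v = np.random.randn(128)
print(np.allclose(fwht(fwht(v)) / 128, v))      # WHT is self-inverse up to a factor of d
```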
TurboQuant benchmarks on Llama 3.1 8B (3-bit):
- Decode: 93 tok/s
- Prefill: 722 tok/s
- PPL (Wikitext-2): 7.07
Those are the numbers RotorQuant is beating.
The RotorQuant Insight: You Don’t Need the Full Transform
The paper’s core insight: real attention vectors don’t occupy all 128 dimensions equally. They lie on low-rank manifolds. You don’t need a full-rank d×d transform to decorrelate them; small orthogonal blocks applied independently to pairs or quartets of dimensions are enough.
This comes from Clifford algebra (specifically Cl(3,0), the algebra of 3D space). A Clifford rotor is a rotation operator that acts via the sandwich product RxR̃ — it rotates a vector in a geometrically meaningful way using only 4 non-zero multivector components instead of 128.
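As a toy illustration of the sandwich product (the even part of Cl(3,0) is isomorphic to the quaternions), here is the standard construction for rotating a 3D vector by R x R̃. This is generic rotor math, not code from the RotorQuant fork.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def rotate(r, v):
    """Sandwich product R v R~: rotate 3D vector v by the unit rotor r (4 components)."""
    r_rev = r * np.array([1, -1, -1, -1])                  # reverse (conjugate) of the rotor
    return qmul(qmul(r, np.concatenate(([0.0], v))), r_rev)[1:]

theta = np.pi / 2
r = np.array([np.cos(theta / 2), 0, 0, np.sin(theta / 2)])  # 90-degree rotation about z
print(rotate(r, np.array([1.0, 0.0, 0.0])))                 # -> approximately [0, 1, 0]
```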
In practice, this led to three progressively simpler implementations:
| Method | Rotation | Params | FMAs (d=128) |
|---|---|---|---|
| RotorQuant | Cl(3,0) rotor | 372 | ~2,400 |
| IsoQuant | 4D quaternion | 128 | 512 |
| PlanarQuant | 2D Givens | 128 | 256 |
| TurboQuant | WHT butterfly | 16,384 | 16,384 |
Each step trades algebraic richness for speed. The surprising result: the simpler rotations win on PPL too, because block-diagonal rotation preserves the directional structure of KV cache vectors better than global WHT scrambling.
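To see why the block-diagonal versions are so cheap, here’s a sketch of the PlanarQuant idea as I read the table: one 2D Givens rotation per pair of dimensions, so 64 independent 2x2 rotations for d = 128 instead of one dense 128x128 transform. The fork’s actual layout and parameterization may differ.

```python
import numpy as np

d = 128
angles = np.random.randn(d // 2)          # one rotation angle per 2D plane (64 planes total)
cos, sin = np.cos(angles), np.sin(angles)

def planar_rotate(x):
    """64 independent 2x2 Givens rotations: 4 multiplies per pair, 256 total for d=128."""
    a, b = x[0::2], x[1::2]               # split the vector into (even, odd) coordinate pairs
    return np.stack([cos * a - sin * b, sin * a + cos * b], axis=1).reshape(-1)

def planar_unrotate(y):
    a, b = y[0::2], y[1::2]
    return np.stack([cos * a + sin * b, -sin * a + cos * b], axis=1).reshape(-1)

x = np.random.randn(d)
print(np.allclose(planar_unrotate(planar_rotate(x)), x))   # orthogonal, so exactly invertible
```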
The Benchmarks
At the same 10.3x compression ratio (3-bit), on Llama 3.1 8B:
| Config | Decode | Prefill | PPL |
|---|---|---|---|
| FP16 (baseline) | 140 tok/s | 6,156 tok/s | 6.63 |
| IsoQuant 3-bit | 119 tok/s | 3,822 tok/s | 6.91 |
| TurboQuant 3-bit | 93 tok/s | 722 tok/s | 7.07 |
IsoQuant at 3-bit vs TurboQuant at 3-bit:
- PPL: 6.91 vs 7.07 — better quality
- Decode: 119 vs 93 tok/s — 28% faster
- Prefill: 3,822 vs 722 tok/s — 5.3x faster
There’s also an asymmetric mode (planar3/f16) that compresses only the K cache (5x compression on the K side) with near-zero PPL loss: just +0.8% perplexity vs the FP16 baseline.
What This Means for Context Windows
At 10.3x compression:
| Context | FP16 KV | Compressed | Saved |
|---|---|---|---|
| 8K | 288 MB | 28 MB | 260 MB |
| 32K | 1,152 MB | 112 MB | 1.04 GB |
| 128K | 4,608 MB | 447 MB | 4.16 GB |
Running a 128K context at 32B scale was multi-GPU territory. With 10x KV compression, it’s approaching single 24GB card territory.
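The table above is just the per-token FP16 footprint it implies (288 MB over 8K tokens, about 36 KB per token) scaled linearly with context and divided by the 10.3x ratio, so you can estimate other context lengths the same way:

```python
# Reproduce the table from the per-token footprint implied by its 8K row
fp16_kb_per_token = 288 * 1024 / 8192            # = 36 KB of K+V per token at FP16
for ctx in (8_192, 32_768, 131_072):
    fp16_mb = fp16_kb_per_token * ctx / 1024
    compressed_mb = fp16_mb / 10.3
    print(f"{ctx // 1024}K: {fp16_mb:,.0f} MB -> {compressed_mb:,.0f} MB")
```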
Why the Prefill Speed Matters So Much
The 5x prefill speedup isn’t just a nice benchmark number — it changes the use case.
Prefill is what happens when you load a long document, a codebase, or a conversation history. For local LLMs doing RAG, document Q&A, or long-context reasoning, prefill latency is often the bottleneck, not decode speed.
TurboQuant’s WHT is applied during prefill for every token. At 722 tok/s, a 10,000-token document takes 14 seconds just to process. At RotorQuant’s 3,822 tok/s, that drops to under 3 seconds.
There’s also a deferred quantization trick: the K cache stays in FP16 during prefill (zero quantization error while processing the prompt), then tokens get quantized on insertion during decode. This gives better PPL than round-trip quantization and, counterintuitively, can even make decode faster than the FP16 baseline, because there’s no dequant overhead in flash attention.
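Here’s a minimal sketch of that flow as I understand the description; the DeferredKCache class below is a made-up illustration, not the fork’s actual cache-management code. Keys written during prefill stay FP16, and quantization only happens as new tokens are appended during decode.

```python
import numpy as np

class DeferredKCache:
    """Toy K-cache: prefill rows stay FP16; decode-time rows are quantized on insertion."""
    def __init__(self):
        self.fp16_rows = []      # prompt keys, kept at full precision
        self.quant_rows = []     # (int8 codes, scale) for keys appended during decode

    def prefill(self, k_prompt):
        self.fp16_rows = list(k_prompt.astype(np.float16))   # zero quantization error on the prompt

    def append_decode(self, k_new):
        scale = np.max(np.abs(k_new)) / 4.0                  # symmetric 3-bit, as in the earlier sketch
        q = np.clip(np.round(k_new / scale), -4, 3).astype(np.int8)
        self.quant_rows.append((q, scale))

cache = DeferredKCache()
cache.prefill(np.random.randn(10, 128))       # 10 prompt tokens stay FP16
cache.append_decode(np.random.randn(128))     # generated tokens get quantized as they arrive
```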
How to Use It Today
RotorQuant is a fork of llama.cpp with drop-in cache type flags. No model re-training, no format conversion.
```bash
# Build with CUDA
git clone https://github.com/johndpope/llama-cpp-turboquant.git
cd llama-cpp-turboquant && git checkout feature/planarquant-kv-cache
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j

# Apple Silicon
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
```
Best quality (3-bit symmetric):
```bash
./build/bin/llama-server -m model.gguf --jinja -ngl 99 \
  --cache-type-k iso3 --cache-type-v iso3 --host 0.0.0.0 --port 8080
```
Zero PPL loss (K-only, 5x compression):
```bash
./build/bin/llama-server -m model.gguf --jinja -ngl 99 \
  --cache-type-k planar3 --cache-type-v f16 --host 0.0.0.0 --port 8080
```
Benchmark your model:
```bash
./build/bin/llama-bench -m model.gguf -ngl 99 -ctk iso3 -ctv iso3 -p 512 -n 128
```
The needle-in-haystack tests pass at 8K, 32K, and 65K context — meaning the compression isn’t quietly destroying retrieval quality on long contexts.
What’s Still Honest to Say
A few things to keep in mind:
It’s a research fork, not a merged PR. RotorQuant isn’t in mainline llama.cpp yet. There’s an open vLLM feature request too. Production stability depends on whether these get merged upstream.
Benchmarks are on Llama 3.1 8B. The numbers may vary on different architectures, attention implementations, or model sizes. 3-bit compression on a 70B model may behave differently.
The 44x fewer parameters claim refers to the rotation transform (372 vs 16,384 parameters, per the table above), not the model itself. The model weights are unchanged.
Google’s official TurboQuant implementation is expected in Q2 2026 — the community fork it’s being compared against may not be the final form.
The Bigger Picture
Every few months, a compression technique lands that shifts what’s possible on consumer hardware. RotorQuant follows a pattern: KV cache quantization is getting better faster than model sizes are growing.
The math here is genuinely interesting — the realization that you don’t need a global transform to decorrelate attention vectors, that small orthogonal blocks preserve the structure that actually matters, and that simpler transforms can beat more sophisticated ones when the structure of the data doesn’t require the full machinery.
If this gets merged into mainline llama.cpp, running 32K context on a 13B model on a 16GB GPU stops being a workaround and starts being the default.
Repo: github.com/scrya-com/rotorquant
Paper: scrya.com/rotorquant.pdf