1-Bit Bonsai: A 290MB LLM Running at 100 Tokens/Second in Your Browser
Open the Bonsai WebGPU demo in Chrome. Type a prompt. Watch a 1.7B-parameter LLM generate text at ~100 tokens per second — entirely in your browser tab, no server involved.
The model is 290MB. That’s not a typo.
What PrismML Built
PrismML, a lab emerging from Caltech research, has released the Bonsai family of true 1-bit LLMs. Not post-training quantization, not mixed-precision with higher-precision escape hatches — 1-bit end-to-end across the entire network: embeddings, attention layers, MLP layers, and the LM head.
The lineup:
| Model | Size | Speed (M4 Pro) | Speed (RTX 4090) | Speed (Browser) |
|---|---|---|---|---|
| Bonsai 1.7B | 290MB | — | — | ~100 tok/s |
| Bonsai 8B | 1.15GB | 131 tok/s | 368 tok/s | — |
For context, a standard 16-bit 8B model is roughly 16GB and runs at a fraction of those speeds on the same hardware. Bonsai 8B is 14x smaller while remaining competitive on standard benchmarks.
Why 1-Bit Works Now
The idea of extreme quantization isn’t new — Microsoft’s BitNet paper showed the theoretical path in 2023. What’s new is making it practically deployable.
PrismML introduces “intelligence density” as their core metric: useful capability per gigabyte. By that measure, Bonsai 8B scores 1.06/GB versus Qwen3 8B’s 0.10/GB. That’s not an incremental improvement — it’s an order of magnitude.
The key insight: with weights constrained to a single bit, multiply-accumulates reduce to additions and subtractions. The speed gains follow because LLM inference is limited by memory bandwidth rather than arithmetic: moving roughly 14x less weight data per generated token translates almost directly into higher throughput, and at 290MB the 1.7B model's weights sit comfortably in memory on even modest devices.
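To make that concrete, here is a minimal sketch, in TypeScript rather than an optimized kernel, of a dot product against weights restricted to +1/-1. The packing scheme and function names are my own illustration of the arithmetic, not PrismML's implementation: each weight occupies one bit, and the inner loop contains only additions and subtractions.

```typescript
// Weights are packed one per bit: bit = 1 means +1, bit = 0 means -1.
function packSigns(weights: number[]): Uint8Array {
  const packed = new Uint8Array(Math.ceil(weights.length / 8));
  weights.forEach((w, i) => {
    if (w >= 0) packed[i >> 3] |= 1 << (i & 7);
  });
  return packed;
}

// Dot product of a float activation vector with 1-bit weights:
// no multiplications, only additions and subtractions.
function dotOneBit(activations: Float32Array, packed: Uint8Array): number {
  let acc = 0;
  for (let i = 0; i < activations.length; i++) {
    const positive = (packed[i >> 3] >> (i & 7)) & 1;
    acc += positive ? activations[i] : -activations[i];
  }
  return acc;
}

// Example: 8 activations against 8 one-bit weights.
const x = new Float32Array([0.5, -1.2, 0.3, 2.0, -0.7, 1.1, 0.0, -0.4]);
const w = packSigns([1, -1, -1, 1, 1, -1, 1, 1]);
console.log(dotOneBit(x, w)); // same result as a full multiply-accumulate
```

Production kernels vectorize this idea and fold in scaling factors, but the saving is the same: no multiplies, and a fraction of the weight traffic compared with 16-bit.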
The Browser Changes Everything
The 1.7B model running in WebGPU is the real headline. Consider what this enables:
Offline-first apps. An LLM that works without internet, fits in a browser cache, and runs at 100 tok/s. Customer support widgets, form assistants, content tools, all running client-side (see the caching sketch after these scenarios).
Privacy by architecture. Data never leaves the device. No API calls to intercept, no compliance headaches, no data processing agreements. For healthcare, legal, and enterprise apps, this is transformative.
Edge deployments. Unreliable connectivity? Doesn’t matter. The model is local. IoT devices, field tablets, retail kiosks — anywhere a browser runs.
Zero infrastructure cost. No GPU servers, no API bills, no rate limits. The user’s device is the inference engine.
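To ground the offline-first scenario, here is a rough sketch of the caching pattern using the browser's standard Cache Storage API. The model URL and the runBonsai() function are placeholders invented for this example, not PrismML's actual demo code.

```typescript
const MODEL_URL = "https://example.com/bonsai-1.7b.bin"; // placeholder URL

// Fetch the weights once, then serve them from Cache Storage on every
// later visit, including fully offline ones.
async function getModelBytes(): Promise<ArrayBuffer> {
  const cache = await caches.open("bonsai-model-v1");
  let response = await cache.match(MODEL_URL);
  if (!response) {
    // First visit: download ~290MB once, then never again.
    response = await fetch(MODEL_URL);
    await cache.put(MODEL_URL, response.clone());
  }
  return response.arrayBuffer();
}

// Placeholder for whatever WebGPU runtime actually executes the model.
async function runBonsai(weights: ArrayBuffer, prompt: string): Promise<string> {
  return `(generated locally from ${weights.byteLength} bytes of weights for: ${prompt})`;
}

async function answerLocally(prompt: string): Promise<string> {
  const weights = await getModelBytes(); // served from cache when offline
  return runBonsai(weights, prompt);     // inference never leaves the device
}
```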
The r/LocalLLaMA community is already pointing out the practical sweet spot: at 1.7B, you won’t get complex multi-step reasoning, but for classification, summarization, slot-filling, and extraction — tasks that make up the majority of production AI workloads — it’s more than sufficient.
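For a sense of what those narrow tasks look like in practice, here is a hedged sketch of client-side sentiment classification; generate() is a stand-in for whatever local completion function the page exposes, not a specific API.

```typescript
type Sentiment = "positive" | "negative" | "neutral";

// Constrain the model to a one-word answer and map it to a fixed label set,
// the kind of bounded task a small, fast local model handles well.
async function classify(
  review: string,
  generate: (prompt: string) => Promise<string>,
): Promise<Sentiment> {
  const prompt =
    `Classify the sentiment of this review as positive, negative, or neutral.\n` +
    `Review: "${review}"\nAnswer with one word:`;
  const raw = (await generate(prompt)).trim().toLowerCase();
  if (raw.startsWith("positive")) return "positive";
  if (raw.startsWith("negative")) return "negative";
  return "neutral"; // fall back to neutral on anything unexpected
}
```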
The 8B Model: Serious Capability
The larger Bonsai 8B is where things get genuinely competitive. At 1.15GB, it fits on an iPhone 17 Pro and runs at 44 tok/s. On desktop hardware, the numbers are striking:
- M4 Pro Mac: 131 tok/s (vs. ~15 tok/s for a 16-bit 8B model)
- RTX 4090: 368 tok/s
PrismML demonstrated a simulated agentic workload: 50 ticket summary-and-assignment tasks. Bonsai 8B completed all 50 in the time a standard 16-bit 8B model did 6. For agents that need sustained throughput over many steps, this isn’t just faster — it’s a different class of capability.
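As a rough illustration of why per-token speed compounds in agent loops (my own sketch, not PrismML's benchmark harness), the code below issues two short completions per ticket, so total latency scales with tokens per second across every step.

```typescript
interface Ticket { id: number; body: string; }
interface Triage { id: number; summary: string; assignee: string; }

// Summarize and route a batch of tickets with one local model call per step.
// generate() stands in for any local Bonsai-backed completion function.
async function triageAll(
  tickets: Ticket[],
  generate: (prompt: string) => Promise<string>,
): Promise<Triage[]> {
  const results: Triage[] = [];
  for (const ticket of tickets) {
    const summary = await generate(
      `Summarize this support ticket in one sentence:\n${ticket.body}`);
    const assignee = await generate(
      `Pick a team for this ticket (billing, infra, or product):\n${summary}`);
    results.push({ id: ticket.id, summary: summary.trim(), assignee: assignee.trim() });
  }
  return results;
}
```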
Energy consumption drops correspondingly. When you’re moving 14x less data through memory, power draw falls dramatically — making battery-powered and embedded deployments viable.
What This Means for Local AI
We’ve been tracking the local LLM ecosystem closely, and Bonsai represents a phase change. Previous quantization approaches (4-bit and 3-bit GPTQ/AWQ) compressed models by 4-5x with some quality loss. Bonsai compresses by 14x while staying competitive on benchmarks.
The implication: the hardware floor for running a useful LLM just dropped from “needs a decent GPU” to “needs a browser.” That’s a billion-device addressable market overnight.
There are real limitations — 1.7B won’t write your novel or debug complex code. But the vast majority of AI features people actually want in products (smart autocomplete, classification, summarization, extraction, simple Q&A) are well within reach of a model this size, running this fast, at zero marginal cost.
Try It
- Browser Demo — Run Bonsai 1.7B in your browser right now
- PrismML Announcement — Full technical details on Bonsai 8B
- Bonsai 8B GGUF — Download for local inference via llama.cpp/Ollama
The era of browser-native LLMs is here, and it’s faster than anyone expected.