1-Bit Bonsai: A 290MB LLM Running at 100 Tokens/Second in Your Browser
Open the Bonsai WebGPU demo in Chrome. Type a prompt. Watch a 1.7B-parameter LLM generate text at ~100 tokens per second — entirely in your browser tab, no server involved.
The model is 290MB. That’s not a typo.
What PrismML Built
PrismML, a lab emerging from Caltech research, has released the Bonsai family of true 1-bit LLMs. Not post-training quantization, not mixed-precision with higher-precision escape hatches — 1-bit end-to-end across the entire network: embeddings, attention layers, MLP layers, and the LM head.
The lineup:
| Model | Size | Speed (M4 Pro) | Speed (RTX 4090) | Speed (Browser) |
|---|---|---|---|---|
| Bonsai 1.7B | 290MB | — | — | ~100 tok/s |
| Bonsai 8B | 1.15GB | 131 tok/s | 368 tok/s | — |
For context, a standard 16-bit 8B model is roughly 16GB and runs at a fraction of those speeds on the same hardware. Bonsai 8B is 14x smaller while remaining competitive on standard benchmarks.
Why 1-Bit Works Now
The idea of extreme quantization isn’t new — Microsoft’s BitNet paper showed the theoretical path in 2023. What’s new is making it practically deployable.
PrismML introduces “intelligence density” as their core metric: useful capability per gigabyte. By that measure, Bonsai 8B scores 1.06/GB versus Qwen3 8B’s 0.10/GB. That’s not an incremental improvement — it’s an order of magnitude.
The key insight: with weights constrained to a single bit, multiply-accumulates reduce to additions and subtractions. The speed gains follow because LLM inference is limited by memory bandwidth rather than arithmetic: moving roughly 14x less weight data per generated token translates almost directly into higher throughput, and at 290MB the 1.7B model's weights sit comfortably in memory on even modest devices.
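To make that concrete, here is a minimal sketch, in TypeScript rather than an optimized kernel, of a dot product against weights restricted to +1/-1. The packing scheme and function names are my own illustration of the arithmetic, not PrismML's implementation: each weight occupies one bit, and the inner loop contains only additions and subtractions.

```typescript
// Weights are packed one per bit: bit = 1 means +1, bit = 0 means -1.
function packSigns(weights: number[]): Uint8Array {
  const packed = new Uint8Array(Math.ceil(weights.length / 8));
  weights.forEach((w, i) => {
    if (w >= 0) packed[i >> 3] |= 1 << (i & 7);
  });
  return packed;
}

// Dot product of a float activation vector with 1-bit weights:
// no multiplications, only additions and subtractions.
function dotOneBit(activations: Float32Array, packed: Uint8Array): number {
  let acc = 0;
  for (let i = 0; i < activations.length; i++) {
    const positive = (packed[i >> 3] >> (i & 7)) & 1;
    acc += positive ? activations[i] : -activations[i];
  }
  return acc;
}

// Example: 8 activations against 8 one-bit weights.
const x = new Float32Array([0.5, -1.2, 0.3, 2.0, -0.7, 1.1, 0.0, -0.4]);
const w = packSigns([1, -1, -1, 1, 1, -1, 1, 1]);
console.log(dotOneBit(x, w)); // same result as a full multiply-accumulate
```

Production kernels vectorize this idea and fold in scaling factors, but the saving is the same: no multiplies, and a fraction of the weight traffic compared with 16-bit.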
The Browser Changes Everything
The 1.7B model running in WebGPU is the real headline. Consider what this enables:
Offline-first apps. An LLM that works without internet, fits in a browser cache, and runs at 100 tok/s. Customer support widgets, form assistants, content tools, all running client-side (see the caching sketch after these scenarios).
Privacy by architecture. Data never leaves the device. No API calls to intercept, no compliance headaches, no data processing agreements. For healthcare, legal, and enterprise apps, this is transformative.
Edge deployments. Unreliable connectivity? Doesn’t matter. The model is local. IoT devices, field tablets, retail kiosks — anywhere a browser runs.
Zero infrastructure cost. No GPU servers, no API bills, no rate limits. The user’s device is the inference engine.
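To ground the offline-first scenario, here is a rough sketch of the caching pattern using the browser's standard Cache Storage API. The model URL and the runBonsai() function are placeholders invented for this example, not PrismML's actual demo code.

```typescript
const MODEL_URL = "https://example.com/bonsai-1.7b.bin"; // placeholder URL

// Fetch the weights once, then serve them from Cache Storage on every
// later visit, including fully offline ones.
async function getModelBytes(): Promise<ArrayBuffer> {
  const cache = await caches.open("bonsai-model-v1");
  let response = await cache.match(MODEL_URL);
  if (!response) {
    // First visit: download ~290MB once, then never again.
    response = await fetch(MODEL_URL);
    await cache.put(MODEL_URL, response.clone());
  }
  return response.arrayBuffer();
}

// Placeholder for whatever WebGPU runtime actually executes the model.
async function runBonsai(weights: ArrayBuffer, prompt: string): Promise<string> {
  return `(generated locally from ${weights.byteLength} bytes of weights for: ${prompt})`;
}

async function answerLocally(prompt: string): Promise<string> {
  const weights = await getModelBytes(); // served from cache when offline
  return runBonsai(weights, prompt);     // inference never leaves the device
}
```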
The r/LocalLLaMA community is already pointing out the practical sweet spot: at 1.7B, you won’t get complex multi-step reasoning, but for classification, summarization, slot-filling, and extraction — tasks that make up the majority of production AI workloads — it’s more than sufficient.
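For a sense of what those narrow tasks look like in practice, here is a hedged sketch of client-side sentiment classification; generate() is a stand-in for whatever local completion function the page exposes, not a specific API.

```typescript
type Sentiment = "positive" | "negative" | "neutral";

// Constrain the model to a one-word answer and map it to a fixed label set,
// the kind of bounded task a small, fast local model handles well.
async function classify(
  review: string,
  generate: (prompt: string) => Promise<string>,
): Promise<Sentiment> {
  const prompt =
    `Classify the sentiment of this review as positive, negative, or neutral.\n` +
    `Review: "${review}"\nAnswer with one word:`;
  const raw = (await generate(prompt)).trim().toLowerCase();
  if (raw.startsWith("positive")) return "positive";
  if (raw.startsWith("negative")) return "negative";
  return "neutral"; // fall back to neutral on anything unexpected
}
```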
The 8B Model: Serious Capability
The larger Bonsai 8B is where things get genuinely competitive. At 1.15GB, it fits on an iPhone 17 Pro and runs at 44 tok/s. On desktop hardware, the numbers are striking:
- M4 Pro Mac: 131 tok/s (vs. ~15 tok/s for a 16-bit 8B model)
- RTX 4090: 368 tok/s
PrismML demonstrated a simulated agentic workload: 50 ticket summary-and-assignment tasks. Bonsai 8B completed all 50 in the time a standard 16-bit 8B model did 6. For agents that need sustained throughput over many steps, this isn’t just faster — it’s a different class of capability.
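As a rough illustration of why per-token speed compounds in agent loops (my own sketch, not PrismML's benchmark harness), the code below issues two short completions per ticket, so total latency scales with tokens per second across every step.

```typescript
interface Ticket { id: number; body: string; }
interface Triage { id: number; summary: string; assignee: string; }

// Summarize and route a batch of tickets with one local model call per step.
// generate() stands in for any local Bonsai-backed completion function.
async function triageAll(
  tickets: Ticket[],
  generate: (prompt: string) => Promise<string>,
): Promise<Triage[]> {
  const results: Triage[] = [];
  for (const ticket of tickets) {
    const summary = await generate(
      `Summarize this support ticket in one sentence:\n${ticket.body}`);
    const assignee = await generate(
      `Pick a team for this ticket (billing, infra, or product):\n${summary}`);
    results.push({ id: ticket.id, summary: summary.trim(), assignee: assignee.trim() });
  }
  return results;
}
```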
Energy consumption drops correspondingly. When you’re moving 14x less data through memory, power draw falls dramatically — making battery-powered and embedded deployments viable.
What This Means for Local AI
We’ve been tracking the local LLM ecosystem closely, and Bonsai represents a phase change. Previous quantization approaches (4-bit and 3-bit GPTQ/AWQ) compressed models by 4-5x with some quality loss. Bonsai compresses by 14x while staying competitive on benchmarks.
The implication: the hardware floor for running a useful LLM just dropped from “needs a decent GPU” to “needs a browser.” That’s a billion-device addressable market overnight.
There are real limitations — 1.7B won’t write your novel or debug complex code. But the vast majority of AI features people actually want in products (smart autocomplete, classification, summarization, extraction, simple Q&A) are well within reach of a model this size, running this fast, at zero marginal cost.
Try It
- Browser Demo — Run Bonsai 1.7B in your browser right now
- PrismML Announcement — Full technical details on Bonsai 8B
- Bonsai 8B GGUF — Download for local inference via llama.cpp/Ollama
The era of browser-native LLMs is here, and it’s faster than anyone expected.