Cohere Transcribe Runs in Your Browser — 1 Hour of Audio in 100 Seconds, Completely Free
Speech recognition just had a quiet but significant moment. Cohere released a 2-billion-parameter transcription model — open-source, Apache 2.0 — that runs entirely in your browser using WebGPU. No installation. No API key. No audio ever leaves your machine.
1 hour of audio transcribed in 100 seconds. Locally. For free.
It currently sits at #1 on HuggingFace’s Open ASR Leaderboard for English accuracy, and matches or beats all existing open-source models across 13 other languages.
What Is Cohere Transcribe?
cohere-transcribe-03-2026 is Cohere’s first audio model. It’s a dedicated ASR (automatic speech recognition) model trained from scratch on 500,000 hours of curated audio-transcript pairs across 14 languages:
European: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish
East/Southeast Asian: Chinese (Mandarin), Japanese, Korean, Vietnamese
MENA: Arabic
The license is Apache 2.0 — meaning you can use it commercially, self-host it, build products with it, modify it. No strings.
The Architecture: Why It’s Fast
Most recent speech models take a shortcut: they grab a pre-trained text LLM and bolt audio understanding onto it. Models like Qwen3-ASR-1.7B and IBM Granite Speech work this way. It’s cheaper to train but slow to run — you’re doing full autoregressive inference through a giant text backbone just to get a transcript.
Cohere took the opposite approach. cohere-transcribe-03-2026 uses a Fast-Conformer encoder-decoder architecture:
- Conformer encoder — interleaves CNN and Transformer layers. CNNs handle local acoustic features (phonemes, rapid sound transitions). Transformers handle long-range linguistic context (sentence meaning, speaker intent). Interleaving them gives you both.
- Lightweight decoder — more than 90% of parameters live in the encoder. The decoder is deliberately small, minimizing autoregressive compute.
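The division of labor between the two layer types can be sketched in a few lines of numpy: a depthwise convolution mixes each frame with its immediate neighbors (local acoustic patterns), while self-attention lets every frame see every other frame (global context). This is a toy illustration of the principle, not the actual Fast-Conformer implementation; shapes and sizes are made up.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    # x: (T, d) feature frames; kernel: (k, d), one filter per channel.
    # Each output frame only sees a k-frame local window: local acoustics.
    k, d = kernel.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[t:t + k] * kernel).sum(axis=0) for t in range(x.shape[0])])

def self_attention(x):
    # Single-head self-attention: every frame attends to every other frame,
    # so information can flow across the whole utterance: global context.
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 8))        # 50 audio frames, 8-dim features
kernel = rng.standard_normal((5, 8)) * 0.1   # 5-frame local receptive field

# One interleaved "conformer-ish" step: local conv, then global attention.
h = depthwise_conv1d(frames, kernel)
h = self_attention(h)
print(h.shape)  # (50, 8): sequence length preserved, features mixed
```

The point of interleaving is that neither operation alone suffices: convolution can't connect a word at second 1 to context at second 20, and attention alone is a blunt instrument for fine-grained phoneme boundaries.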
The result: roughly 3x the throughput of similarly sized competitors. On the RTFx metric (inverse real-time factor — how many seconds of audio the model processes per second of compute; higher is better), Cohere Transcribe pulls ahead of every other 1B+ model at the same accuracy level.
The training data got serious attention too: 500K hours of curated audio-transcript pairs, synthetic augmentation guided by error analysis, noise augmentation across a 0–30 dB SNR range, a 16k multilingual BPE tokenizer trained in-distribution, and audio decontamination checks to prevent test/train overlap.
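To make "noise augmentation across a 0–30 dB SNR range" concrete: you scale a noise signal so that the ratio of speech power to noise power hits a target decibel value, then mix. A minimal numpy sketch of that idea (illustrative only — Cohere's actual augmentation pipeline is not public):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so speech-to-noise power ratio equals snr_db, then mix."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Target: p_speech / (scale^2 * p_noise) == 10^(snr_db / 10)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(42)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
noise = rng.standard_normal(16000)

noisy = mix_at_snr(speech, noise, snr_db=10)  # mild noise
harsh = mix_at_snr(speech, noise, snr_db=0)   # speech and noise equally loud
```

At 30 dB the noise is barely audible; at 0 dB it's as loud as the speech. Training across that whole range is what makes a model robust to real-world recordings.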
The WebGPU Browser Demo
The headline capability is the in-browser inference. Using WebGPU — the modern GPU compute API now available in Chrome and Edge — the model runs entirely client-side. Your audio never touches a server.
This is made possible by the same technology stack that’s been pushing on-device AI forward: WebGPU exposes GPU compute to web applications without requiring CUDA, Metal, or any local install. If your GPU supports WebGPU (most modern discrete GPUs do, and many integrated GPUs), you can run the model from a tab.
The practical implications:
- Privacy-sensitive transcription — medical notes, legal recordings, personal audio — stays on your device
- Zero cost — no API calls, no tokens, no billing
- No setup friction — share a link; recipients can transcribe immediately
The 100 seconds per hour figure is real: on a WebGPU-capable GPU, the model processes audio at roughly 36x real time.
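The arithmetic behind that figure is just the RTFx definition — audio duration divided by processing time:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second
    of wall-clock compute. Higher is faster; 1.0 means exactly real time."""
    return audio_seconds / processing_seconds

print(rtfx(3600, 100))  # 1 hour in 100 seconds -> 36.0
```

So a 10-minute meeting recording would transcribe in under 17 seconds on comparable hardware.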
How It Compares
| Model | Size | English accuracy | Languages | License | Browser |
|---|---|---|---|---|---|
| Cohere Transcribe | 2B | #1 Open ASR | 14 | Apache 2.0 | ✅ WebGPU |
| Whisper Large v3 | 1.5B | Strong | 99 | MIT | ❌ |
| Distil-Whisper | 756M | Good | 1 (EN) | MIT | ❌ |
| Qwen3-ASR-1.7B | 1.7B | Competitive | Multi | Apache 2.0 | ❌ |
Cohere wins on accuracy (English) and is competitive across its 14 languages. Whisper still wins on language breadth at 99 languages. The browser execution is unique to Cohere Transcribe for a model of this quality.
Self-Hosting and Production Use
For production, Cohere collaborated with the vLLM team to add native serving support — the PR is merged. That means you can serve Cohere Transcribe with the same open-source stack you’d use for any other LLM, with batching, concurrency management, and all the vLLM performance optimizations.
For quick local experiments, the standard transformers pipeline works:

```python
from transformers import pipeline

# Load the ASR pipeline on GPU (use device="cpu" if no CUDA device is available)
pipe = pipeline(
    "automatic-speech-recognition",
    model="CohereLabs/cohere-transcribe-03-2026",
    device="cuda",
)

result = pipe("your_audio.mp3")
print(result["text"])
```
Or use the model’s native transcribe() method, which handles long-form audio chunking automatically:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
transcript = model.transcribe("long_audio.wav")
```
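Long-form ASR methods like this typically work by splitting the waveform into overlapping windows, transcribing each, and stitching the results — the overlap gives the decoder context so words at chunk boundaries aren't cut in half. A hedged sketch of the chunking step (window and overlap sizes are illustrative, not the model's actual defaults):

```python
import numpy as np

def chunk_audio(samples, sr=16000, window_s=30.0, overlap_s=2.0):
    """Split a long waveform into overlapping windows for chunked ASR."""
    window = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)
    chunks = []
    for start in range(0, max(len(samples) - 1, 1), step):
        chunks.append(samples[start:start + window])
        if start + window >= len(samples):
            break  # this window already covers the end of the audio
    return chunks

audio = np.zeros(16000 * 95)  # 95 s of (silent) audio at 16 kHz
chunks = chunk_audio(audio)
print(len(chunks))  # 4 windows: starts at 0 s, 28 s, 56 s, 84 s
```

The merge step (deduplicating text in the overlapped regions) is the fiddly part that `transcribe()` saves you from writing yourself.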
The model is available on HuggingFace at CohereLabs/cohere-transcribe-03-2026, and also via Cohere’s Model Vault for managed enterprise deployment.
Why This Matters
A few threads converge here that are worth naming.
The “runs in the browser” moment is new. Until recently, state-of-the-art speech recognition meant cloud APIs — OpenAI Whisper API, Google Speech-to-Text, AWS Transcribe. You traded away privacy and paid per minute in exchange for accuracy. Cohere Transcribe at this quality level in a browser tab collapses that tradeoff.
WebGPU is quietly becoming a platform. What started as a web graphics API is now capable of running billion-parameter models. The same shift that happened with JavaScript (from toy scripting to production runtime) is happening with the browser as an inference environment.
Open source is winning ASR. A year ago, proprietary APIs were the obvious choice for production-quality transcription. Today, an Apache 2.0 model sits at #1 on the leaderboard, runs locally, and can be self-hosted with vLLM. The calculus for building on proprietary services has shifted.
For anyone building voice agents, transcription pipelines, medical documentation tools, or anything that touches audio — this is the model to evaluate first.
Links: