How fast is Mercury 2 compared to other reasoning models?

Inception reports 1,009 tokens/sec on NVIDIA Blackwell GPUs, which it says is 5x faster than speed-optimized models like Claude 4.5 Haiku and GPT-5.2 Mini. The comparison is not against slow reasoning models but against models already optimized for speed.

What makes Mercury 2 different from other reasoning LLMs?

It is the first reasoning-grade diffusion LLM available for production use. A diffusion architecture generates multiple tokens in parallel instead of one at a time like autoregressive models, which is what makes reasoning-grade quality possible inside real-time latency budgets.

What is Mercury 2's pricing?

$0.25/1M input tokens, $0.75/1M output tokens. 128K context. OpenAI API compatible — drop-in replacement with no rewrites needed.

What use cases is Mercury 2 best for?

Agentic loops (dozens of inference calls per task), real-time voice (the tightest latency budgets), coding and editing (autocomplete, refactors), and RAG pipelines (where multi-hop retrieval, reranking, and summarization latencies stack up quickly). In short, anywhere both reasoning and low latency matter.

How does Mercury 2 compare to LLaDA2.1?

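LLaDA2.1 is a research project from Ant Group (892 TPS) focused on training innovations and published as a research paper. Mercury 2 is a commercial product from Inception Labs (1,009 TPS) focused on production deployment and available through a live API. The two are complementary: LLaDA2.1 advances the science, and Mercury 2 packages diffusion for production.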

When should I use Mercury 2 vs o3 or DeepSeek-R1?

o3 and DeepSeek-R1 sit at the quality ceiling: maximum capability, latency be damned. Mercury 2 targets the middle ground: reasoning-grade quality at speed-tier latency. It won't replace o3 for the hardest problems, but if it delivers on its claims it could make reasoning viable for the 90% of production workloads where latency matters more than peak capability.

Mercury 2: The First Reasoning Diffusion LLM Is Live — And It's 5x Faster

By Prahlad Menon · Published 2026-03-02 · 4 min read

Reasoning models have a problem: they’re slow.

OpenAI’s o1 and o3. DeepSeek-R1. Anthropic’s extended thinking. They’re powerful — genuinely better at complex tasks — but that power comes from test-time compute. More thinking steps. Longer chains. More retries. And all of it sequential, one token at a time.

For a single query, the latency is tolerable. For production systems running thousands of agentic loops, RAG pipelines, and real-time voice interactions? The cost compounds fast.

Mercury 2 from Inception Labs claims to break that tradeoff. It’s the first reasoning-grade diffusion LLM available for production use — and the benchmarks are striking.

The Numbers

Speed: 1,009 tokens/sec on NVIDIA Blackwell GPUs
Pricing: $0.25/1M input tokens · $0.75/1M output tokens
Context: 128K tokens
Features: Tunable reasoning, native tool use, schema-aligned JSON output
API: OpenAI-compatible — drop-in replacement, no rewrites

That speed claim needs context. Inception says Mercury 2 is 5x faster than leading speed-optimized models like Claude 4.5 Haiku and GPT-5.2 Mini. Not 5x faster than reasoning models (which would be easy) — 5x faster than models already optimized for speed.
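
To put those figures in concrete terms, here is a back-of-the-envelope sketch in Python. It uses only the prices and throughput quoted above. The 2,000-token prompt and 500-token reply are made-up example sizes, the function names are ours, and the "baseline" rate is simply 1,009 divided by 5, which is what the 5x claim implies rather than a measured number.

```python
# Vendor-reported figures quoted in this article.
INPUT_USD_PER_MTOK = 0.25
OUTPUT_USD_PER_MTOK = 0.75
MERCURY_TPS = 1009.0
# Assumption: "5x faster" implies the comparison models run at one fifth the rate.
BASELINE_TPS = MERCURY_TPS / 5


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Decode time only; ignores network, queueing, and prompt processing."""
    return output_tokens / tokens_per_sec


if __name__ == "__main__":
    # Hypothetical request: 2,000 prompt tokens, 500 generated tokens.
    print(f"cost: ${request_cost_usd(2000, 500):.6f}")
    print(f"Mercury 2: {generation_seconds(500, MERCURY_TPS):.2f} s")
    print(f"implied baseline: {generation_seconds(500, BASELINE_TPS):.2f} s")
```

At these sizes the request costs $0.000875, well under a tenth of a cent, and the decode time works out to roughly 0.50 s versus 2.48 s for the implied baseline. Real end-to-end latency will be higher once network and prompt processing are included.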

Why Diffusion Changes Reasoning Economics

Standard autoregressive LLMs generate text sequentially. Token by token. Left to right. Like a typewriter that can’t look back.

Diffusion LLMs work differently. They start with a canvas of masked tokens and refine everything in parallel — multiple tokens simultaneously, converging over a small number of steps. Less typewriter, more editor revising a full draft at once.
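
The difference in the number of model passes can be illustrated with a toy simulation. This is not Inception's algorithm: the article's sources do not describe how Mercury 2 picks which tokens to commit or how many steps it uses, so `tokens_per_step`, the random choice of positions, and both function names are assumptions made purely for illustration. A real diffusion model would predict the tokens rather than copy them from a known target.

```python
import random

MASK = "_"


def autoregressive_passes(target: list[str]) -> int:
    """Sequential decoding: one forward pass per generated token."""
    output: list[str] = []
    passes = 0
    for token in target:
        output.append(token)  # stand-in for predicting the next token
        passes += 1
    return passes


def parallel_unmask_passes(target: list[str], tokens_per_step: int, seed: int = 0) -> int:
    """Toy masked refinement: start fully masked, commit several positions per pass."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    masked = list(range(len(target)))
    passes = 0
    while masked:
        rng.shuffle(masked)  # assumption: positions are chosen at random
        commit, masked = masked[:tokens_per_step], masked[tokens_per_step:]
        for i in commit:
            canvas[i] = target[i]  # stand-in for the model's prediction
        passes += 1
    assert canvas == target
    return passes


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    print("sequential passes:", autoregressive_passes(sentence))
    print("parallel passes:", parallel_unmask_passes(sentence, tokens_per_step=4))
```

For the nine-word sentence, sequential decoding needs nine passes, while committing four positions per pass needs three. Each parallel pass may cost more than a single autoregressive step, so the pass count is only part of the picture.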

This architectural difference has always promised speed. But previous diffusion LLMs (like LLaDA2.1, which we covered recently) focused on general-purpose generation. Mercury 2 is the first to target reasoning specifically.

The implication: reasoning-grade quality inside real-time latency budgets.

Today, if you want better reasoning, you pay for it with more test-time compute — longer chains, more samples, more retries. Mercury 2 argues that diffusion architecture can deliver that quality without the latency tax.

Where This Matters Most

Inception is positioning Mercury 2 for latency-sensitive production workloads. The customer quotes it has published show the target use cases:

Agentic Loops

Agentic workflows chain dozens of inference calls per task. Each call’s latency compounds. If you can cut latency per call significantly, you can either deliver faster results or afford more steps — more tool calls, more verification, more sophisticated reasoning.

“Mercury 2 is at least twice as fast as GPT-5.2, which is a game changer for us.” — Suchintan Singh, CTO, Skyvern
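
The compounding effect is easy to sketch. The helper names, the 30-call task, the 300 tokens per call, and the 10-second budget below are all hypothetical. The only figures taken from this article are the 1,009 tokens/sec rate and the 5x ratio, which implies a baseline of about 202 tokens/sec.

```python
MERCURY_TPS = 1009.0
BASELINE_TPS = MERCURY_TPS / 5  # assumption: implied by the vendor's "5x faster" claim


def task_seconds(calls: int, tokens_per_call: int, tokens_per_sec: float,
                 overhead_sec_per_call: float = 0.0) -> float:
    """Total generation time for a chain of sequential inference calls."""
    return calls * (tokens_per_call / tokens_per_sec + overhead_sec_per_call)


def calls_within_budget(budget_sec: float, tokens_per_call: int, tokens_per_sec: float,
                        overhead_sec_per_call: float = 0.0) -> int:
    """How many sequential calls fit inside a fixed time budget."""
    per_call = tokens_per_call / tokens_per_sec + overhead_sec_per_call
    return int(budget_sec // per_call)


if __name__ == "__main__":
    # Hypothetical agent task: 30 calls, 300 generated tokens each.
    for name, tps in [("Mercury 2", MERCURY_TPS), ("implied baseline", BASELINE_TPS)]:
        total = task_seconds(30, 300, tps)
        fits = calls_within_budget(10.0, 300, tps)
        print(f"{name}: {total:.1f} s for 30 calls, {fits} calls fit in 10 s")
```

Under these assumptions the 30-call task takes about 8.9 s of generation time instead of about 44.6 s. Holding a 10-second budget fixed, 33 calls fit instead of 6, which is the "afford more steps" side of the tradeoff.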

Real-Time Voice

Voice interfaces have the tightest latency budgets in AI. Users expect natural conversation cadence. Any pause breaks the illusion.

“Mercury 2 has been a big unlock in our voice stack: fast, consistent text generation that keeps the whole experience feeling natural and human.” — Max Sapo, CEO, Happyverse AI

Coding and Editing

Autocomplete, next-edit suggestions, refactors, interactive code agents — anywhere the developer is in the loop and waiting breaks flow.

“Suggestions land fast enough to feel like part of your own thinking, not something you have to wait for.” — Max Brunsfeld, Co-Founder, Zed

RAG Pipelines

Multi-hop retrieval, reranking, and summarization latencies stack fast. Adding reasoning to search has historically meant blowing your latency budget. Mercury 2 aims to change that math.

How It Compares to LLaDA

We recently covered LLaDA2.1 from Ant Group — another diffusion LLM hitting 892 tokens/sec with an innovative draft-and-edit mechanism.

The projects are complementary, not competing:

Aspect	LLaDA2.1	Mercury 2
Origin	Ant Group research	Inception Labs (commercial)
Focus	Training innovations (M2T + T2T)	Production deployment
Speed	892 TPS	1,009 TPS
Availability	Research paper	Live API, OpenAI-compatible
Target	General generation	Reasoning workloads

LLaDA2.1 advances the science of how diffusion LLMs can be trained for speed and quality. Mercury 2 packages diffusion into a production-ready reasoning system you can deploy today.

The Bigger Picture

The reasoning model landscape is fragmenting by deployment constraint:

Quality ceiling: o1, o3, DeepSeek-R1 — maximum capability, latency be damned
Speed floor: Haiku, GPT-4o-mini — fast enough for production, reasoning limited
Middle ground: Mercury 2 — reasoning-grade quality at speed-tier latency

If Mercury 2 delivers on its claims, it doesn’t replace o3 for the hardest problems. But it could make reasoning viable for the 90% of production workloads where latency matters more than peak capability.

The OpenAI API compatibility is strategic. No migration cost. No rewrite. Point your existing code at a new endpoint and test whether the speed-quality tradeoff works for your use case.

Getting Started

Mercury 2 is live now at inceptionlabs.ai. The API is OpenAI-compatible, so integration is straightforward:

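from openai import OpenAI

# Point the standard OpenAI client at Inception's endpoint.
client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",
    api_key="your-inception-key",  # placeholder: replace with your own key
)

# Same chat-completions call you would make against OpenAI.
response = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Your prompt here"}],
)

# Print the text of the first completion choice.
print(response.choices[0].message.content)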

For enterprise evaluations, Inception offers workload fit analysis and performance validation under expected serving constraints.
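
For a quicker, do-it-yourself check, one option is to time a request and divide the number of generated tokens by the elapsed time. The sketch below takes any OpenAI-style client. `measure_tokens_per_sec` is our own helper name, and it assumes the endpoint populates `usage.completion_tokens` the way OpenAI's API does. Because the timer wraps the whole HTTP round trip, the result understates raw decode speed, especially for short replies.

```python
import time


def measure_tokens_per_sec(client, model: str, prompt: str) -> float:
    """Time one chat completion and return generated tokens per wall-clock second."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    # Assumption: the compatible endpoint reports token usage like OpenAI does.
    return response.usage.completion_tokens / elapsed
```

Running the same function against your current provider and against Mercury 2 with identical prompts gives a like-for-like comparison on your own workload.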

Diffusion LLMs have been “almost ready” for years. Mercury 2 is the first to ship a reasoning-focused model at production scale with production pricing. Whether it holds up under real workloads remains to be seen — but the architecture finally matches the ambition.

The typewriter model of text generation may finally have competition.

Related: LLaDA2.1: The Diffusion LLM That Hits 892 Tokens Per Second — the research innovations making this possible.