How fast is LLaDA2.1 compared to autoregressive models?

LLaDA2.1 reaches 892 tokens per second (TPS) on HumanEval+ with the quantized 100B Flash model. That is roughly 3.5x to 3.7x faster than comparable autoregressive models such as Ling-flash-2.0 (257 TPS) and Qwen3-30B-A3B (240 TPS).

What is LLaDA2.1's draft-and-edit paradigm?

It combines two operations during generation. Mask-to-Token (M2T) is drafting: filling masked positions with predicted tokens. Token-to-Token (T2T) is editing: swapping already-placed tokens for better ones. Together they let the model go back and fix its own mistakes.

Why do diffusion LLMs have a speed advantage?

Autoregressive models generate sequentially (one token at a time). Diffusion models start with masked tokens and fill them in parallel — like a photograph developing where the entire image sharpens at once. This breaks the sequential speed ceiling.

What are LLaDA2.1's Speedy vs Quality modes?

Both drafting (M2T) and editing (T2T) are governed by confidence thresholds. Speedy Mode uses low thresholds: draft fast, fix later, optimized for throughput. Quality Mode uses conservative thresholds: fewer tokens per step but higher accuracy. Same model, two operational personalities.

Does LLaDA2.1's editing hurt quality?

No — in Quality Mode, LLaDA2.1 surpasses LLaDA2.0 benchmark averages for both model sizes. Editing doesn't just enable speed; it actually improves output quality by allowing error correction.

How did Ant Group train LLaDA2.1 for editing?

Three stages: continual pre-training with dual objectives (predict masked + correct noisy tokens), supervised fine-tuning with Multi-Turn Forward training, and the first large-scale RL framework for diffusion LLMs using ELBO-based Block-level Policy Optimization.

LLaDA2.1: The Diffusion LLM That Hits 892 Tokens Per Second

By Prahlad Menon. Published 2026-03-02.

Every large language model you use today generates text one token at a time. Left to right. Like a typewriter that never looks back. This autoregressive approach works, but it has a fundamental speed ceiling.

Diffusion language models (dLLMs) take a different path. Instead of writing sequentially, they start with a canvas of masked tokens and fill everything in parallel — like a photograph developing in a darkroom, the entire image sharpening at once.

The catch? When you fill in many words simultaneously, some clash. And in previous diffusion models, once a token was placed, it was frozen forever. Errors locked in. No way back.

LLaDA2.1 from Ant Group (February 2026) tackles this with a deceptively simple idea: let the model go back and edit its own mistakes.

The Breakthrough: Draft-and-Edit

Standard diffusion LLMs suffer from “exposure bias.” An early mistake poisons downstream context. The model sees its own flawed output, loses confidence, and slows down. Imagine a writer who makes a typo in paragraph one and then hesitates on every sentence afterward — but cannot scroll up and fix it.

LLaDA2.1 introduces two operations during generation:

Mask-to-Token (M2T): The standard move. A masked position gets filled with a predicted token. This is drafting.
Token-to-Token (T2T): The new move. An already-placed token gets swapped for a better one. This is editing.

Both operations are governed by confidence thresholds. This creates a configurable speed-quality dial:

Speedy Mode: Low thresholds. Draft fast, fix later. Optimized for throughput.
Quality Mode: Conservative thresholds. Fewer tokens per step, but higher accuracy. Editing remains a safety net.

Same model, two operational personalities. Choose speed for code generation. Choose quality for complex reasoning.
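
To make the dial concrete, here is a minimal sketch of a single draft-and-edit step. It is not the authors' decoder: the `Thresholds` class, the `decode_step` function, the `MASK` marker, and the specific threshold values are all hypothetical, and the `predictions` list stands in for one forward pass of the model.

```python
from dataclasses import dataclass

MASK = None  # hypothetical marker for a position that is still masked


@dataclass(frozen=True)
class Thresholds:
    m2t: float  # minimum confidence to fill a masked position (draft)
    t2t: float  # minimum confidence to overwrite a placed token (edit)


# Illustrative values only; the article does not give the real thresholds.
SPEEDY = Thresholds(m2t=0.5, t2t=0.7)
QUALITY = Thresholds(m2t=0.9, t2t=0.95)


def decode_step(tokens, predictions, thresholds):
    """Apply one draft-and-edit step.

    tokens: current sequence, with MASK where nothing is placed yet.
    predictions: one (best_token, confidence) pair per position,
        standing in for the output of a single forward pass.
    Returns (new_tokens, n_drafted, n_edited).
    """
    new_tokens = list(tokens)
    drafted = edited = 0
    for i, (current, (best, conf)) in enumerate(zip(tokens, predictions)):
        if current is MASK:
            # Mask-to-Token: fill the blank if the model is confident enough.
            if conf >= thresholds.m2t:
                new_tokens[i] = best
                drafted += 1
        elif best != current and conf >= thresholds.t2t:
            # Token-to-Token: replace a placed token with a better one.
            new_tokens[i] = best
            edited += 1
    return new_tokens, drafted, edited


tokens = ["def", "add", MASK, MASK]
preds = [("def", 0.99), ("sum", 0.96), ("(", 0.6), ("a", 0.4)]

print(decode_step(tokens, preds, SPEEDY))   # (['def', 'sum', '(', None], 1, 1)
print(decode_step(tokens, preds, QUALITY))  # (['def', 'sum', None, None], 0, 1)
```

With the same predictions, the lower thresholds commit one extra token in this step, while both settings still apply the high-confidence edit. That is the tradeoff in miniature: more tokens per step in exchange for more drafts that may need fixing later.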

The Numbers

Three results capture the impact:

Quality improves too. In Quality Mode, LLaDA2.1 surpasses LLaDA2.0 benchmark averages for both model sizes (16B Mini and 100B Flash). Editing doesn’t just enable speed — it improves output quality.
Tokens-per-forward (TPF) nearly doubles. 5.93 vs. 3.08 TPF for the Flash model in Speedy Mode.
892 tokens per second on HumanEval+. The 100B Flash model with quantization substantially outpaces comparable autoregressive models:
- LLaDA2.1 Flash: 892 TPS
- Qwen3-30B-A3B: 240 TPS
- Ling-flash-2.0: 257 TPS

That’s roughly 3.5x to 3.7x the throughput of these autoregressive baselines on HumanEval+.
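
The arithmetic behind these figures is easy to check. The short script below uses only the numbers quoted above. The helper names are made up for illustration, and the 1.0 tokens-per-forward baseline is an assumption: it describes plain autoregressive decoding without tricks such as speculative decoding.

```python
import math

# Throughput figures quoted in the article (tokens per second, HumanEval+).
TPS = {
    "LLaDA2.1 Flash": 892,
    "Qwen3-30B-A3B": 240,
    "Ling-flash-2.0": 257,
}


def speedup(baseline, model="LLaDA2.1 Flash"):
    """Throughput of `model` relative to `baseline`."""
    return TPS[model] / TPS[baseline]


def forward_passes(n_tokens, tokens_per_forward):
    """Forward passes needed to emit n_tokens at a given average TPF."""
    return math.ceil(n_tokens / tokens_per_forward)


for name in ("Qwen3-30B-A3B", "Ling-flash-2.0"):
    print(f"vs {name}: {speedup(name):.2f}x")
# vs Qwen3-30B-A3B: 3.72x
# vs Ling-flash-2.0: 3.47x

# 1.0 TPF is an assumed baseline for plain one-token-at-a-time decoding.
for tpf in (1.0, 3.08, 5.93):
    print(f"TPF {tpf}: {forward_passes(1000, tpf)} passes per 1000 tokens")
# TPF 1.0: 1000 passes per 1000 tokens
# TPF 3.08: 325 passes per 1000 tokens
# TPF 5.93: 169 passes per 1000 tokens
```

Fewer forward passes does not translate one-to-one into wall-clock speed, since each pass can cost a different amount, which is why the article reports tokens per second separately.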

How They Made It Work

The editing mechanism required training changes across three stages:

Stage 1 — Continual Pre-Training: The model trains on two objectives simultaneously. The standard “predict masked tokens” task (M2T) plus a new objective that introduces random noise into existing tokens and asks the model to correct them (T2T). This builds drafting and error-correction instincts from the ground up.
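
One way to picture the dual objective is as a data-corruption step that produces both kinds of training target at once. The sketch below is an illustration under assumptions, not the paper's recipe: the `corrupt` function, the `<mask>` symbol, the per-position independent sampling, and the probabilities are all hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def corrupt(tokens, vocab, mask_prob, noise_prob, rng):
    """Build one corrupted training input for the dual objective.

    Each position is independently:
      - masked (the model must predict it: an M2T target), or
      - replaced by a random different token (the model must correct it:
        a T2T target), or
      - left untouched.
    Returns (corrupted, m2t_positions, t2t_positions).
    Assumes vocab holds at least two distinct tokens.
    """
    corrupted, m2t, t2t = [], [], []
    for i, tok in enumerate(tokens):
        r = rng.random()
        if r < mask_prob:
            corrupted.append(MASK)
            m2t.append(i)
        elif r < mask_prob + noise_prob:
            # Pick a wrong token so there is always something to correct.
            corrupted.append(rng.choice([v for v in vocab if v != tok]))
            t2t.append(i)
        else:
            corrupted.append(tok)
    return corrupted, m2t, t2t


clean = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["the", "cat", "dog", "sat", "ran", "on", "mat"]
noisy, m2t, t2t = corrupt(clean, vocab, mask_prob=0.4, noise_prob=0.2,
                          rng=random.Random(0))
print(noisy, m2t, t2t)
```

The training loss would then ask the model to recover the clean token at every position listed in `m2t` and `t2t`, so the same network learns to fill blanks and to repair wrong tokens.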

Stage 2 — Supervised Fine-Tuning: The same dual objective continues with instruction-following data. Multi-Turn Forward (MTF) training exposes the model to diverse editing scenarios.

Stage 3 — Reinforcement Learning: LLaDA2.1 implements what the authors describe as the first large-scale RL framework for diffusion language models. Standard policy gradients require sequence-level log-likelihoods, which are intractable for diffusion models. Their solution: ELBO-based Block-level Policy Optimization uses the Evidence Lower Bound as a tractable proxy, parallelized via Vectorized Likelihood Estimation.
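
For readers who want to see what "ELBO as a proxy" means, the bound below is the generic one from earlier masked-diffusion work in the LLaDA line, written for a response x_0 of length L given a prompt c. It is shown as background only. The block-level objective in the LLaDA2.1 paper may differ in its details, and the symbols here (c, M for the mask token, q for the masking process) are notation chosen for this sketch.

```latex
\log p_\theta(x_0 \mid c) \;\ge\;
\mathbb{E}_{t \sim \mathcal{U}(0,1)}\;
\mathbb{E}_{x_t \sim q(x_t \mid x_0)}
\left[
  \frac{1}{t} \sum_{i=1}^{L}
  \mathbf{1}\!\left[x_t^{i} = \mathrm{M}\right]
  \log p_\theta\!\left(x_0^{i} \mid x_t, c\right)
\right]
```

Here q masks each token of x_0 independently with probability t. The right-hand side can be estimated by sampling a few masking levels and running ordinary forward passes, which is what makes it usable inside a policy-gradient update where the exact sequence log-likelihood is not available.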

Why This Matters

Autoregressive models have dominated because they work. But their sequential nature means inference speed is fundamentally bounded — you can’t generate the next token until you’ve generated this one.

Diffusion models break that constraint by generating in parallel. LLaDA2.1 shows that with the right training and decoding strategy, you can have both parallelism and error correction. The speed-quality tradeoff becomes a dial rather than a cliff.

For production inference, this has real implications. Code completion, where throughput matters more than latency per token, is an obvious win. But any high-volume generation task benefits.

The 100B parameter scale also matters. LLaDA2.0 proved diffusion could work at scale. LLaDA2.1 proves it can be fast at scale.

What’s Next

This is research from Ant Group, not a product launch. But the paper is out, and the techniques are documented. Expect to see these ideas absorbed into open-source implementations.

For those following the diffusion LLM space: this is probably the most significant advancement since the original LLaDA scaling work in December. The typewriter model of language generation may finally have competition.
