Diffusion Models Are Eating Everything: March 2026 Roundup
Diffusion models started as an image generation technique. Then they conquered video. Now they’re coming for language, speech, and multimodal understanding — all at once.
Three releases from the past week show where diffusion is headed:
1. Dynin-Omni: One Model, All Modalities
What: An 8B-parameter masked-diffusion model that handles text, image, video, and speech — understanding and generation — in a single architecture.
Why it matters: Unlike autoregressive models that serialize everything into a left-to-right sequence, Dynin-Omni treats all modalities as discrete tokens in a shared vocabulary and generates via iterative masked denoising.
The key insight: Bidirectional context modeling + parallel multi-token prediction = globally conditioned any-to-any inference without modality-specific decoders.
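The masked-denoising loop is easy to picture with a toy sketch. This is not Dynin-Omni's actual implementation — `predict` here is a stand-in "model" that returns a (confidence, token) pair per masked position — but it shows the core mechanic: every step scores all masked positions against the full bidirectional context and reveals several of them in parallel.

```python
import random

MASK = "<mask>"
random.seed(0)

def toy_denoise(tokens, predict, steps):
    """Iterative masked denoising: each step, score every masked
    position given the FULL sequence (bidirectional context), then
    reveal the most confident positions in parallel."""
    per_step = max(1, tokens.count(MASK) // steps)
    seq = list(tokens)
    while MASK in seq:
        scored = []
        for i, t in enumerate(seq):
            if t == MASK:
                conf, tok = predict(seq, i)
                scored.append((conf, i, tok))
        # Commit the top-confidence predictions simultaneously.
        for _, i, tok in sorted(scored, reverse=True)[:per_step]:
            seq[i] = tok
    return seq

# Stand-in "model": deterministic token per position, random confidence.
vocab = ["a", "b", "c", "d"]
predict = lambda seq, i: (random.random(), vocab[i % len(vocab)])
print(toy_denoise([MASK] * 8, predict, steps=4))
```

With a shared discrete vocabulary, the same loop serves any modality: the tokens could equally be image patches, video frames, or speech units.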
Capabilities
| Direction | Task |
|---|---|
| Text → Text | Chat & reasoning |
| Image → Text | Image understanding |
| Video → Text | Video understanding |
| Text → Image | Image generation |
| Image → Image | Image editing |
| Speech → Text | ASR |
| Text → Speech | TTS |
Training Pipeline
- Modality Adaptation — Anchor video/speech tokens to shared semantic space
- Omni-Modal SFT — Full supervised fine-tuning with model merging to prevent catastrophic forgetting
- Continual Capability Scaling — Ongoing improvement
```bash
# Inference across any modality
bash scripts/inference.sh --t2i     # Text to image
bash scripts/inference.sh --mmu     # Multimodal understanding
bash scripts/inference.sh --speech  # ASR/TTS
```
Links: GitHub | HuggingFace | Demo
2. WeDLM: Diffusion Language Models Go Fast
What: Tencent’s diffusion language model that runs 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning tasks.
The breakthrough: They figured out how to do parallel decoding under standard causal attention, making diffusion LLMs compatible with all the existing inference infrastructure — FlashAttention, PagedAttention, CUDA Graphs, KV cache.
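The paper's actual algorithm is more involved, but the shape of the idea resembles draft-and-verify decoding: propose a block of tokens in one parallel step, then commit the longest prefix that a causal, left-to-right check agrees with. A minimal sketch, with `draft` and `target` as hypothetical stand-ins for the parallel proposer and the causal verifier:

```python
def parallel_decode(draft, target, prompt, n_new, k=4):
    """Propose k tokens in one parallel step, then accept the longest
    prefix the causal model agrees with. At least one token is
    committed per round, so decoding never stalls."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        rounds += 1
        proposal = draft(out, k)
        accepted = 0
        for tok in proposal:
            if tok != target(out):   # standard causal check
                break
            out.append(tok)
            accepted += 1
        if accepted == 0:            # fall back to one causal token
            out.append(target(out))
    return out, rounds

# Perfectly predictable text: every proposal is fully accepted.
pattern = "abababababab"
target = lambda seq: pattern[len(seq)]
draft = lambda seq, k: [pattern[len(seq) + i] for i in range(k)]
out, rounds = parallel_decode(draft, target, list("ab"), n_new=8)
print("".join(out), rounds)  # 8 new tokens in 2 rounds, not 8
```

Because acceptance is checked under ordinary causal attention, the KV cache, FlashAttention, and PagedAttention all apply unchanged — which is precisely the compatibility claim.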
Speed vs Quality
| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|---|---|---|
| ARC-C (0-shot) | 91.47 | 92.92 |
| GSM8K (3-shot) | 89.91 | 92.27 |
| HumanEval (4-shot) | 71.95 | 80.49 |
| MMLU (5-shot) | 71.52 | 75.14 |
| Average | 75.12 | 77.53 |
Not only faster, but also better on most benchmarks.
Speedup by Task Type
| Scenario | Speedup | Why |
|---|---|---|
| Math Reasoning | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |
The pattern: more predictable outputs = more parallelization = bigger speedups. Math and code benefit most because the token distribution is narrower.
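The entropy claim is easy to quantify. A math or code step usually has one dominant continuation, while open-ended prose spreads probability mass across many tokens; a quick Shannon-entropy calculation (the distributions below are illustrative, not measured) makes the gap concrete:

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a next-token distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Math/code-like step: one continuation dominates.
peaked = [0.97, 0.01, 0.01, 0.01]
# Open-ended prose: several continuations are plausible.
flat = [0.25, 0.25, 0.25, 0.25]

print(f"peaked: {entropy(peaked):.2f} bits, flat: {entropy(flat):.2f} bits")
```

Low-entropy positions can be committed in parallel with little risk of disagreement; high-entropy positions force the decoder back toward one-at-a-time behavior, which is why open-ended QA sees the smallest speedup.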
```python
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B-Instruct")
prompt = "Solve step by step: what is 17 * 24?"
outputs = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=512))
```
Links: HuggingFace | Paper | GitHub
3. Frame Guidance: Training-Free Video Control
What: A framework that gives you frame-level control over video diffusion models — keyframes, depth, sketch, style, loops — without any training.
ICLR 2026 paper from KAIST + Adobe Research.
Control Modes
| Task | Description |
|---|---|
| Keyframe-guided | Force specific frames at specific positions |
| Color block | Control composition via color regions |
| Depth | Guide video with depth maps |
| Sketch | Lineart-to-video |
| Stylized | Apply style across video |
| Loop | Create seamless loops |
The Key Parameters
```
--guidance_lr   # Step size η (default: 3e0)
--guidance_step # Number of guidance steps M
--fixed_frames  # Which frames to control (e.g., [25, 48])
--loss_fn       # Task-specific loss (frame, style, depth, lineart, loop)
--travel_time   # When to apply time-travel steps
```
Works with: CogX-I2V, CogX-T2V, Wan-I2V, Wan-T2V
The beauty is the "training-free" part: you're not fine-tuning the video model, just steering its sampling process. That means Frame Guidance applies to any compatible video diffusion model without retraining.
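The core move is a gradient step on the sample itself, not the weights. A toy sketch under heavy simplifying assumptions: scalar "frames" instead of latents, an assumed MSE keyframe loss with its analytic gradient, and a small hand-picked learning rate (the real pipeline differentiates through a decoder and uses the `--guidance_lr` step size):

```python
def guidance_step(frames, targets, fixed, lr=0.1):
    """One training-free guidance update: nudge the sampled frames
    toward keyframe targets via gradient descent on an MSE loss.
    The model's weights are never touched — only the sample."""
    new = list(frames)
    for i, tgt in zip(fixed, targets):
        grad = 2.0 * (frames[i] - tgt)   # d/dx of (x - tgt)^2
        new[i] = frames[i] - lr * grad
    return new

# Toy "video" of 6 scalar frames; pin frames 1 and 4 to targets.
frames = [0.0] * 6
fixed, targets = [1, 4], [1.0, -1.0]
for _ in range(20):                       # M guidance steps
    frames = guidance_step(frames, targets, fixed)
print([round(f, 3) for f in frames])      # pinned frames converge
```

Swapping the loss function (style, depth, lineart, loop) changes what the sample is steered toward, which is why one framework covers all six control modes.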
Links: GitHub | Paper | Project Page
The Bigger Picture
What do these three projects have in common?
1. Diffusion Is No Longer “Just Images”
- Dynin-Omni: Text + image + video + speech
- WeDLM: Pure language modeling
- Frame Guidance: Video with precise control
The technique has generalized beyond its original domain.
2. Parallel > Sequential
Both Dynin-Omni and WeDLM exploit diffusion’s inherent parallelism:
- Autoregressive: Generate token 1, then token 2, then token 3…
- Diffusion: Generate multiple tokens simultaneously, refine iteratively
For tasks with predictable structure, this is a massive win. WeDLM’s 3-6× speedup on math isn’t a fluke — it’s what happens when you can parallelize generation.
3. Training-Free Adaptation
Frame Guidance shows you don’t always need to fine-tune. Guidance-based control lets you steer generation without modifying weights. This is crucial for:
- Rapid iteration — Try new control modes without retraining
- Composability — Stack multiple guidance signals
- Resource efficiency — No GPU-weeks for each new capability
4. Infrastructure Compatibility Matters
WeDLM’s big contribution isn’t just speed — it’s making diffusion LLMs work with existing inference infrastructure. FlashAttention, PagedAttention, vLLM — all the optimizations that make production LLMs fast now work with diffusion.
This is how techniques go from research to production: not by requiring new infrastructure, but by fitting into what already exists.
What’s Next
If diffusion can do language at 3-6× the speed with better quality, handle omnimodal understanding in a single architecture, and enable training-free video control…
The question isn’t whether diffusion will expand further. It’s which domains are next.
Candidates:
- Robotics — Diffusion policies are already showing promise
- Audio — Beyond TTS to full audio generation and understanding
- 3D — NeRF + diffusion for scene generation
- Structured data — SQL, code, formal specifications
The march continues.