Diffusion Models Are Eating Everything: March 2026 Roundup
Diffusion models started as an image generation technique. Then they conquered video. Now they’re coming for language, speech, and multimodal understanding — all at once.
Three releases from the past week show where diffusion is headed:
1. Dynin-Omni: One Model, All Modalities
What: An 8B-parameter masked-diffusion model that handles text, image, video, and speech — understanding and generation — in a single architecture.
Why it matters: Unlike autoregressive models that serialize everything into a left-to-right sequence, Dynin-Omni treats all modalities as discrete tokens in a shared vocabulary and generates via iterative masked denoising.
The key insight: Bidirectional context modeling + parallel multi-token prediction = globally conditioned any-to-any inference without modality-specific decoders.
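The masked-denoising loop is easy to picture with a toy sketch. This is not Dynin-Omni's actual implementation — `predict` here is a stand-in "model" that returns a (confidence, token) pair per masked position — but it shows the core mechanic: every step scores all masked positions against the full bidirectional context and reveals several of them in parallel.

```python
import random

MASK = "<mask>"
random.seed(0)

def toy_denoise(tokens, predict, steps):
    """Iterative masked denoising: each step, score every masked
    position given the FULL sequence (bidirectional context), then
    reveal the most confident positions in parallel."""
    per_step = max(1, tokens.count(MASK) // steps)
    seq = list(tokens)
    while MASK in seq:
        scored = []
        for i, t in enumerate(seq):
            if t == MASK:
                conf, tok = predict(seq, i)
                scored.append((conf, i, tok))
        # Commit the top-confidence predictions simultaneously.
        for _, i, tok in sorted(scored, reverse=True)[:per_step]:
            seq[i] = tok
    return seq

# Stand-in "model": deterministic token per position, random confidence.
vocab = ["a", "b", "c", "d"]
predict = lambda seq, i: (random.random(), vocab[i % len(vocab)])
print(toy_denoise([MASK] * 8, predict, steps=4))
```

With a shared discrete vocabulary, the same loop serves any modality: the tokens could equally be image patches, video frames, or speech units.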
Capabilities
| Direction | Task |
|---|---|
| Text → Text | Chat & reasoning |
| Image → Text | Image understanding |
| Video → Text | Video understanding |
| Text → Image | Image generation |
| Image → Image | Image editing |
| Speech → Text | ASR |
| Text → Speech | TTS |
Training Pipeline
- Modality Adaptation — Anchor video/speech tokens to shared semantic space
- Omni-Modal SFT — Full supervised fine-tuning with model merging to prevent catastrophic forgetting
- Continual Capability Scaling — Ongoing improvement
```bash
# Inference across any modality
bash scripts/inference.sh --t2i     # Text to image
bash scripts/inference.sh --mmu     # Multimodal understanding
bash scripts/inference.sh --speech  # ASR/TTS
```
Links: GitHub | HuggingFace | Demo
2. WeDLM: Diffusion Language Models Go Fast
What: Tencent’s diffusion language model that runs 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning tasks.
The breakthrough: They figured out how to do parallel decoding under standard causal attention, making diffusion LLMs compatible with all the existing inference infrastructure — FlashAttention, PagedAttention, CUDA Graphs, KV cache.
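The paper's actual algorithm is more involved, but the shape of the idea resembles draft-and-verify decoding: propose a block of tokens in one parallel step, then commit the longest prefix that a causal, left-to-right check agrees with. A minimal sketch, with `draft` and `target` as hypothetical stand-ins for the parallel proposer and the causal verifier:

```python
def parallel_decode(draft, target, prompt, n_new, k=4):
    """Propose k tokens in one parallel step, then accept the longest
    prefix the causal model agrees with. At least one token is
    committed per round, so decoding never stalls."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        rounds += 1
        proposal = draft(out, k)
        accepted = 0
        for tok in proposal:
            if tok != target(out):   # standard causal check
                break
            out.append(tok)
            accepted += 1
        if accepted == 0:            # fall back to one causal token
            out.append(target(out))
    return out, rounds

# Perfectly predictable text: every proposal is fully accepted.
pattern = "abababababab"
target = lambda seq: pattern[len(seq)]
draft = lambda seq, k: [pattern[len(seq) + i] for i in range(k)]
out, rounds = parallel_decode(draft, target, list("ab"), n_new=8)
print("".join(out), rounds)  # 8 new tokens in 2 rounds, not 8
```

Because acceptance is checked under ordinary causal attention, the KV cache, FlashAttention, and PagedAttention all apply unchanged — which is precisely the compatibility claim.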
Speed vs Quality
| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
|---|---|---|
| ARC-C (0-shot) | 91.47 | 92.92 |
| GSM8K (3-shot) | 89.91 | 92.27 |
| HumanEval (4-shot) | 71.95 | 80.49 |
| MMLU (5-shot) | 71.52 | 75.14 |
| Average | 75.12 | 77.53 |
Not only faster, but also better on most benchmarks.
Speedup by Task Type
| Scenario | Speedup | Why |
|---|---|---|
| Math Reasoning | 3-6× | Structured, predictable output |
| Code Generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |
The pattern: more predictable outputs = more parallelization = bigger speedups. Math and code benefit most because the token distribution is narrower.
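The entropy claim is easy to quantify. A math or code step usually has one dominant continuation, while open-ended prose spreads probability mass across many tokens; a quick Shannon-entropy calculation (the distributions below are illustrative, not measured) makes the gap concrete:

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a next-token distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Math/code-like step: one continuation dominates.
peaked = [0.97, 0.01, 0.01, 0.01]
# Open-ended prose: several continuations are plausible.
flat = [0.25, 0.25, 0.25, 0.25]

print(f"peaked: {entropy(peaked):.2f} bits, flat: {entropy(flat):.2f} bits")
```

Low-entropy positions can be committed in parallel with little risk of disagreement; high-entropy positions force the decoder back toward one-at-a-time behavior, which is why open-ended QA sees the smallest speedup.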
```python
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B-Instruct")
prompt = "Solve step by step: what is 17 * 24?"
outputs = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=512))
```
Links: HuggingFace | Paper | GitHub
3. Frame Guidance: Training-Free Video Control
What: A framework that gives you frame-level control over video diffusion models — keyframes, depth, sketch, style, loops — without any training.
ICLR 2026 paper from KAIST + Adobe Research.
Control Modes
| Task | Description |
|---|---|
| Keyframe-guided | Force specific frames at specific positions |
| Color block | Control composition via color regions |
| Depth | Guide video with depth maps |
| Sketch | Lineart-to-video |
| Stylized | Apply style across video |
| Loop | Create seamless loops |
The Key Parameters
```
--guidance_lr   # Step size η (default: 3e0)
--guidance_step # Number of guidance steps M
--fixed_frames  # Which frames to control (e.g., [25, 48])
--loss_fn       # Task-specific loss (frame, style, depth, lineart, loop)
--travel_time   # When to apply time-travel steps
```
Works with: CogX-I2V, CogX-T2V, Wan-I2V, Wan-T2V
The beauty is the "training-free" part: you're not fine-tuning the video model, just steering its sampling process. That means Frame Guidance applies to any compatible video diffusion model without retraining.
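The core move is a gradient step on the sample itself, not the weights. A toy sketch under heavy simplifying assumptions: scalar "frames" instead of latents, an assumed MSE keyframe loss with its analytic gradient, and a small hand-picked learning rate (the real pipeline differentiates through a decoder and uses the `--guidance_lr` step size):

```python
def guidance_step(frames, targets, fixed, lr=0.1):
    """One training-free guidance update: nudge the sampled frames
    toward keyframe targets via gradient descent on an MSE loss.
    The model's weights are never touched — only the sample."""
    new = list(frames)
    for i, tgt in zip(fixed, targets):
        grad = 2.0 * (frames[i] - tgt)   # d/dx of (x - tgt)^2
        new[i] = frames[i] - lr * grad
    return new

# Toy "video" of 6 scalar frames; pin frames 1 and 4 to targets.
frames = [0.0] * 6
fixed, targets = [1, 4], [1.0, -1.0]
for _ in range(20):                       # M guidance steps
    frames = guidance_step(frames, targets, fixed)
print([round(f, 3) for f in frames])      # pinned frames converge
```

Swapping the loss function (style, depth, lineart, loop) changes what the sample is steered toward, which is why one framework covers all six control modes.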
Links: GitHub | Paper | Project Page
The Bigger Picture
What do these three projects have in common?
1. Diffusion Is No Longer “Just Images”
- Dynin-Omni: Text + image + video + speech
- WeDLM: Pure language modeling
- Frame Guidance: Video with precise control
The technique has generalized beyond its original domain.
2. Parallel > Sequential
Both Dynin-Omni and WeDLM exploit diffusion’s inherent parallelism:
- Autoregressive: Generate token 1, then token 2, then token 3…
- Diffusion: Generate multiple tokens simultaneously, refine iteratively
For tasks with predictable structure, this is a massive win. WeDLM’s 3-6× speedup on math isn’t a fluke — it’s what happens when you can parallelize generation.
3. Training-Free Adaptation
Frame Guidance shows you don’t always need to fine-tune. Guidance-based control lets you steer generation without modifying weights. This is crucial for:
- Rapid iteration — Try new control modes without retraining
- Composability — Stack multiple guidance signals
- Resource efficiency — No GPU-weeks for each new capability
4. Infrastructure Compatibility Matters
WeDLM’s big contribution isn’t just speed — it’s making diffusion LLMs work with existing inference infrastructure. FlashAttention, PagedAttention, vLLM — all the optimizations that make production LLMs fast now work with diffusion.
This is how techniques go from research to production: not by requiring new infrastructure, but by fitting into what already exists.
What’s Next
If diffusion can do language at 3-6× the speed with better quality, handle omnimodal understanding in a single architecture, and enable training-free video control…
The question isn’t whether diffusion will expand further. It’s which domains are next.
Candidates:
- Robotics — Diffusion policies are already showing promise
- Audio — Beyond TTS to full audio generation and understanding
- 3D — NeRF + diffusion for scene generation
- Structured data — SQL, code, formal specifications
The march continues.