Diffusion Models Are Eating Everything: March 2026 Roundup

By Prahlad Menon · 3 min read

Diffusion models started as an image generation technique. Then they conquered video. Now they’re coming for language, speech, and multimodal understanding — all at once.

Three releases from the past week show where diffusion is headed:

1. Dynin-Omni: One Model, All Modalities

What: An 8B-parameter masked-diffusion model that handles text, image, video, and speech — understanding and generation — in a single architecture.

Why it matters: Unlike autoregressive models that serialize everything into a left-to-right sequence, Dynin-Omni treats all modalities as discrete tokens in a shared vocabulary and generates via iterative masked denoising.

The key insight: Bidirectional context modeling + parallel multi-token prediction = globally conditioned any-to-any inference without modality-specific decoders.
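The masked-denoising loop behind this is simple to sketch. Below is a toy Python version (the `toy_predict` model is a stand-in for the real bidirectional transformer, which the release doesn't detail at this level): start fully masked, predict every masked position in parallel, commit the most confident predictions, re-mask the rest, and repeat.

```python
import random

MASK = "<mask>"

def toy_predict(tokens):
    """Stand-in for the model: returns {position: (token, confidence)} for
    each masked slot. In the real model this is one bidirectional
    transformer pass, so every prediction sees the full (partial) context."""
    vocab = ["the", "cat", "sat", "on", "mat"]
    return {i: (random.choice(vocab), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def masked_denoise(length=5, steps=3):
    """Iterative masked denoising: commit the most confident half of the
    parallel predictions each step, re-mask the rest, refine."""
    tokens = [MASK] * length
    for _ in range(steps):
        preds = toy_predict(tokens)
        if not preds:
            break
        keep = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in keep[: max(1, len(keep) // 2)]:
            tokens[i] = preds[i][0]
    # finalize any positions still masked after the refinement budget
    for i, (tok, _) in toy_predict(tokens).items():
        tokens[i] = tok
    return tokens
```

The point of the sketch: every masked position is predicted simultaneously with bidirectional context, which is what makes any-to-any generation possible without modality-specific left-to-right decoders.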

Capabilities

| Direction | Task |
| --- | --- |
| Text → Text | Chat & reasoning |
| Image → Text | Image understanding |
| Video → Text | Video understanding |
| Text → Image | Image generation |
| Image → Image | Image editing |
| Speech → Text | ASR |
| Text → Speech | TTS |

Training Pipeline

  1. Modality Adaptation — Anchor video/speech tokens to shared semantic space
  2. Omni-Modal SFT — Full supervised fine-tuning with model merging to prevent catastrophic forgetting
  3. Continual Capability Scaling — Ongoing improvement
```bash
# Inference across any modality
bash scripts/inference.sh --t2i     # Text to image
bash scripts/inference.sh --mmu     # Multimodal understanding
bash scripts/inference.sh --speech  # ASR / TTS
```

Links: GitHub | HuggingFace | Demo


2. WeDLM: Diffusion Language Models Go Fast

What: Tencent’s diffusion language model that runs 3-6× faster than vLLM-optimized Qwen3-8B on math reasoning tasks.

The breakthrough: They figured out how to do parallel decoding under standard causal attention, making diffusion LLMs compatible with all the existing inference infrastructure — FlashAttention, PagedAttention, CUDA Graphs, KV cache.
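A hedged sketch of the verification idea (my reading of the approach, not WeDLM's actual code): propose several tokens at once, score them all in a single causal forward pass, and accept the longest prefix that matches what greedy decoding would have produced at each position. Because scoring is an ordinary causal pass, the KV cache and attention kernels need no changes.

```python
def verify_parallel(proposed, greedy_next):
    """Accept the longest prefix of a parallel proposal that agrees with
    greedy causal decoding. `greedy_next[i]` is the model's top-1 token
    given the verified prefix plus proposed[:i] -- all positions are
    computable in ONE causal forward pass, so FlashAttention, PagedAttention,
    and the KV cache apply unchanged."""
    accepted = []
    for p, g in zip(proposed, greedy_next):
        if p != g:
            break
        accepted.append(p)
    return accepted
```

For example, `verify_parallel([3, 7, 9], [3, 7, 2])` accepts `[3, 7]`: two tokens land in one pass instead of two sequential ones, and decoding resumes after the first disagreement.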

Speed vs Quality

| Benchmark | Qwen3-8B-Instruct | WeDLM-8B-Instruct |
| --- | --- | --- |
| ARC-C (0-shot) | 91.47 | 92.92 |
| GSM8K (3-shot) | 89.91 | 92.27 |
| HumanEval (4-shot) | 71.95 | 80.49 |
| MMLU (5-shot) | 71.52 | 75.14 |
| Average | 75.12 | 77.53 |

Not only faster — also better on most benchmarks.

Speedup by Task Type

| Scenario | Speedup | Why |
| --- | --- | --- |
| Math reasoning | 3-6× | Structured, predictable output |
| Code generation | 2-3× | Deterministic syntax |
| Open-ended QA | 1.5-2× | Higher entropy limits parallelism |

The pattern: more predictable outputs = more parallelization = bigger speedups. Math and code benefit most because the token distribution is narrower.
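One way to see the entropy connection is a toy heuristic (a hypothetical policy for illustration, not WeDLM's actual scheduler): measure the entropy of the next-token distribution and commit more tokens in parallel when it is peaked.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def tokens_to_commit(probs, base=8, floor=1):
    """Hypothetical policy: commit more tokens per step when the
    distribution is peaked (low entropy), fewer when it is flat."""
    h = entropy(probs)
    return max(floor, round(base / (1 + h)))

# Peaked distribution (math/code-like): many tokens committed at once
print(tokens_to_commit([0.97, 0.01, 0.01, 0.01]))
# Flat distribution (open-ended prose): far fewer per step
print(tokens_to_commit([0.25, 0.25, 0.25, 0.25]))
```

A narrow token distribution means parallel predictions rarely conflict, so more of them survive verification per step; that is the mechanism behind the table above.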

```python
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B-Instruct")
prompt = "Solve: what is 17 * 24?"  # example prompt
outputs = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=512))
```

Links: HuggingFace | Paper | GitHub


3. Frame Guidance: Training-Free Video Control

What: A framework that gives you frame-level control over video diffusion models — keyframes, depth, sketch, style, loops — without any training.

ICLR 2026 paper from KAIST + Adobe Research.

Control Modes

| Task | Description |
| --- | --- |
| Keyframe-guided | Force specific frames at specific positions |
| Color block | Control composition via color regions |
| Depth | Guide video with depth maps |
| Sketch | Lineart-to-video |
| Stylized | Apply a style across the video |
| Loop | Create seamless loops |

The Key Parameters

```
--guidance_lr    # Step size η (default: 3e0)
--guidance_step  # Number of guidance steps M
--fixed_frames   # Which frames to control (e.g., [25, 48])
--loss_fn        # Task-specific loss (frame, style, depth, lineart, loop)
--travel_time    # When to apply time-travel steps
```
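These flags map onto a simple update rule: at selected denoising steps, take the gradient of a task loss on the controlled frames and nudge the latent by step size η, repeated M times. A toy sketch of that rule (scalar "latents" and a squared-error loss stand in for the real decoded-frame losses, so the step size here is much smaller than the repo's 3e0 default):

```python
def frame_guidance_step(latent, target, fixed_frames, lr=0.1, steps=4):
    """Toy guidance update: z <- z - lr * d(loss)/dz on controlled frames
    only, repeated `steps` (M) times per denoising step. Real Frame Guidance
    decodes the latent and backpropagates a task loss (depth, style,
    lineart, ...) through the decoder; here the loss is (z - target)^2."""
    z = list(latent)
    for _ in range(steps):
        for i in fixed_frames:
            grad = 2.0 * (z[i] - target[i])  # d/dz of (z - target)^2
            z[i] -= lr * grad
    return z
```

Only the frames listed in `fixed_frames` are pulled toward the control signal; the rest of the video is left to the diffusion model, which is why no retraining is needed.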

Works with: CogX-I2V, CogX-T2V, Wan-I2V, Wan-T2V

The beauty of "training-free": you're not fine-tuning the video model, just guiding its generation process. That means Frame Guidance applies to any compatible video diffusion model without retraining.

Links: GitHub | Paper | Project Page


The Bigger Picture

What do these three projects have in common?

1. Diffusion Is No Longer “Just Images”

  • Dynin-Omni: Text + image + video + speech
  • WeDLM: Pure language modeling
  • Frame Guidance: Video with precise control

The technique has generalized beyond its original domain.

2. Parallel > Sequential

Both Dynin-Omni and WeDLM exploit diffusion’s inherent parallelism:

  • Autoregressive: Generate token 1, then token 2, then token 3…
  • Diffusion: Generate multiple tokens simultaneously, refine iteratively

For tasks with predictable structure, this is a massive win. WeDLM’s 3-6× speedup on math isn’t a fluke — it’s what happens when you can parallelize generation.

3. Training-Free Adaptation

Frame Guidance shows you don’t always need to fine-tune. Guidance-based control lets you steer generation without modifying weights. This is crucial for:

  • Rapid iteration — Try new control modes without retraining
  • Composability — Stack multiple guidance signals
  • Resource efficiency — No GPU-weeks for each new capability

4. Infrastructure Compatibility Matters

WeDLM’s big contribution isn’t just speed — it’s making diffusion LLMs work with existing inference infrastructure. FlashAttention, PagedAttention, vLLM — all the optimizations that make production LLMs fast now work with diffusion.

This is how techniques go from research to production: not by requiring new infrastructure, but by fitting into what already exists.


What’s Next

If diffusion can do language at 3-6× speed with better quality, and omnimodal understanding in a single architecture, and training-free video control…

The question isn’t whether diffusion will expand further. It’s which domains are next.

Candidates:

  • Robotics — Diffusion policies are already showing promise
  • Audio — Beyond TTS to full audio generation and understanding
  • 3D — NeRF + diffusion for scene generation
  • Structured data — SQL, code, formal specifications

The march continues.