Avatar Forcing: Real-Time AI Avatars That Actually Listen
Most AI avatar systems have a fundamental flaw: they talk at you, not with you.
You feed them a portrait and an audio clip, they render a talking head. The mouth syncs, the head bobs a little, maybe there’s a blink. But there’s no you in the loop. The avatar doesn’t know you nodded. It doesn’t register your laugh. It can’t adjust when you start speaking mid-sentence.
That’s not conversation. That’s a video render.
Avatar Forcing, out of KAIST, NTU Singapore, and DeepAuto.ai, is a CVPR 2026 paper that attacks this problem directly. The goal: an avatar that responds to both what you say and how you react — in real time, on a single GPU.
The Core Problem They’re Solving
Talking-head generation has gotten very good at one thing: making a static portrait speak convincingly. Systems like INFP (CVPR 2025) can produce high-quality, temporally consistent video.
But they’re built on bidirectional transformers — meaning the model sees the entire input sequence, including future frames, before generating anything. This makes the output smooth and consistent, but it makes real-time interaction structurally impossible. You can’t react to something you haven’t seen yet.
Avatar Forcing flips the architecture. It uses causal generation — the model only sees past and present, never future. Every frame is generated based on what’s happened so far, including your live audio and facial motion.
That’s the foundational change that makes real-time interaction possible.
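To make the distinction concrete, here's a toy sketch of the two attention patterns (illustrative only, not the paper's implementation): a bidirectional model needs the full sequence before it can emit anything, while a causal mask lets each frame be generated from the past alone.

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Return a boolean mask where mask[i, j] = True means frame i may attend to frame j."""
    if causal:
        # Lower-triangular: each frame sees only itself and the past,
        # so frames can be emitted one at a time (streaming).
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Bidirectional: every frame sees the whole sequence, including the future,
    # so nothing can be emitted until all input frames are available.
    return np.ones((seq_len, seq_len), dtype=bool)

print(attention_mask(4, causal=True).astype(int))
# Frame 0 attends only to frame 0; frame 3 attends to frames 0..3.
```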
How It Works
Diffusion Forcing
Avatar Forcing is built on Diffusion Forcing — a framework that treats each token in a sequence as having its own independent noise level. This lets the model generate sequences causally (one step at a time) while maintaining temporal consistency.
The key insight: instead of denoising an entire motion chunk at once, Avatar Forcing generates motion latents block by block, with each block informed by real-time user input.
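A minimal sketch of that block-by-block loop, with stand-in names (`denoise_block` and the stub denoiser are hypothetical, not the authors' code): each block starts from noise, is iteratively denoised while conditioned on the already-generated past and the latest user input, and then joins the causal history.

```python
import numpy as np

rng = np.random.default_rng(0)
MOTION_DIM, BLOCK_LEN, NUM_BLOCKS, DENOISE_STEPS = 8, 4, 3, 5

def denoise_block(noisy, noise_level, past, user_input):
    """Stand-in for the learned denoiser: one step toward a clean latent,
    conditioned on already-generated past motion (unused in this stub)
    and live user input."""
    target = 0.1 * user_input  # placeholder for the model's prediction
    return noisy + noise_level * (target - noisy)

generated = []  # clean motion latents produced so far (the causal "past")
for b in range(NUM_BLOCKS):
    user_input = rng.standard_normal(MOTION_DIM)          # live audio/motion features
    block = rng.standard_normal((BLOCK_LEN, MOTION_DIM))  # start from pure noise
    # Diffusion Forcing allows per-token noise levels; here we simply anneal
    # the current block from noisy to clean while the past stays fully clean.
    for level in np.linspace(1.0, 0.0, DENOISE_STEPS):
        block = denoise_block(block, level, generated, user_input)
    generated.append(block)

print(len(generated), generated[0].shape)
```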
To smooth the boundaries between blocks (a known failure mode in causal generation), they introduce blockwise causal look-ahead masks — a technique that prevents jarring transitions between adjacent motion blocks without breaking the causal constraint.
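One plausible construction of such a mask (an assumption on our part; the paper defines the exact pattern): each frame attends to everything through the end of its own block, plus a few frames beyond the block boundary, so adjacent blocks share context without exposing the distant future.

```python
import numpy as np

def blockwise_lookahead_mask(num_frames: int, block_len: int, lookahead: int) -> np.ndarray:
    """mask[i, j] = True if frame i may attend to frame j.
    Blockwise causal: a frame sees every frame in its own and earlier blocks.
    Look-ahead: it also sees up to `lookahead` frames past its block boundary,
    smoothing block transitions without revealing the distant future."""
    blocks = np.arange(num_frames) // block_len
    block_end = (blocks + 1) * block_len - 1              # last frame of own block
    horizon = np.minimum(block_end + lookahead, num_frames - 1)
    j = np.arange(num_frames)
    return j[None, :] <= horizon[:, None]

m = blockwise_lookahead_mask(num_frames=8, block_len=4, lookahead=1)
# Frames in block 0 can peek one frame into block 1 (frame 4) but no further.
print(m.astype(int))
```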
What It Responds To
The system takes two live inputs from the user:
- Audio — your voice, including silence, laughter, filler sounds
- Motion — your facial movements via webcam (nods, expressions, head tilts)
The avatar generates responses to both simultaneously. If you nod, the avatar acknowledges. If you laugh, it reacts. If you start talking while the avatar is speaking, it can adapt. This is what they call multimodal responsiveness.
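As a toy illustration of the two input streams, the per-frame audio and facial-motion features can be fused into one conditioning vector for the generator (the feature dimensions and the `fuse_user_condition` name are made up for this sketch):

```python
import numpy as np

def fuse_user_condition(audio_feat: np.ndarray, motion_feat: np.ndarray) -> np.ndarray:
    """Hypothetical fusion: concatenate per-frame audio and facial-motion
    features into a single conditioning vector for the motion generator."""
    return np.concatenate([audio_feat, motion_feat], axis=-1)

audio = np.zeros((1, 32))   # e.g. a per-frame speech embedding (user may be silent)
motion = np.ones((1, 16))   # e.g. head pose + expression coefficients from webcam
cond = fuse_user_condition(audio, motion)
print(cond.shape)  # → (1, 48)
```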
The Active Listening Problem
Here’s a subtle but important contribution: the team noticed that most training data for talking-head models is heavily weighted toward speaking — people actively talking. Listening footage (nodding, reacting, staying engaged) is far less expressive and gets underrepresented.
Training on this imbalanced data produces avatars that go stiff or static when they’re supposed to be listening. They look bored.
The fix is clever. Instead of collecting labeled “good listening” data (expensive and subjective), they use Direct Preference Optimization (DPO) — a technique borrowed from RLHF in LLMs — to teach the model what good listening looks like.
They synthetically construct “losing samples” by dropping the user’s conditioning signal, producing deliberately passive, unexpressive motion. They then train the model to prefer the reactive version over the passive one. No human labels needed.
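The preference objective itself is standard DPO; here is a scalar sketch with made-up log-probabilities standing in for the model's scores (the actual diffusion-model formulation in the paper is more involved):

```python
import math

def dpo_loss(logp_win, logp_lose, logp_win_ref, logp_lose_ref, beta=0.1):
    """Standard DPO objective: push the model to prefer the reactive ("winning")
    sample over the passive ("losing") one, relative to a frozen reference model."""
    margin = beta * ((logp_win - logp_win_ref) - (logp_lose - logp_lose_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Winning sample: motion generated WITH the user's conditioning signal.
# Losing sample: the same clip regenerated with the conditioning dropped,
# yielding deliberately passive, unexpressive motion (no human labels needed).
loss = dpo_loss(logp_win=-1.0, logp_lose=-3.0,
                logp_win_ref=-2.0, logp_lose_ref=-2.0, beta=0.1)
print(round(loss, 4))  # → 0.5981
```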
The result: an avatar that stays engaged, mirrors your expressions, and actually looks like it’s paying attention when it’s listening.
Performance
| Metric | Avatar Forcing | Baseline (INFP) |
|---|---|---|
| Latency | ~500 ms | ~3.4 s |
| Speedup | 6.8x faster | — |
| Human preference | >80% preferred | — |
| GPU requirement | Single H100, ~14 GB VRAM | — |
It outperforms baselines on four metric categories: Reactiveness, Motion Richness, Visual Quality, and Lip Sync. The rPCC scores — measuring correlation between user motion and avatar motion — are notably strong.
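A correlation of that kind can be sketched as a plain Pearson coefficient between a user motion signal and the avatar's response over time (an illustrative stand-in, not the paper's exact rPCC protocol):

```python
import numpy as np

def motion_pcc(user_motion: np.ndarray, avatar_motion: np.ndarray) -> float:
    """Pearson correlation between a user motion signal and the avatar's
    response (e.g. head pitch over time). Higher = more synchronized reaction."""
    u = user_motion - user_motion.mean()
    a = avatar_motion - avatar_motion.mean()
    return float((u * a).sum() / (np.sqrt((u**2).sum() * (a**2).sum()) + 1e-8))

t = np.linspace(0, 2 * np.pi, 100)
user = np.sin(t)                   # e.g. the user nodding
avatar = 0.8 * np.sin(t) + 0.05    # avatar mirroring the nod, scaled and offset
print(round(motion_pcc(user, avatar), 3))  # → 1.0 (correlation is scale-invariant)
```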
How to Try It
The code is not yet released. The paper was accepted to CVPR 2026 and the authors note that code will be available soon.
In the meantime:
- Watch the demo videos on the project page — they're genuinely impressive. Pay attention to the listening segments specifically.
- Read the paper on arXiv — it's well-written, and the ablation studies clearly show the DPO contribution.
- Star/watch the project page — when code drops, it'll likely support any portrait + H100 (or equivalent) setup.
- Check HuggingFace at huggingface.co/papers/2601.00664 for community implementations that may appear before the official release.
Why This Matters
The practical applications are obvious: virtual assistants, telepresence, customer-facing AI agents, digital humans in games. But the technical contribution is worth appreciating on its own.
Getting causal generation to match the quality of bidirectional generation is hard. Adding real-time multimodal input on top is harder. Doing it at 500ms latency on a single GPU — rather than a cluster — is what makes deployment actually realistic.
The DPO trick for listening behavior is also a template worth noting. The problem of “model trained on imbalanced data produces lopsided behavior” appears everywhere in generative AI. Constructing synthetic losing samples to guide preference learning, without any human annotation, is a clean solution that will likely get reused.
Avatar Forcing is a research paper today. In 12 months, expect to see its ideas running in production video call software.
- Paper: arXiv:2601.00664
- Project page: taekyungki.github.io/AvatarForcing
- Code: Coming soon (CVPR 2026)
- Authors: Taekyung Ki et al. — KAIST, NTU Singapore, DeepAuto.ai