Avatar Forcing: Real-Time AI Avatars That Actually Listen
Most AI avatar systems have a fundamental flaw: they talk at you, not with you.
You feed them a portrait and an audio clip, they render a talking head. The mouth syncs, the head bobs a little, maybe there’s a blink. But there’s no you in the loop. The avatar doesn’t know you nodded. It doesn’t register your laugh. It can’t adjust when you start speaking mid-sentence.
That’s not conversation. That’s a video render.
Avatar Forcing, out of KAIST, NTU Singapore, and DeepAuto.ai, is a CVPR 2026 paper that attacks this problem directly. The goal: an avatar that responds to both what you say and how you react — in real time, on a single GPU.
The Core Problem They’re Solving
Talking-head generation has gotten very good at one thing: making a static portrait speak convincingly. Systems like INFP (CVPR 2025) can produce high-quality, temporally consistent video.
But they’re built on bidirectional transformers — meaning the model sees the entire input sequence, including future frames, before generating anything. This makes the output smooth and consistent, but it makes real-time interaction structurally impossible. You can’t react to something you haven’t seen yet.
Avatar Forcing flips the architecture. It uses causal generation — the model only sees past and present, never future. Every frame is generated based on what’s happened so far, including your live audio and facial motion.
That’s the foundational change that makes real-time interaction possible.
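To make the distinction concrete, here's a toy sketch of the two attention patterns (illustrative only, not the paper's implementation): a bidirectional model needs the full sequence before it can emit anything, while a causal mask lets each frame be generated from the past alone.

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Return a boolean mask where mask[i, j] = True means frame i may attend to frame j."""
    if causal:
        # Lower-triangular: each frame sees only itself and the past,
        # so frames can be emitted one at a time (streaming).
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Bidirectional: every frame sees the whole sequence, including the future,
    # so nothing can be emitted until all input frames are available.
    return np.ones((seq_len, seq_len), dtype=bool)

print(attention_mask(4, causal=True).astype(int))
# Frame 0 attends only to frame 0; frame 3 attends to frames 0..3.
```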
How It Works
Diffusion Forcing
Avatar Forcing is built on Diffusion Forcing — a framework that treats each token in a sequence as having its own independent noise level. This lets the model generate sequences causally (one step at a time) while maintaining temporal consistency.
The key insight: instead of denoising an entire motion chunk at once, Avatar Forcing generates motion latents block by block, with each block informed by real-time user input.
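A minimal sketch of that block-by-block loop, with stand-in names (`denoise_block` and the stub denoiser are hypothetical, not the authors' code): each block starts from noise, is iteratively denoised while conditioned on the already-generated past and the latest user input, and then joins the causal history.

```python
import numpy as np

rng = np.random.default_rng(0)
MOTION_DIM, BLOCK_LEN, NUM_BLOCKS, DENOISE_STEPS = 8, 4, 3, 5

def denoise_block(noisy, noise_level, past, user_input):
    """Stand-in for the learned denoiser: one step toward a clean latent,
    conditioned on already-generated past motion (unused in this stub)
    and live user input."""
    target = 0.1 * user_input  # placeholder for the model's prediction
    return noisy + noise_level * (target - noisy)

generated = []  # clean motion latents produced so far (the causal "past")
for b in range(NUM_BLOCKS):
    user_input = rng.standard_normal(MOTION_DIM)          # live audio/motion features
    block = rng.standard_normal((BLOCK_LEN, MOTION_DIM))  # start from pure noise
    # Diffusion Forcing allows per-token noise levels; here we simply anneal
    # the current block from noisy to clean while the past stays fully clean.
    for level in np.linspace(1.0, 0.0, DENOISE_STEPS):
        block = denoise_block(block, level, generated, user_input)
    generated.append(block)

print(len(generated), generated[0].shape)
```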
To smooth the boundaries between blocks (a known failure mode in causal generation), they introduce blockwise causal look-ahead masks — a technique that prevents jarring transitions between adjacent motion blocks without breaking the causal constraint.
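One plausible construction of such a mask (an assumption on our part; the paper defines the exact pattern): each frame attends to everything through the end of its own block, plus a few frames beyond the block boundary, so adjacent blocks share context without exposing the distant future.

```python
import numpy as np

def blockwise_lookahead_mask(num_frames: int, block_len: int, lookahead: int) -> np.ndarray:
    """mask[i, j] = True if frame i may attend to frame j.
    Blockwise causal: a frame sees every frame in its own and earlier blocks.
    Look-ahead: it also sees up to `lookahead` frames past its block boundary,
    smoothing block transitions without revealing the distant future."""
    blocks = np.arange(num_frames) // block_len
    block_end = (blocks + 1) * block_len - 1              # last frame of own block
    horizon = np.minimum(block_end + lookahead, num_frames - 1)
    j = np.arange(num_frames)
    return j[None, :] <= horizon[:, None]

m = blockwise_lookahead_mask(num_frames=8, block_len=4, lookahead=1)
# Frames in block 0 can peek one frame into block 1 (frame 4) but no further.
print(m.astype(int))
```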
What It Responds To
The system takes two live inputs from the user:
- Audio — your voice, including silence, laughter, filler sounds
- Motion — your facial movements via webcam (nods, expressions, head tilts)
The avatar generates responses to both simultaneously. If you nod, the avatar acknowledges. If you laugh, it reacts. If you start talking while the avatar is speaking, it can adapt. This is what they call multimodal responsiveness.
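As a toy illustration of the two input streams, the per-frame audio and facial-motion features can be fused into one conditioning vector for the generator (the feature dimensions and the `fuse_user_condition` name are made up for this sketch):

```python
import numpy as np

def fuse_user_condition(audio_feat: np.ndarray, motion_feat: np.ndarray) -> np.ndarray:
    """Hypothetical fusion: concatenate per-frame audio and facial-motion
    features into a single conditioning vector for the motion generator."""
    return np.concatenate([audio_feat, motion_feat], axis=-1)

audio = np.zeros((1, 32))   # e.g. a per-frame speech embedding (user may be silent)
motion = np.ones((1, 16))   # e.g. head pose + expression coefficients from webcam
cond = fuse_user_condition(audio, motion)
print(cond.shape)  # → (1, 48)
```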
The Active Listening Problem
Here’s a subtle but important contribution: the team noticed that most training data for talking-head models is heavily weighted toward speaking — people actively talking. Listening footage (nodding, reacting, staying engaged) is far less expressive and gets underrepresented.
Training on this imbalanced data produces avatars that go stiff or static when they’re supposed to be listening. They look bored.
The fix is clever. Instead of collecting labeled “good listening” data (expensive and subjective), they use Direct Preference Optimization (DPO) — a technique borrowed from RLHF in LLMs — to teach the model what good listening looks like.
They synthetically construct “losing samples” by dropping the user’s conditioning signal, producing deliberately passive, unexpressive motion. They then train the model to prefer the reactive version over the passive one. No human labels needed.
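The preference objective itself is standard DPO; here is a scalar sketch with made-up log-probabilities standing in for the model's scores (the actual diffusion-model formulation in the paper is more involved):

```python
import math

def dpo_loss(logp_win, logp_lose, logp_win_ref, logp_lose_ref, beta=0.1):
    """Standard DPO objective: push the model to prefer the reactive ("winning")
    sample over the passive ("losing") one, relative to a frozen reference model."""
    margin = beta * ((logp_win - logp_win_ref) - (logp_lose - logp_lose_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Winning sample: motion generated WITH the user's conditioning signal.
# Losing sample: the same clip regenerated with the conditioning dropped,
# yielding deliberately passive, unexpressive motion (no human labels needed).
loss = dpo_loss(logp_win=-1.0, logp_lose=-3.0,
                logp_win_ref=-2.0, logp_lose_ref=-2.0, beta=0.1)
print(round(loss, 4))  # → 0.5981
```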
The result: an avatar that stays engaged, mirrors your expressions, and actually looks like it’s paying attention when it’s listening.
Performance
| Metric | Avatar Forcing | Baseline (INFP) |
|---|---|---|
| Latency | ~500 ms | ~3.4 s |
| Speedup | 6.8x faster | — |
| Human preference | >80% preferred | — |
| GPU requirement | Single H100, ~14 GB VRAM | — |
It outperforms baselines on four metric categories: Reactiveness, Motion Richness, Visual Quality, and Lip Sync. The rPCC scores — measuring correlation between user motion and avatar motion — are notably strong.
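A correlation of that kind can be sketched as a plain Pearson coefficient between a user motion signal and the avatar's response over time (an illustrative stand-in, not the paper's exact rPCC protocol):

```python
import numpy as np

def motion_pcc(user_motion: np.ndarray, avatar_motion: np.ndarray) -> float:
    """Pearson correlation between a user motion signal and the avatar's
    response (e.g. head pitch over time). Higher = more synchronized reaction."""
    u = user_motion - user_motion.mean()
    a = avatar_motion - avatar_motion.mean()
    return float((u * a).sum() / (np.sqrt((u**2).sum() * (a**2).sum()) + 1e-8))

t = np.linspace(0, 2 * np.pi, 100)
user = np.sin(t)                   # e.g. the user nodding
avatar = 0.8 * np.sin(t) + 0.05    # avatar mirroring the nod, scaled and offset
print(round(motion_pcc(user, avatar), 3))  # → 1.0 (correlation is scale-invariant)
```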
How to Try It
The code is not yet released. The paper was accepted to CVPR 2026 and the authors note that code will be available soon.
In the meantime:
- Watch the demo videos on the project page — they're genuinely impressive. Pay attention to the listening segments specifically.
- Read the paper on arXiv — it's well-written, and the ablation studies clearly show the DPO contribution.
- Star/watch the project page — when code drops, it'll likely support any portrait + H100 (or equivalent) setup.
- Check HuggingFace at huggingface.co/papers/2601.00664 for community implementations that may appear before the official release.
Why This Matters
The practical applications are obvious: virtual assistants, telepresence, customer-facing AI agents, digital humans in games. But the technical contribution is worth appreciating on its own.
Getting causal generation to match the quality of bidirectional generation is hard. Adding real-time multimodal input on top is harder. Doing it at 500ms latency on a single GPU — rather than a cluster — is what makes deployment actually realistic.
The DPO trick for listening behavior is also a template worth noting. The problem of “model trained on imbalanced data produces lopsided behavior” appears everywhere in generative AI. Constructing synthetic losing samples to guide preference learning, without any human annotation, is a clean solution that will likely get reused.
Avatar Forcing is a research paper today. In 12 months, expect to see its ideas running in production video call software.
- Paper: arXiv:2601.00664
- Project page: taekyungki.github.io/AvatarForcing
- Code: Coming soon (CVPR 2026)
- Authors: Taekyung Ki et al. — KAIST, NTU Singapore, DeepAuto.ai