LongLive: NVIDIA's Real-Time Interactive Video Generation You Can Actually Steer
Every video generation tool you’ve used works the same way: you write a prompt, you wait, you get a clip. If you don’t like it or want to take the story somewhere different, you start over.
LongLive — from NVIDIA, MIT, HKUST, and Tsinghua — does something fundamentally different. It generates video frame by frame in real time, and it accepts new text prompts while it’s generating. The video responds to your input as it plays. You’re not rendering a scene — you’re having a conversation with it.
Accepted at ICLR 2026. Apache 2.0 license. 1.3B parameters.
What “interactive” actually means
Most video generation uses diffusion models — they work backwards from noise to a complete video clip, requiring the entire clip to be computed before you see anything. This makes them high-quality but fundamentally non-interactive. You can’t steer a diffusion model mid-render any more than you can steer a photo mid-exposure.
LongLive uses a frame-level autoregressive architecture instead. It generates one frame at a time, left to right, using causal attention — the same pattern transformers use for text. Because it generates sequentially, it can:
- Accept a new prompt at any point mid-generation
- Transition smoothly to the new scene using a KV-recache mechanism that refreshes its internal state with the new prompt
- Continue generating without restarting, maintaining visual consistency with what came before
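The loop above can be sketched in a few lines. This is a toy illustration of frame-by-frame generation with a mid-stream prompt switch, not the LongLive API: every function here (`encode_prompt`, `generate_frame`, `recache`) is a hypothetical stand-in.

```python
# Toy sketch of frame-level autoregressive generation with mid-stream
# prompt switching. All names are illustrative, not LongLive's API.

def encode_prompt(text: str) -> list[float]:
    # Stand-in "embedding": one number per character.
    return [float(ord(c)) for c in text]

def generate_frame(kv_cache: list, prompt_emb: list[float]) -> str:
    # A real model would attend over kv_cache + prompt; here we just
    # record which prompt conditioned this frame.
    frame = f"frame-{len(kv_cache)}:{len(prompt_emb)}"
    kv_cache.append(frame)            # causal: each frame extends the cache
    return frame

def recache(kv_cache: list, prompt_emb: list[float]) -> None:
    # KV-recache: refresh cached states against the new prompt instead of
    # clearing the cache and restarting (modeled here as a tag rewrite).
    kv_cache[:] = [f"{f}|refreshed" for f in kv_cache]

kv_cache: list = []
prompt = encode_prompt("a woman walking through a rainy city street")
frames = []
for t in range(20):
    if t == 10:                       # user types a new prompt mid-stream
        prompt = encode_prompt("she turns into an alley and finds an old door")
        recache(kv_cache, prompt)     # smooth transition, no restart
    frames.append(generate_frame(kv_cache, prompt))

print(len(frames))  # 20 frames, one continuous generation
```

The key property: the prompt switch mutates the cache in place, so generation never restarts and earlier frames remain in context.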
In practice: you start generating “a woman walking through a rainy city street.” After 10 seconds you type “she turns into an alley and finds an old door.” The video transitions. No cut. No restart. The generation continues as if the narrative always went that direction.
Real-world examples from the demos
The project page shows what this looks like with real prompts:
- 60 seconds: A product demo video. “A woman introducing the iPhone 15 on Shopee, demonstrating the camera and display.” — you could redirect this mid-stream: “she now shows the battery life, background shows the Shopee checkout page.”
- 60 seconds: “Batman and Joker fight scene in GTA V graphic style, gritty urban environment.” — you could steer the fight outcome, change the location, shift the tone.
- Narrative arc: Start with “establishing shot of a city at dusk,” transition to “street level, heavy rain begins,” transition to “interior of a bar, character sits down” — each prompt handed off mid-stream.
The YouTube demo shows the transitions in action. They’re notably smooth — the KV-recache mechanism handles prompt switches without jarring visual cuts.
The technical design (simplified)
Three things make real-time interactive generation work:
KV-recache: When you submit a new prompt mid-generation, LongLive doesn’t throw away its internal state and restart. It refreshes the cached attention states with the new prompt, allowing smooth visual transitions that respect both the new direction and the visual context already established.
Short window attention + frame sink: Instead of computing attention across the entire video history (which gets expensive fast), LongLive uses a short sliding window plus a “frame sink” — a few key anchor frames from early in the video that are always kept in context. This preserves long-range visual consistency without the cost of full attention over hundreds of frames.
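The attention pattern is easy to visualize as a boolean mask. A minimal sketch, with illustrative window and sink sizes (LongLive's actual values will differ):

```python
# Each frame attends to a short sliding window of recent frames plus a
# few fixed "sink" frames from the start of the video. Window and sink
# sizes here are made up for illustration.

def attention_mask(num_frames: int, window: int = 4, sink: int = 2):
    """mask[q][k] is True if frame q may attend to frame k."""
    mask = [[False] * num_frames for _ in range(num_frames)]
    for q in range(num_frames):
        for k in range(num_frames):
            causal = k <= q              # never look at future frames
            in_window = q - k < window   # recent history
            is_sink = k < sink           # anchor frames, always visible
            mask[q][k] = causal and (in_window or is_sink)
    return mask

m = attention_mask(10)
# Frame 9 sees sinks 0-1 and recent frames 6-9, nothing in between:
print([k for k in range(10) if m[9][k]])  # [0, 1, 6, 7, 8, 9]
```

Per-frame attention cost stays O(window + sink) instead of growing with video length, which is what keeps generation real-time over hundreds of frames.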
Streaming long tuning: The model was trained on 5-second clips but fine-tuned to generate long videos by reusing the KV cache across clip boundaries — effectively learning to maintain coherence across much longer timescales than it was originally trained on. Fine-tuning took 32 H100 GPU-days.
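The inference-time effect of that tuning can be sketched as clips generated back to back with the cache carried across boundaries. Clip length and count here are illustrative:

```python
# Clips are generated one after another, but the KV cache is carried
# across clip boundaries rather than reset, so a long video is one
# continuous generation.

FRAMES_PER_CLIP = 5   # the base model was trained on short clips

def generate_clip(kv_cache: list) -> list:
    start = len(kv_cache)
    for t in range(start, start + FRAMES_PER_CLIP):
        kv_cache.append(f"frame-{t}")   # each frame conditions on the cache
    return kv_cache[start:]

cache: list = []
video: list = []
for _ in range(4):                      # 4 clips, one continuous cache
    video.extend(generate_clip(cache))  # no reset at the boundary

print(len(video), video[-1])  # 20 frame-19
```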
What you actually need to run it
Hardware requirements
The official spec is strict:
- GPU: NVIDIA GPU with 40GB+ VRAM (A100 or H100 tested)
- RAM: 64GB system RAM
- OS: Linux
- CUDA: 12.4
A 24GB RTX 4090 does not meet the VRAM requirement. A 16GB RTX 4080 definitely doesn’t. This is a research-grade workload.
INT8 quantization is supported with marginal quality loss and reduces VRAM requirements — it may bring a 4090 into range, but the official team hasn’t tested this. Worth trying if you have one.
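A back-of-envelope check shows why quantization alone doesn't close the gap. This estimates weight memory only; the 40GB requirement is dominated by activations and the KV cache, which this arithmetic does not model:

```python
# Rough weight-memory estimate for a 1.3B-parameter model.
# Weights are a small fraction of the 40GB VRAM requirement.

params = 1.3e9
weights_fp16_gb = params * 2 / 1e9   # 2 bytes per parameter
weights_int8_gb = params * 1 / 1e9   # 1 byte per parameter
print(f"fp16 weights: ~{weights_fp16_gb:.1f} GB, int8: ~{weights_int8_gb:.1f} GB")
```

With weights at ~2.6 GB in fp16, the bulk of the VRAM budget goes to activations and cached attention states, which is why a 24GB card is still borderline even quantized.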
Running locally (if you have an A100/H100)
```bash
git clone https://github.com/NVlabs/LongLive
cd LongLive

# Create environment
conda create -n longlive python=3.10 -y
conda activate longlive
conda install nvidia/label/cuda-12.4.1::cuda
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

# Download model weights (~10GB total)
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir wan_models/Wan2.1-T2V-1.3B
huggingface-cli download Efficient-Large-Model/LongLive --local-dir longlive_models

# Generate a single video
bash inference.sh

# Interactive mode — accepts prompts mid-generation
bash interactive_inference.sh
```
Running on cloud (the practical path for most people)
If you don’t have an A100 on your desk, renting one for a few hours is the realistic option:
| Provider | Instance | VRAM | Cost | Notes |
|---|---|---|---|---|
| Lambda Labs | A100 80GB | 80GB | ~$2.49/hr | Most reliable, good availability |
| RunPod | A100 80GB | 80GB | ~$2.09/hr | Spot pricing available, cheaper |
| Vast.ai | A100 80GB | 80GB | ~$1.50–2/hr | Cheapest, community GPUs |
| Google Colab Pro+ | A100 | 40GB | ~$50/mo | Limited session length, may interrupt |
For a typical 1–2 hour session generating demo videos: $3–6 on RunPod or Vast.ai. Lambda Labs is more reliable if availability is an issue.
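The session-cost arithmetic, using the approximate hourly rates from the table (the Vast.ai figure is a mid-range assumption; all rates drift):

```python
# Rough per-session cost from the table's approximate hourly rates.

rates = {"Lambda Labs": 2.49, "RunPod": 2.09, "Vast.ai": 1.75}
for hours in (1, 2):
    costs = {p: round(r * hours, 2) for p, r in rates.items()}
    print(f"{hours}h session:", costs)
```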
Recommended cloud workflow:
- Rent an A100 80GB on RunPod or Lambda Labs
- Use a PyTorch 2.x + CUDA 12.4 base image (both providers offer this as a template)
- Clone the repo, run the pip installs, download weights
- Run `interactive_inference.sh` and start generating
Total setup time from cold start: ~20 minutes (mostly model download).
The interactive UI
The official repo uses command-line scripts. For a more usable experience, community member @yondonfu built Scope — a web UI wrapper for LongLive:
github.com/daydreamlive/scope
It gives you a browser-based text input that feeds prompts to the running LongLive process mid-generation — closer to what the demos show.
License and limitations
License: Apache 2.0 — changed from CC-BY-NC-SA in November 2025. Fully open-source including commercial use. The base model (Wan2.1-T2V-1.3B) is also Apache 2.0.
Real limitations to know:
- Resolution is moderate — this is a 1.3B parameter model, not a Sora-class system
- The 20.7 FPS figure is on H100. On A100 it’ll be slower — still real-time capable but with less headroom
- Visual quality degrades on very long generations even with frame sink (the 240s max is a hard ceiling on a single H100)
- The demos are cherry-picked. Prompt transitions in practice require some iteration to get right
Why this matters
The batch model of video generation — prompt → wait → clip — is a holdover from how diffusion models work, not from any fundamental constraint. LongLive demonstrates that frame-level autoregressive generation can run fast enough to be interactive, and that interactive video can be made visually coherent across prompt transitions.
The practical use cases this enables that weren’t possible before: real-time interactive game cinematics, live streaming content that responds to viewer input, dynamic video backgrounds that change based on context, narrative video tools where the creator directs rather than prompts.
The hardware barrier is real today — you need a data-center GPU to run it well. That will change. The architecture is sound, the model is 1.3B parameters, and quantization is already reducing requirements. The pattern LongLive demonstrates — interactive video as a real-time dialogue — will reach consumer hardware within a year or two.
Sources: GitHub — NVlabs/LongLive · arXiv paper · Project page · HuggingFace model · Scope UI
Related: Avatar Forcing — Real-Time Interactive Head Avatars · LuxTTS — 150x Realtime Voice Cloning