Reshoot-Anything: Refilm Any Video from a New Camera Angle

By Prahlad Menon · 6 min read

Imagine you shot a video on your phone — one camera, one angle. Now imagine you could refilm that exact scene from a completely different perspective: a sweeping drone shot, a dolly push-in, or a slow orbit around your subject. That’s exactly what Reshoot-Anything does.

Published as a CVPR Workshop 2026 paper by the team at Morphic Films, Reshoot-Anything is a self-supervised video diffusion model that takes a single monocular video and synthesizes a new version of it from an arbitrary camera trajectory. No multi-camera rig. No 3D scanning. Just your original clip and a new camera path.

Let’s break down how the pipeline works, stage by stage.

The Big Picture

The core insight is deceptively elegant: train a video diffusion model on pseudo multi-view triplets extracted from ordinary monocular footage. By forcing the model to reconstruct views it hasn’t seen, it learns implicit 4D spatiotemporal priors — an internal understanding of how scenes look from different angles and at different times.

At inference, a 4D point-cloud anchor derived from the source video provides geometric conditioning. You pick your new camera trajectory, and the model fills in the rest.

The pipeline has four stages, each with its own model and purpose.

Stage 1: Depth Estimation

Script: estimate_depth.py

Before you can reproject a video into 3D, you need per-frame depth maps. Reshoot-Anything offers two approaches:

GeometryCrafter + MoGe v2 (Default)

This is the recommended path. GeometryCrafter (from Tencent ARC) produces temporally consistent point maps across frames — critical for smooth 3D reconstruction. It’s then anchored to metric scale using MoGe v2 (from Microsoft), a monocular geometry estimator that provides depth at real-world scale.

Why two models? GeometryCrafter excels at temporal consistency (no flickering depth between frames), while MoGe provides accurate metric scale. Together, they give you both.
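The anchoring idea can be illustrated with a simple least-squares scale-and-shift fit between the two depth estimates. This is a generic sketch of that alignment step, not the actual GeometryCrafter/MoGe code; the function and array names are assumptions:

import numpy as np

def align_scale_shift(relative_depth, metric_depth, mask=None):
    """Fit s, t so that s * relative_depth + t best matches metric_depth
    in the least-squares sense (a common way to anchor relative depth)."""
    if mask is None:
        mask = np.isfinite(metric_depth) & (metric_depth > 0)
    x = relative_depth[mask].ravel()
    y = metric_depth[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)      # [N, 2]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * relative_depth + t

# Hypothetical usage: anchor a temporally consistent (but relative) depth
# frame to a metric estimate of the same frame.
# aligned = align_scale_shift(gc_depth_frame, moge_depth_frame)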

DepthCrafter (Alternative)

DepthCrafter is a simpler single-model alternative. It produces consistent video depth directly but may sacrifice some metric accuracy compared to the GeometryCrafter+MoGe combo.

Output: A depths.npz file containing depth maps of shape [T, H, W] in float32 — one depth map per frame.

python estimate_depth.py \
  --video input.mp4 \
  --output depths.npz \
  --method gc_moge  # or: depthcrafter
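If you want to sanity-check Stage 1 before moving on, the archive is plain NumPy and easy to inspect (a quick check, not part of the pipeline):

import numpy as np

data = np.load("depths.npz")
print(data.files)                       # names of the stored arrays
depths = data[data.files[0]]            # [T, H, W] float32, one map per frame
print(depths.shape, depths.dtype)
print("depth range:", float(depths.min()), "to", float(depths.max()))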

Stage 2: 4D Camera Trajectory Visualizer

Script: visualizer/app.py (launched via run_visualizer.sh)

This is where the creative direction happens. The depth maps from Stage 1 are unprojected into a 4D point cloud — a per-frame 3D reconstruction of your scene that you can scrub through time.
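Conceptually, the unprojection is standard pinhole geometry: a pixel (u, v) with depth d lifts to a 3D point via the camera intrinsics. A minimal NumPy sketch for a single frame (the intrinsics fx, fy, cx, cy are placeholders, not values the pipeline uses):

import numpy as np

def unproject_frame(depth, fx, fy, cx, cy):
    """Lift an [H, W] depth map to an [H*W, 3] point cloud in camera space."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

Repeating this per frame (and applying each frame's camera pose) gives the time-varying point cloud the visualizer lets you fly through.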

Built on Viser (from the Nerfstudio team), this browser-based visualizer lets you:

  • Orbit, pan, and zoom through the reconstructed scene
  • Scrub through frames to see the scene evolve over time
  • Place camera keyframes at arbitrary positions and orientations
  • Drag yellow gizmos to fine-tune keyframe placement
  • Preview the interpolated path before committing
  • Export the trajectory as cam_info.json

Think of it as a virtual camera placement tool inside your reconstructed scene. You’re essentially storyboarding a reshoot without leaving your browser.
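Previewing the interpolated path boils down to interpolating between keyframe poses: positions linearly (or with splines), orientations with slerp. A generic SciPy sketch; the keyframe values and timing here are made up, not the visualizer's internal format:

import numpy as np
from scipy.spatial.transform import Rotation, Slerp

# Hypothetical keyframes: three poses at normalized times 0, 0.5, 1
key_times = np.array([0.0, 0.5, 1.0])
key_pos = np.array([[0, 0, 0], [1, 0.5, 0], [2, 0, 1]], dtype=float)
key_rot = Rotation.from_euler("y", [0, 45, 90], degrees=True)

t = np.linspace(0, 1, 49)                 # one pose per output frame
positions = np.stack([np.interp(t, key_times, key_pos[:, i]) for i in range(3)], axis=1)
rotations = Slerp(key_times, key_rot)(t)  # smoothly interpolated orientations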

./run_visualizer.sh --video input.mp4 --depth depths.npz
# Open http://localhost:8080

Stage 3: Anchor Conditioning (Point-Cloud Rendering)

Script: render_from_cam_info.py

Now that you have a camera trajectory, the pipeline forward-warps the source video along that path. This creates an “anchor condition pack” — a rough rendering of what the scene looks like from the new viewpoints.

The rendering uses Uni3C’s point rasterization backend, with two options:

  • gpu_point — GPU-accelerated via PyTorch3D (faster, requires PyTorch3D)
  • numpy — CPU fallback (slower, no extra dependencies)

The output is a self-contained condition pack:

  • render.mp4: novel-view anchor render (black background)
  • render_mask.mp4: hole mask, where white marks disoccluded/missing pixels
  • render_pink.mp4: anchor render on a pink background (alternative conditioning)
  • reference.png: first source frame, for identity preservation
  • cam_info.json: camera parameters at render resolution

The key thing to understand: this rendering is incomplete by design. When you change the camera angle, parts of the scene that were occluded in the original become visible — these appear as holes. That’s where Stage 4 comes in.
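The numpy backend's job can be pictured as a point splat: project every source point into the new camera, let the nearest point win each pixel, and flag pixels nothing landed on as holes. A simplified illustration of forward warping, not Uni3C's actual rasterizer:

import numpy as np

def splat(points, colors, K, R, t, H, W):
    """Project 3D world points into a camera (intrinsics K, pose R, t) and
    splat their colors, nearest point winning. Returns image and hole mask."""
    cam = points @ R.T + t                        # world -> camera coordinates
    z = cam[:, 2]
    keep = z > 1e-6                               # drop points behind the camera
    uv = (cam[keep] / z[keep, None]) @ K.T        # pinhole projection
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v = u[inside], v[inside]
    zc, col = z[keep][inside], colors[keep][inside]

    image = np.zeros((H, W, 3), dtype=colors.dtype)
    hole = np.ones((H, W), dtype=bool)            # True = no point projected here
    order = np.argsort(-zc)                       # far to near, so near points overwrite
    image[v[order], u[order]] = col[order]
    hole[v[order], u[order]] = False
    return image, hole                            # hole plays the role of render_mask.mp4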

python render_from_cam_info.py \
  --video input.mp4 \
  --depth depths.npz \
  --cam-info cam_info.json \
  --output-dir render_outputs/my_scene \
  --backend gpu_point

Stage 4: Video Diffusion Inference

Script: inference_wan22_v2v_local.py (launched via run_wan22_inference.sh)

This is the brain of the operation. Reshoot-Anything fine-tunes WAN 2.2 — Alibaba’s 14-billion-parameter image-to-video diffusion model — with LoRA adapters specifically trained for the reshooting task.

The model receives three conditioning inputs:

  1. Source video — the original footage
  2. Anchor render — the rough point-cloud rendering from Stage 3
  3. Reference frame — the first frame, for maintaining visual identity

It then synthesizes a complete, photorealistic video from the new trajectory, hallucinating the disoccluded regions (those holes from Stage 3) while keeping everything consistent with the source.
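Concretely, those three inputs are files the earlier stages already produced. A purely illustrative loading snippet (file paths follow the example commands above; the actual inference script handles this internally):

import numpy as np
import imageio.v3 as iio

source    = np.stack(list(iio.imiter("input.mp4")))                       # [T, H, W, 3]
anchor    = np.stack(list(iio.imiter("render_outputs/my_scene/render.mp4")))
reference = iio.imread("render_outputs/my_scene/reference.png")           # [H, W, 3]
print(source.shape, anchor.shape, reference.shape)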

The Self-Supervised Training Trick

What makes this approach special is how the model was trained. Instead of requiring multi-camera datasets (expensive, rare), the team extracts pseudo multi-view triplets from ordinary monocular video:

  1. Take a video clip
  2. Pick three nearby frames
  3. Treat them as “multi-view” observations of roughly the same scene
  4. Train the model to reconstruct one view given the other two as conditioning

This self-supervised formulation means the model can be trained on virtually unlimited internet video data, learning robust 4D priors without any 3D supervision.
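A toy sketch of the triplet sampling (the frame spacing and randomization here are illustrative; the paper's actual strategy may differ):

import numpy as np

def sample_triplet(num_frames, max_gap=8, rng=np.random):
    """Pick three nearby frames from a monocular clip; one becomes the
    reconstruction target, the other two become conditioning views."""
    center = rng.randint(max_gap, num_frames - max_gap)
    offsets = rng.choice(np.arange(1, max_gap + 1), size=2, replace=False)
    indices = np.array([center - offsets[0], center, center + offsets[1]])
    target = rng.randint(3)
    conditioning = [i for i in range(3) if i != target]
    return indices, target, conditioning

During training, the model sees the two conditioning frames and is asked to denoise its way to the held-out one; do this at scale and the 4D prior emerges.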

Two-Stage LoRA

The inference uses two separate LoRA weight sets:

  • High-noise LoRA — handles the early denoising steps (coarse structure, layout)
  • Low-noise LoRA — handles the later steps (fine details, textures)

This split allows different specializations at different noise levels, improving overall quality.
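Conceptually, the split is just a threshold on the sampling schedule. A purely illustrative sketch; the boundary value and adapter names are assumptions:

def pick_lora(step, total_steps, boundary=0.5):
    """Choose which LoRA to apply at a given denoising step. Early
    (high-noise) steps set coarse structure; late (low-noise) steps
    refine detail, so each range gets its own adapter."""
    return "high_noise_lora" if step / total_steps < boundary else "low_noise_lora"

For example, with 40 sampling steps and boundary=0.5, steps 0-19 would use the high-noise adapter and steps 20-39 the low-noise one.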

Hardware Requirements

The default launcher assumes 8 GPUs with Ulysses parallelism for the 14B-parameter model. The base WAN 2.2 checkpoint (~28GB) auto-downloads from HuggingFace. This is not a laptop project — you’ll want a serious multi-GPU setup or cloud instance.

Optional: Prompt Extension

You can enhance your text prompts using LLM-based expansion:

  • Local Qwen — runs a ~14GB model locally for prompt enrichment (sketched below)
  • DashScope API — uses Alibaba’s cloud API (requires API key)
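For the local option, prompt expansion is essentially a single LLM call. A generic Hugging Face transformers sketch (the model id and instruction wording are placeholders, not the pipeline's actual template):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-14B-Instruct"   # placeholder model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "Expand this video prompt with concrete visual detail."},
    {"role": "user", "content": "a slow orbit around a skateboarder at sunset"},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))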

Putting It All Together

The full workflow:

  1. Shoot a video on any camera (phone, DSLR, GoPro — anything monocular)
  2. Estimate depth — extract per-frame metric depth maps
  3. Design your reshoot — place virtual camera keyframes in the 4D point cloud
  4. Render anchors — forward-warp the source along your new trajectory
  5. Synthesize — let the diffusion model fill in the gaps and produce the final reshot video

What you end up with is a photorealistic video of the same scene, same action, same moment — but filmed from a camera position that never existed.

Why This Matters

This sits at the intersection of several trends:

  • Democratizing cinematography — reshoot footage without reshooting
  • Self-supervised 3D understanding — no expensive multi-view datasets needed
  • Video diffusion maturity — WAN 2.2 as a base shows how capable these models have become
  • Practical open-source pipelines — not just a paper, but working code with a clear four-stage architecture

The pretrained LoRA weights and training code are still pending release, but the inference pipeline and all supporting tools (depth estimation, visualizer, rendering) are available now.

For filmmakers, VFX artists, and content creators, this could fundamentally change how “coverage” works. Shoot once, reshoot from any angle in post.

GitHub: morphicfilms/video-to-video
Paper: CVPR Workshop 2026