Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce See4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate See4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.