Given a monocular video, the goal of video re-rendering is to generate views of the scene along a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. Geometrically conditioned models, on the other hand, depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by conditioning the video generation process on the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model. These latents capture scene structure in a continuous space without explicit reconstruction. They therefore provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task. Project webpage: https://lavr-4d-scene-rerender.github.io/.