Camera control has been extensively studied in conditional video generation; however, precisely altering camera trajectories while faithfully preserving video content remains a challenging task. The mainstream approach to precise camera control warps a 3D representation according to the target trajectory. However, such methods fail to fully leverage the 3D priors of video diffusion models (VDMs) and often fall into the Inpainting Trap, resulting in subject inconsistency and degraded generation quality. To address this problem, we propose DepthDirector, a video re-rendering framework with precise camera controllability. By leveraging depth videos rendered from an explicit 3D representation as camera-control guidance, our method faithfully reproduces the dynamic scene of an input video under novel camera trajectories. Specifically, we design a View-Content Dual-Stream Condition mechanism that injects both the source video and the warped depth sequence rendered under the target viewpoint into a pretrained video generation model. This geometric guidance signal enables VDMs to comprehend camera movements and exploit their 3D understanding capabilities, thereby facilitating precise camera control and consistent content generation. We further introduce a lightweight LoRA-based video diffusion adapter to train our framework, fully preserving the knowledge priors of VDMs. Additionally, we construct a large-scale multi-camera synchronized dataset named MultiCam-WarpData using Unreal Engine 5, containing 8K videos across 1K dynamic scenes. Extensive experiments show that DepthDirector outperforms existing methods in both camera controllability and visual quality. Our code and dataset will be made publicly available.