3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

We present 3DScenePrompt, a framework that generates the next video chunk from an arbitrary-length input video while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond the temporal boundary of the input, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we combine dynamic SLAM with a newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from the temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page: https://cvlab-kaist.github.io/3DScenePrompt/
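To make the idea of projecting a static-only scene memory to a target viewpoint concrete, the following is a minimal sketch and not the authors' implementation. It assumes the scene memory is a colored point cloud with a per-point dynamic flag, a pinhole camera with intrinsics `K`, and a world-to-camera pose `(R, t)`; the function name `render_static_prompt` and the nearest-point z-buffer are illustrative choices made here, not details taken from the abstract.

```python
import numpy as np

def render_static_prompt(points, colors, is_dynamic, K, R, t, height, width):
    """Project static 3D points into a target camera, keeping the nearest point per pixel.

    points: (N, 3) world coordinates; colors: (N, 3); is_dynamic: (N,) bool.
    K: (3, 3) pinhole intrinsics; R (3, 3), t (3,): world-to-camera pose.
    Returns a warped image (H, W, 3) and a boolean coverage mask (H, W).
    """
    # Dynamic masking (assumed given): drop points flagged as moving.
    keep = ~np.asarray(is_dynamic, dtype=bool)
    pts = np.asarray(points, dtype=float)[keep]
    cols = np.asarray(colors, dtype=float)[keep]

    # World -> camera coordinates; discard points behind the camera.
    cam = pts @ R.T + t
    in_front = cam[:, 2] > 1e-6
    cam, cols = cam[in_front], cols[in_front]

    # Perspective projection to integer pixel coordinates.
    pix = cam @ K.T
    u = np.round(pix[:, 0] / pix[:, 2]).astype(int)
    v = np.round(pix[:, 1] / pix[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, cols = u[inside], v[inside], cam[inside, 2], cols[inside]

    # Z-buffer: sort near-to-far, then keep the first point landing on each pixel.
    order = np.argsort(z, kind="stable")
    flat = (v * width + u)[order]
    pixels, first = np.unique(flat, return_index=True)

    image = np.zeros((height * width, 3), dtype=float)
    covered = np.zeros(height * width, dtype=bool)
    image[pixels] = cols[order][first]
    covered[pixels] = True
    return image.reshape(height, width, 3), covered.reshape(height, width)
```

Pixels left uncovered by the mask correspond to regions where the static memory offers no evidence (disocclusions or removed dynamic content); in the framing of the abstract, those are the regions the generator would fill from temporal context.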
