CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation

Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly because it requires constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring a dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages an implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features implicitly: by encoding scene images into visual representations with VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model through additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, and the corresponding camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and generalizing across diverse environments.
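
The conditioning idea above can be illustrated with a minimal sketch. It is not the CineScene implementation: the VGGT encoder and the video backbone are not reproduced, the function names `shuffle_scene_images` and `concat_context` are hypothetical, and appending scene tokens along the sequence axis is an assumption about what "context concatenation" means here. Plain Python lists stand in for feature tensors.

```python
import random
from typing import List, Sequence

Token = List[float]  # one feature vector; a stand-in for a real tensor row


def shuffle_scene_images(
    scene_feats: Sequence[List[Token]], rng: random.Random
) -> List[List[Token]]:
    """Return the per-image token lists in a random order.

    Assumed to be applied only at training time, so the model does not
    rely on any particular ordering of the input scene images.
    """
    shuffled = list(scene_feats)  # copy, so the caller's data is not mutated
    rng.shuffle(shuffled)
    return shuffled


def concat_context(
    video_tokens: List[Token], scene_feats: Sequence[List[Token]]
) -> List[Token]:
    """Append all scene tokens after the video tokens (assumed sequence axis)."""
    context = [tok for image in scene_feats for tok in image]
    return video_tokens + context


# Toy usage: three scene images with one token each, two video tokens.
rng = random.Random(0)
scene = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]
video = [[0.5, 0.5], [0.2, 0.8]]
seq = concat_context(video, shuffle_scene_images(scene, rng))
```

In this sketch the video tokens always keep their positions at the front of the sequence, and only the order of the scene images varies between training samples.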
