Recent advances in video diffusion models have shown an exceptional ability to simulate real-world dynamics and maintain 3D consistency. This progress motivates us to investigate whether these models can also ensure dynamic consistency across viewpoints, a highly desirable property for applications such as virtual filming. Unlike existing methods that focus on multi-view generation of single objects for 4D reconstruction, we aim to generate open-world videos from arbitrary viewpoints with 6-DoF camera poses. To this end, we propose a plug-and-play module that augments a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we design a hybrid training scheme that leverages multi-camera images and monocular videos to supplement Unreal Engine-rendered multi-camera videos. Furthermore, our method enables intriguing extensions, such as re-rendering a video from novel viewpoints. We also release a multi-view synchronized video dataset, named SynCamVideo-Dataset. Project page: https://jianhongbai.github.io/SynCamMaster/.
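To make the idea of a multi-view synchronization module concrete, the following is a minimal, hypothetical sketch; the abstract does not specify the module's internal design, so this assumes one plausible realization, cross-view self-attention inserted with a residual connection into a pre-trained backbone, where tokens at the same spatio-temporal position attend to their counterparts in all camera views.

```python
# Hypothetical sketch of a multi-view synchronization layer (not the paper's
# actual implementation). Assumes features shaped (batch, views, tokens, channels).
import torch
import torch.nn as nn


class MultiViewSyncModule(nn.Module):
    """Cross-view self-attention: each token attends to the corresponding
    tokens in all camera views, encouraging consistent appearance and
    geometry across viewpoints."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, v, n, c = x.shape
        # Move the view axis into the sequence position: (b * n, v, c),
        # so attention mixes information across views, not across tokens.
        h = x.permute(0, 2, 1, 3).reshape(b * n, v, c)
        h = self.norm(h)
        out, _ = self.attn(h, h, h)
        out = out.reshape(b, n, v, c).permute(0, 2, 1, 3)
        # Residual connection keeps the pre-trained features intact,
        # matching the plug-and-play goal described in the abstract.
        return x + out


# Smoke test: shapes are preserved through the module.
feats = torch.randn(2, 4, 16, 64)  # 2 samples, 4 views, 16 tokens, 64 channels
synced = MultiViewSyncModule(64)(feats)
print(synced.shape)  # torch.Size([2, 4, 16, 64])
```

Because the module is purely residual, initializing its output projection to zero would leave the pre-trained model's behavior unchanged at the start of fine-tuning, a common strategy for plug-and-play adapters.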