Recent progress in driving video generation has shown significant potential for enhancing self-driving systems by providing scalable and controllable training data. Although pretrained state-of-the-art generative models, guided by 2D layout conditions (e.g., HD maps and bounding boxes), can produce photorealistic driving videos, achieving controllable multi-view videos with high 3D consistency remains a major challenge. To tackle this, we introduce CoGen, a novel spatially adaptive generation framework that leverages advances in 3D generation to improve performance in two key respects: (i) to ensure 3D consistency, we first generate high-quality, controllable 3D conditions that capture the geometry of driving scenes; by replacing coarse 2D conditions with these fine-grained 3D representations, our approach significantly enhances the spatial consistency of the generated videos; (ii) we further introduce a consistency adapter module to strengthen the model's robustness under multi-condition control. Experimental results demonstrate that CoGen excels at preserving geometric fidelity and visual realism, offering a reliable video generation solution for autonomous driving.
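The abstract leaves the internal design of the consistency adapter unspecified. Purely as an illustrative sketch, a zero-initialized residual injection adapter, in the spirit of ControlNet-style conditioning, could look as follows in PyTorch; the class name, the channel-concatenation scheme for combining multiple 3D condition maps, and the fusion strategy are all assumptions, not the paper's reported design.

```python
import torch
import torch.nn as nn

class ConsistencyAdapter(nn.Module):
    """Hypothetical sketch of a condition-injection adapter.

    Fuses features from fine-grained 3D conditions (e.g., rendered
    semantic and depth maps, concatenated along the channel axis)
    into the video backbone's hidden states through a zero-initialized
    projection, so training starts from the unmodified pretrained model.
    """
    def __init__(self, cond_channels: int, hidden_channels: int):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(cond_channels, hidden_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden_channels, hidden_channels, 3, padding=1),
        )
        # Zero-init makes the adapter a no-op at the start of training,
        # preserving the pretrained backbone's behavior.
        self.proj = nn.Conv2d(hidden_channels, hidden_channels, 1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # hidden: (B, C, H, W) backbone features;
        # cond:   (B, Cc, H, W) stacked 3D condition maps.
        return hidden + self.proj(self.encode(cond))
```

The zero-initialized residual path is a common choice for attaching new conditioning branches to a frozen or pretrained generator, since it avoids disturbing the model before the adapter has learned anything useful; whether CoGen uses this mechanism is not stated in the abstract.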