While controllable generative models for images and videos have achieved remarkable success, high-quality models for 3D scenes, particularly in unbounded scenarios like autonomous driving, remain underdeveloped due to high data acquisition costs. In this paper, we introduce MagicDrive3D, a novel pipeline for controllable 3D street scene generation that supports multi-condition control, including BEV maps, 3D objects, and text descriptions. Unlike previous methods that reconstruct scenes before training generative models, MagicDrive3D first trains a video generation model and then reconstructs from the generated data. This approach enables easily controllable generation and static scene acquisition, resulting in high-quality scene reconstruction. To address minor errors in the generated content, we propose deformable Gaussian splatting with monocular depth initialization and appearance modeling to manage exposure discrepancies across viewpoints. Validated on the nuScenes dataset, MagicDrive3D generates diverse, high-quality 3D driving scenes that support rendering from any viewpoint and enhance downstream tasks such as BEV segmentation. Our results demonstrate the framework's superior performance, showcasing its transformative potential for autonomous driving simulation and beyond.
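To make the reconstruction stage more concrete, the sketch below illustrates (under our own assumptions, not the authors' released code) the three ingredients named above: Gaussian centers initialized by back-projecting a monocular depth map, a per-frame deformation field for the generated content's minor inconsistencies, and a per-camera appearance embedding that compensates exposure discrepancies across viewpoints. All class and variable names (`DeformableGaussians`, `init_points_from_depth`, `n_cams`, etc.) are illustrative, and the rasterization step of Gaussian splatting is omitted.

```python
# Minimal sketch, assuming a PyTorch setup; not MagicDrive3D's actual implementation.
import torch
import torch.nn as nn


def init_points_from_depth(depth, K, cam_to_world):
    """Back-project a monocular depth map (H, W) into world-space points.

    K: (3, 3) camera intrinsics; cam_to_world: (4, 4) camera pose.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()       # pixel homogeneous coords
    rays = pix @ torch.linalg.inv(K).T                                   # camera-space ray directions
    pts_cam = rays * depth.unsqueeze(-1)                                 # scale rays by predicted depth
    pts_h = torch.cat([pts_cam, torch.ones_like(depth).unsqueeze(-1)], dim=-1)
    return (pts_h.reshape(-1, 4) @ cam_to_world.T)[:, :3]                # (H*W, 3) world points


class DeformableGaussians(nn.Module):
    """Canonical Gaussian centers plus a learned per-frame deformation and a
    per-camera appearance code that modulates color (exposure compensation)."""

    def __init__(self, init_xyz, n_frames, n_cams, appear_dim=16):
        super().__init__()
        n = init_xyz.shape[0]
        self.xyz = nn.Parameter(init_xyz)                        # canonical centers from depth init
        self.delta = nn.Parameter(torch.zeros(n_frames, n, 3))   # per-frame deformation offsets
        self.color = nn.Parameter(torch.rand(n, 3))              # base RGB per Gaussian
        self.appear = nn.Embedding(n_cams, appear_dim)           # per-camera appearance embedding
        self.appear_mlp = nn.Sequential(                         # maps code -> affine color transform
            nn.Linear(appear_dim, 32), nn.ReLU(), nn.Linear(32, 6)
        )

    def forward(self, frame_id, cam_id):
        pos = self.xyz + self.delta[frame_id]                    # deformed centers for this frame
        a = self.appear_mlp(self.appear(torch.tensor([cam_id]))).squeeze(0)
        scale, bias = a[:3], a[3:]
        rgb = torch.sigmoid(self.color) * (1 + scale) + bias     # exposure-adjusted color
        return pos, rgb                                          # fed to a Gaussian rasterizer (omitted)
```

In this sketch the deformation offsets absorb small geometric inconsistencies between generated frames, while the appearance MLP applies a per-camera affine color correction, which is one common way to model exposure differences across views.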