Recent advances in driving world models enable controllable generation of high-quality RGB or multimodal videos. Existing methods focus primarily on metrics of generation quality and controllability, yet they often overlook evaluation on downstream perception tasks, which are crucial for autonomous driving performance. Moreover, existing methods typically adopt a training strategy that first pretrains on synthetic data and then finetunes on real data, which amounts to twice as many training epochs as the baseline trained on real data only; when the baseline's epochs are doubled to match, the benefit of synthetic data becomes negligible. To rigorously demonstrate the value of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed to enhance downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders 3D assets onto these guidance maps. Finally, a driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner-case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. Comprehensive experiments show that Dream4Drive effectively boosts the performance of downstream perception models under various training-epoch budgets. Project page: https://wm-research.github.io/Dream4Drive/ GitHub: https://github.com/wm-research/Dream4Drive
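To make the three-stage pipeline concrete, below is a minimal Python sketch of the data-generation loop. All names here (`decompose_to_guidance_maps`, `render_asset`, `EditWorldModel`, `Asset3D`) are hypothetical placeholders invented for exposition; the abstract specifies only the stages (guidance-map decomposition, 3D-asset rendering, world-model generation), not an API, so this is a sketch under those assumptions rather than the released implementation.

```python
from dataclasses import dataclass

# --- Hypothetical containers; field names are assumptions, not the paper's API ---

@dataclass
class GuidanceMaps:
    """3D-aware guidance maps decomposed from a multi-view clip."""
    depth: object = None      # e.g., per-view depth maps
    semantics: object = None  # e.g., per-view semantic layouts

@dataclass
class Asset3D:
    """A 3D asset drawn from a library such as DriveObj3D (placeholder fields)."""
    category: str = "vehicle"
    mesh: object = None

def decompose_to_guidance_maps(video) -> GuidanceMaps:
    """Stage 1: decompose the input video into 3D-aware guidance maps (stub)."""
    return GuidanceMaps()

def render_asset(maps: GuidanceMaps, asset: Asset3D, pose) -> GuidanceMaps:
    """Stage 2: render the 3D asset onto the guidance maps at a chosen pose (stub)."""
    return maps  # a real renderer would composite the asset into the maps here

class EditWorldModel:
    """Stage 3: a fine-tuned driving world model that turns edited guidance
    maps into multi-view photorealistic videos (stubbed)."""
    def generate(self, maps: GuidanceMaps):
        return "edited_multiview_video"

def synthesize_corner_case(video, asset: Asset3D, pose, model: EditWorldModel):
    maps = decompose_to_guidance_maps(video)
    edited = render_asset(maps, asset, pose)
    return model.generate(edited)  # training data for downstream perception

if __name__ == "__main__":
    clip = synthesize_corner_case("real_clip", Asset3D(), pose=None,
                                  model=EditWorldModel())
    print(clip)
```

Under this reading, the epoch-matched comparison from the abstract would train the real-data-only baseline for 2N epochs and compare it against N epochs of pretraining on clips produced by `synthesize_corner_case` followed by N epochs of finetuning on real data.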