Using generative models to synthesize new data has become a de facto standard in autonomous driving for addressing data scarcity. Although existing approaches are able to boost perception models, we find that they fail to improve the planning performance of end-to-end autonomous driving models, because the generated videos are usually shorter than 8 frames and exhibit non-negligible spatial and temporal inconsistencies. To this end, we propose Delphi, a novel diffusion-based long video generation method with a shared noise modeling mechanism across multiple views to increase spatial consistency, and a feature-aligned module that achieves both precise controllability and temporal consistency. Our method can generate up to 40 frames of video without loss of consistency, about 5 times longer than state-of-the-art methods. Instead of generating new data at random, we further design a sampling policy that lets Delphi generate new data similar to failure cases, improving sample efficiency. This is achieved by building a failure-case driven framework with the help of pre-trained vision-language models. Our extensive experiments demonstrate that Delphi generates higher-quality long videos, surpassing previous state-of-the-art methods. Consequently, by generating only 4% of the training dataset size, our framework is able to go beyond perception and prediction tasks and, for the first time to the best of our knowledge, boost the planning performance of an end-to-end autonomous driving model by a margin of 25%.
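To make the shared-noise idea concrete, below is a minimal PyTorch sketch of one plausible instantiation: each camera view's initial diffusion latent mixes a Gaussian component common to all views with an independent per-view component, so denoising starts from spatially correlated latents. The function name, tensor layout, and the mixing weight `alpha` are illustrative assumptions, not Delphi's actual formulation.

```python
import torch

def shared_noise_init(num_views: int, num_frames: int, c: int, h: int, w: int,
                      alpha: float = 0.5) -> torch.Tensor:
    """Draw initial diffusion noise for all camera views such that every
    view shares a common Gaussian component (hypothetical sketch).
    Mixing weights keep the result unit-variance, as DDPM-style
    samplers expect."""
    shared = torch.randn(1, num_frames, c, h, w)           # identical across views
    private = torch.randn(num_views, num_frames, c, h, w)  # independent per view
    # alpha^2 + (1 - alpha^2) = 1, so each latent remains N(0, 1)
    return alpha * shared + (1.0 - alpha ** 2) ** 0.5 * private

# Example: 6 surround-view cameras, 40 frames, 4-channel latents at 28x50
latents = shared_noise_init(num_views=6, num_frames=40, c=4, h=28, w=50)
print(latents.shape)  # torch.Size([6, 40, 4, 28, 50])
```

Tying the views to a common noise component correlates their denoising trajectories from the first step, which is one simple way to encourage cross-view spatial consistency without changing the denoiser itself.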