In this paper, we introduce the first large-scale video prediction model for autonomous driving. To remove the dependence on costly data collection and to improve the model's generalization, we gather massive amounts of driving video from the web and pair it with diverse, high-quality text descriptions. The resulting dataset comprises over 2000 hours of driving videos, spanning regions all over the world with diverse weather conditions and traffic scenarios. Building on recent latent diffusion models, our model, dubbed GenAD, handles the challenging dynamics of driving scenes with novel temporal reasoning blocks. We show that it generalizes to various unseen driving datasets in a zero-shot manner, surpassing both general and driving-specific video prediction counterparts. Furthermore, GenAD can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.