Vision-centric autonomous driving has recently attracted wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pretext tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by introducing a world model-based autonomous driving 4D representation learning framework, dubbed \emph{DriveWorld}, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. Specifically, we propose a Memory State-Space Model for spatio-temporal modelling, which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes, and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset, DriveWorld achieves a 7.5% increase in mAP for 3D object detection, a 3.0% increase in IoU for online mapping, a 5.0% increase in AMOTA for multi-object tracking, a 0.1m decrease in minADE for motion forecasting, a 3.0% increase in IoU for occupancy prediction, and a 0.34m reduction in average L2 error for planning.
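To make the two-branch design concrete, the following is a hypothetical, minimal sketch of the idea behind the Memory State-Space Model: a dynamic branch that reads from a memory bank and predicts the next latent state, plus a static branch that propagates slowly-varying scene context. All names, shapes, and update rules here are illustrative assumptions; the paper's actual modules are learned neural networks operating on multi-camera features, not the simple linear/FIFO mechanics shown.

```python
# Illustrative sketch only -- NOT the paper's architecture. The transition
# matrix, softmax memory read, FIFO write, and running-average static branch
# are all simplifying assumptions made for this example.
import numpy as np

class MemoryStateSpaceSketch:
    def __init__(self, dim, memory_slots, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((dim, dim)) * 0.1  # assumed linear latent transition
        self.memory = np.zeros((memory_slots, dim))     # "dynamic memory bank" stand-in
        self.static = np.zeros(dim)                     # propagated static scene code
        self.t = 0

    def step(self, obs):
        """One spatio-temporal update: read memory, predict dynamics, blend statics."""
        # Dynamic branch: softmax attention over memory slots with the current observation.
        scores = self.memory @ obs
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        context = weights @ self.memory
        # Temporal-aware latent dynamics: predict the next latent state.
        next_latent = np.tanh(self.A @ (obs + context))
        # Write the new latent into the oldest slot (FIFO assumption).
        self.memory[self.t % len(self.memory)] = next_latent
        self.t += 1
        # Static branch: exponential moving average keeps slowly-varying scene context.
        self.static = 0.9 * self.static + 0.1 * obs
        # Fused representation: dynamics for future prediction, statics for context.
        return next_latent + self.static
```

Feeding a sequence of per-frame feature vectors through `step` yields a fused spatio-temporal representation at each timestep; downstream heads would consume these features, conditioned on a task prompt.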