Autonomous driving systems typically adopt a modular design that divides the full stack into perception, prediction, planning, and control modules. Though interpretable, such a modular design tends to introduce a substantial amount of redundancy. Recently, multimodal large language models (MLLMs) and diffusion techniques have demonstrated strong comprehension and generation abilities. In this paper, we first introduce the concept of the interleaved vision-action pair, which unifies the format of visual features and control signals. Based on these vision-action pairs, we construct a general world model for autonomous driving built on an MLLM and a diffusion model, termed ADriver-I. It takes the vision-action pairs as input and autoregressively predicts the control signal of the current frame. The generated control signal, together with the historical vision-action pairs, then serves as the condition for predicting future frames. Given the predicted next frame, ADriver-I predicts the next control signal. This process can be repeated indefinitely, so ADriver-I achieves autonomous driving in a world it generates itself. Extensive experiments are conducted on nuScenes and our large-scale private datasets. ADriver-I shows impressive performance compared to several constructed baselines. We hope ADriver-I can provide new insights for future autonomous driving and embodied intelligence.
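The closed loop described in the abstract (predict a control signal, predict the next frame, repeat) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `rollout`, `predict_action`, and `predict_next_frame`, the `Frame` and `Action` types, and the toy stand-in functions are all assumptions made for illustration, and the two callables merely stand in for whatever models produce control signals and future frames.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in types (assumptions, not from the paper).
Frame = List[float]            # visual features of one frame
Action = Tuple[float, float]   # a control signal, e.g. (speed, steering angle)
Pair = Tuple[Frame, Action]    # one vision-action pair


def rollout(
    history: List[Pair],
    current_frame: Frame,
    predict_action: Callable[[List[Pair], Frame], Action],
    predict_next_frame: Callable[[List[Pair]], Frame],
    steps: int,
) -> List[Pair]:
    """Alternate action prediction and frame prediction for `steps` iterations."""
    history = list(history)  # copy so the caller's list is not mutated
    frame = current_frame
    for _ in range(steps):
        # 1) Predict the control signal of the current frame
        #    from the past pairs plus the current frame.
        action = predict_action(history, frame)
        # 2) Store the new vision-action pair.
        history.append((frame, action))
        # 3) Predict the next frame conditioned on all pairs so far,
        #    including the action that was just generated.
        frame = predict_next_frame(history)
    return history


# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_action(history: List[Pair], frame: Frame) -> Action:
    return (sum(frame) / len(frame), 0.0)  # "speed" = mean feature value


def toy_next_frame(history: List[Pair]) -> Frame:
    frame, (speed, _) = history[-1]
    return [x + speed for x in frame]  # shift features by the chosen speed


trajectory = rollout([], [1.0, 3.0], toy_action, toy_next_frame, steps=3)
```

In this sketch the loop runs for a fixed number of `steps`; the abstract's point is that nothing in the structure prevents it from continuing, since every predicted frame becomes the input for the next control prediction.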