End-to-end autonomous driving has received increasing attention due to its potential to learn from large amounts of data. However, most existing methods are still open-loop and suffer from weak scalability, a lack of high-order interactions, and inefficient decision-making. In this paper, we explore a closed-loop framework for autonomous driving and propose a large Driving wOrld modEl (Doe-1) for unified perception, prediction, and planning. We formulate autonomous driving as a next-token generation problem and use multi-modal tokens to accomplish different tasks. Specifically, we use free-form texts (i.e., scene descriptions) for perception and generate future predictions directly in the RGB space with image tokens. For planning, we employ a position-aware tokenizer to effectively encode actions into discrete tokens. We train a multi-modal transformer to autoregressively generate perception, prediction, and planning tokens in an end-to-end and unified manner. Experiments on the widely used nuScenes dataset demonstrate the effectiveness of Doe-1 on various tasks including visual question-answering, action-conditioned video generation, and motion planning. Code: https://github.com/wzzheng/Doe.
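To make the token layout concrete, the sketch below illustrates the kind of interleaved multi-modal stream the abstract describes: per timestep, image (observation) tokens, then text (description) tokens, then discretized action tokens. The uniform quantizer here merely stands in for the paper's position-aware action tokenizer; all function names, bin counts, and value ranges are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming a uniform quantizer as a stand-in for Doe-1's
# position-aware action tokenizer. Names and parameters are hypothetical.

def tokenize_action(x, y, lo=-5.0, hi=5.0, bins=128):
    """Quantize a 2D displacement (meters) into two discrete token ids."""
    def quantize(v):
        v = min(max(v, lo), hi)          # clamp to the supported range
        return int((v - lo) / (hi - lo) * (bins - 1))
    return [quantize(x), quantize(y)]

def detokenize_action(tokens, lo=-5.0, hi=5.0, bins=128):
    """Map token ids back to continuous displacements (bin centers)."""
    step = (hi - lo) / (bins - 1)
    return [lo + t * step for t in tokens]

def build_sequence(frames):
    """Interleave per-timestep modalities into one autoregressive stream:
    image tokens -> text tokens -> action tokens, repeated over time,
    so a single transformer can do perception, prediction, and planning."""
    seq = []
    for f in frames:
        seq += [("img", t) for t in f["image_tokens"]]
        seq += [("txt", t) for t in f["text_tokens"]]
        seq += [("act", t) for t in tokenize_action(*f["action"])]
    return seq
```

In a closed-loop setting, the model would generate the text and action tokens for the current frame, the predicted image tokens for the next frame would be decoded back to RGB, and the loop would continue from the new observation.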