Multimodal large language models (MLLMs) have attracted considerable interest in the research community because they can process and reason over non-textual data such as images and videos. This study extends MLLMs to autonomous driving by introducing DriveGPT4, a novel interpretable end-to-end autonomous driving system based on LLMs. DriveGPT4 takes multi-frame video inputs and textual queries, interprets vehicle actions, provides the reasoning behind them, and answers a diverse range of user questions. It also predicts low-level vehicle control signals in an end-to-end fashion. These capabilities are achieved with a bespoke visual instruction tuning dataset tailored to autonomous driving, combined with a mix-finetuning training strategy. DriveGPT4 is a pioneering effort to leverage LLMs for an interpretable end-to-end autonomous driving solution. Evaluations on the BDD-X dataset show that DriveGPT4 achieves superior qualitative and quantitative performance. In addition, fine-tuning on domain-specific data enables DriveGPT4 to achieve results close to, or even better than, those of GPT4-V on autonomous driving grounding. The code and dataset will be publicly available.