In pursuit of fully autonomous vehicles (AVs) capable of navigating complex real-world scenarios with human-like understanding and responsiveness, we introduce Dolphins, a novel vision-language model designed to acquire human-like abilities as a conversational driving assistant. Dolphins processes multimodal inputs comprising video (or image) data, text instructions, and historical control signals, and generates informed outputs that correspond to the provided instructions. Building on the open-source pretrained vision-language model OpenFlamingo, we first enhance the reasoning capabilities of Dolphins through a Grounded Chain of Thought (GCoT) process. We then tailor Dolphins to the driving domain by constructing driving-specific instruction data and conducting instruction tuning. Using the BDD-X dataset, we design four distinct AV tasks and consolidate them into Dolphins to foster a holistic understanding of intricate driving scenarios. As a result, the distinctive features of Dolphins fall along two dimensions: (1) the ability to provide a comprehensive understanding of complex and long-tailed open-world driving scenarios and to solve a spectrum of AV tasks, and (2) the emergence of human-like capabilities, including gradient-free instant adaptation via in-context learning and error recovery via reflection.