We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems to boost generalization and enable interactivity with human users. While recent approaches adapt VLMs to driving via single-round visual question answering (VQA), human drivers reason about decisions in multiple steps: starting from the localization of key objects, they estimate object interactions before taking actions. Our key insight is that the proposed task, Graph VQA, which models graph-structured reasoning through perception, prediction, and planning question-answer pairs, is a suitable proxy task for mimicking this human reasoning process. We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving. The experiments demonstrate that Graph VQA provides a simple, principled framework for reasoning about a driving scene, and that DriveLM-Data provides a challenging benchmark for this task. Our DriveLM-Agent baseline performs end-to-end autonomous driving competitively with state-of-the-art driving-specific architectures. Notably, its benefits are pronounced when it is evaluated zero-shot on unseen objects or sensor configurations. We hope this work can serve as a starting point for applying VLMs to autonomous driving. To facilitate future research, all code, data, and models are publicly available.
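To make the graph-structured reasoning concrete, the following is a minimal sketch of how question-answer pairs from the perception, prediction, and planning stages might be arranged in a dependency graph and answered in order, with each question receiving its parents' QA pairs as context. It is an illustration under stated assumptions, not the DriveLM-Agent implementation: the names `QANode` and `answer_graph`, the prompt format, and the `model` callable (a stand-in for a VLM) are all hypothetical.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List


@dataclass
class QANode:
    """One question in the graph (hypothetical structure, not from the paper)."""
    qid: str
    stage: str          # "perception", "prediction", or "planning"
    question: str
    parents: List[str] = field(default_factory=list)  # qids this question depends on


def answer_graph(nodes: List[QANode], model: Callable[[str], str]) -> Dict[str, str]:
    """Answer every question after its parents, passing parent QA pairs as context.

    `model` is an assumed text-in/text-out callable standing in for a VLM.
    """
    by_id = {n.qid: n for n in nodes}
    # TopologicalSorter takes {node: predecessors} and yields predecessors first.
    order = TopologicalSorter({n.qid: n.parents for n in nodes}).static_order()
    answers: Dict[str, str] = {}
    for qid in order:
        node = by_id[qid]
        # Build the context from the already-answered parent questions.
        context = "".join(
            f"Q: {by_id[p].question} A: {answers[p]}\n" for p in node.parents
        )
        answers[qid] = model(context + f"Q: {node.question} A:")
    return answers


if __name__ == "__main__":
    # Illustrative three-stage chain; the questions are made up for this sketch.
    graph = [
        QANode("q1", "perception", "What are the key objects in the scene?"),
        QANode("q2", "prediction", "Where will the pedestrian move next?", ["q1"]),
        QANode("q3", "planning", "What should the ego vehicle do?", ["q2"]),
    ]
    # Placeholder model that ignores the image and returns a fixed string.
    for qid, ans in answer_graph(graph, lambda prompt: "(model answer)").items():
        print(qid, ans)
```

Answering in topological order is one simple way to ensure that a planning question is only posed after the perception and prediction answers it depends on are available; other context-passing schemes are equally possible.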