Advances in multimodal large language models (MLLMs) have led to growing interest in LLM-based autonomous driving agents that leverage their strong reasoning capabilities. However, capitalizing on these capabilities for improved planning behavior is challenging, since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering (VQA) dataset that challenges the true 3D situational awareness of a model with comprehensive tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.