Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning in complex driving scenarios. However, VLMs are trained as generalist models and lack a specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, they struggle to build structured spatial-temporal representations that capture the geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built on a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark show that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating hierarchical knowledge structuring as an effective way to adapt generalist VLMs to autonomous driving.