While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework designed to equip existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an additional input modality. A key challenge lies in mitigating the catastrophic disturbance that the disparate 3D modality introduces to the pre-trained VLM. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring stability and preserving the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.
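To make the gradual-fusion idea concrete, the sketch below (not the released implementation; all module names, shapes, and the linear warmup schedule are illustrative assumptions) shows one way a Q-Former-style module could inject LiDAR features into a pre-trained VLM incrementally: learnable queries cross-attend to encoded LiDAR tokens, and the resulting tokens are blended into the VLM's visual token stream through a gate that ramps from 0 to 1 during training, so the frozen knowledge of the VLM is perturbed only gradually.

```python
import torch
import torch.nn as nn

class GradualFusionQFormer(nn.Module):
    """Illustrative sketch of gradual LiDAR-feature injection via a Q-Former."""
    def __init__(self, dim=768, num_queries=32, num_heads=8, warmup_steps=10_000):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.warmup_steps = warmup_steps
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def fusion_gate(self):
        # Ramps linearly from 0 to 1 over the warmup, so LiDAR features are
        # injected incrementally rather than disturbing the VLM all at once.
        return torch.clamp(self.step.float() / self.warmup_steps, 0.0, 1.0)

    def forward(self, vlm_tokens, lidar_tokens):
        # vlm_tokens:   (B, N_v, dim) visual tokens the VLM already consumes
        # lidar_tokens: (B, N_l, dim) encoded LiDAR point-cloud features
        B = lidar_tokens.size(0)
        q = self.norm_q(self.queries).unsqueeze(0).expand(B, -1, -1)
        kv = self.norm_kv(lidar_tokens)
        fused, _ = self.cross_attn(q, kv, kv)   # queries attend to LiDAR tokens
        fused = fused + self.ffn(fused)
        gate = self.fusion_gate()
        if self.training:
            self.step += 1
        # Gated concatenation: at gate=0 the language model sees only its original
        # visual tokens; as training proceeds, LiDAR-derived tokens are blended in.
        return torch.cat([vlm_tokens, gate * fused], dim=1)

# Usage sketch: fuse LiDAR features before handing tokens to the language model.
fusion = GradualFusionQFormer()
vlm_tokens = torch.randn(2, 196, 768)    # e.g. ViT patch tokens
lidar_tokens = torch.randn(2, 512, 768)  # e.g. voxelized point-cloud features
tokens_for_llm = fusion(vlm_tokens, lidar_tokens)  # (2, 196 + 32, 768)
```

A scheduled gate of this kind is one plausible reading of "incremental injection"; the actual framework may instead stage fusion layer by layer or curriculum-wise over the SA-QA data.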