The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest, owing to their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens extracted independently from multi-view and multi-frame images, leading to redundant computation and limited spatial consistency. This fragmented visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. In contrast, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent, semantically distilled BEV representation with LLMs. Through extensive experiments, we show that by leveraging BEV features as a unified input, BEVLM enables LLMs to reason more effectively across views in driving scenes, improving accuracy by 46%. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.
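
For illustration, below is a minimal PyTorch sketch of the two connections the abstract describes: an adapter that pools the BEV grid into a unified token sequence for the LLM, and a feature-alignment loss for distilling LLM semantics back into BEV features. All module names, dimensions, and the choice of a cosine-alignment loss are assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVToLLMAdapter(nn.Module):
    """Projects a flattened BEV grid into the LLM token-embedding space,
    so the LLM receives one unified spatial representation instead of
    independent multi-view / multi-frame image tokens (hypothetical module)."""
    def __init__(self, bev_dim: int = 256, llm_dim: int = 4096, num_queries: int = 64):
        super().__init__()
        # Learnable queries pool the dense BEV grid into a short token sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, bev_dim))
        self.attn = nn.MultiheadAttention(bev_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(bev_dim, llm_dim)

    def forward(self, bev: torch.Tensor) -> torch.Tensor:
        # bev: (B, H*W, bev_dim), the flattened BEV feature grid
        q = self.queries.unsqueeze(0).expand(bev.size(0), -1, -1)
        pooled, _ = self.attn(q, bev, bev)   # (B, num_queries, bev_dim)
        return self.proj(pooled)             # (B, num_queries, llm_dim) LLM input tokens

def semantic_distillation_loss(bev_tokens: torch.Tensor,
                               llm_features: torch.Tensor) -> torch.Tensor:
    """One plausible realization of distilling LLM semantic knowledge into
    BEV representations: align BEV tokens to LLM features via cosine similarity."""
    return (1.0 - F.cosine_similarity(bev_tokens, llm_features, dim=-1)).mean()
```

In this sketch, the pooling step is what keeps the LLM input compact and view-consistent, and the distillation loss is applied only during training so that inference-time driving incurs no extra LLM cost.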