A primary hurdle for autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and subtle human behaviors. We introduce DriveVLM, an autonomous driving system that leverages Vision-Language Models (VLMs) for enhanced scene understanding and planning. DriveVLM integrates chain-of-thought (CoT) modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing that VLMs are limited in spatial reasoning and carry heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that combines the strengths of DriveVLM with the traditional autonomous driving pipeline. DriveVLM-Dual achieves robust spatial understanding and real-time inference speed. Extensive experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the effectiveness of DriveVLM and the enhanced performance of DriveVLM-Dual, which surpasses existing methods in complex and unpredictable driving conditions.