A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system that leverages Vision-Language Models (VLMs) for enhanced scene understanding and planning. DriveVLM integrates a unique combination of chain-of-thought (CoT) modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and their heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. DriveVLM-Dual achieves robust spatial understanding and real-time inference speed. Extensive experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the effectiveness of DriveVLM and the enhanced performance of DriveVLM-Dual, which surpasses existing methods in complex and unpredictable driving conditions.
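The pipeline described above can be sketched as pseudocode: the VLM branch runs the three CoT stages (scene description, scene analysis, hierarchical planning), and the dual branch hands the coarse plan to a conventional real-time planner for refinement. This is a minimal illustrative sketch; all class and function names are assumptions for exposition, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of the DriveVLM-Dual flow; names are illustrative.
Waypoint = Tuple[float, float]

@dataclass
class SceneDescription:
    weather: str
    critical_objects: List[str]

def describe_scene(frames: List[str]) -> SceneDescription:
    # Stage 1 (VLM): summarize the environment and pick out the
    # objects most likely to influence the ego vehicle.
    return SceneDescription(weather="rain", critical_objects=["cyclist ahead"])

def analyze_scene(desc: SceneDescription) -> List[str]:
    # Stage 2 (VLM): reason about how each critical object may
    # affect the ego vehicle's driving decisions.
    return [f"{obj}: may enter ego lane, keep distance" for obj in desc.critical_objects]

def plan_hierarchically(analysis: List[str]) -> List[Waypoint]:
    # Stage 3 (VLM): go from meta-actions and decisions down to a
    # coarse waypoint trajectory (placeholder values here).
    return [(0.0, 0.0), (0.5, 4.8), (1.0, 9.5)]

def refine_with_classical_planner(coarse: List[Waypoint]) -> List[Waypoint]:
    # Dual branch: a conventional real-time planner refines the slow,
    # spatially imprecise VLM proposal into the final trajectory.
    return [(x, round(y, 1)) for x, y in coarse]

frames = ["front_cam_t0.jpg", "front_cam_t1.jpg"]
trajectory = refine_with_classical_planner(
    plan_hierarchically(analyze_scene(describe_scene(frames)))
)
print(trajectory)
```

The key design choice this sketch reflects is asynchrony of the two branches: the VLM branch is too slow for every control tick, so the classical pipeline runs at full rate while periodically consuming the VLM's coarse plan.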