A primary hurdle for autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and subtle human behaviors. We introduce DriveVLM, an autonomous driving system that leverages Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of chain-of-thought (CoT) modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and their heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that combines the strengths of DriveVLM with the traditional autonomous driving pipeline. DriveVLM-Dual achieves robust spatial understanding and real-time inference speed. Extensive experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the effectiveness of DriveVLM and the enhanced performance of DriveVLM-Dual, surpassing existing methods in complex and unpredictable driving conditions.