A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and subtle human behaviors. We introduce DriveVLM, an autonomous driving system that leverages Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and their heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy DriveVLM-Dual on a production vehicle, verifying its effectiveness in real-world autonomous driving environments.