Large Vision-Language Models (LVLMs), owing to their remarkable visual reasoning ability to understand images and videos, have received widespread attention in the autonomous driving domain, significantly advancing the development of interpretable end-to-end autonomous driving. However, current evaluations of LVLMs primarily focus on their multi-faceted capabilities in common scenarios, and lack quantifiable, automated assessment in autonomous driving contexts, let alone in the severe road corner cases that even state-of-the-art autonomous driving perception systems struggle to handle. In this paper, we propose CODA-LM, a novel vision-language benchmark for self-driving that provides the first automatic and quantitative evaluation of LVLMs for interpretable autonomous driving, covering general perception, regional perception, and driving suggestions. CODA-LM uses texts to describe road images, exploiting powerful text-only large language models (LLMs) without image inputs to assess the capabilities of LVLMs in autonomous driving scenarios, which reveals stronger alignment with human preferences than LVLM judges. Experiments demonstrate that even closed-source commercial LVLMs such as GPT-4V cannot handle road corner cases well, suggesting that we are still far from a strong LVLM-powered intelligent driving agent, and we hope CODA-LM can serve as a catalyst to promote future development.
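To make the text-only judging protocol mentioned above concrete, the sketch below shows one possible way a text-only LLM could score an LVLM's answer against a reference using a textual scene description instead of the image. This is a minimal illustration under assumptions of ours: the function names, prompt wording, scoring scale, and model string are hypothetical and are not the benchmark's released implementation.

```python
# Minimal sketch of a text-only LLM-as-judge setup (illustrative, not CODA-LM's actual code).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def build_judge_prompt(scene_text: str, reference: str, candidate: str) -> str:
    """Compose a judging prompt from a textual scene description,
    a reference answer, and an LVLM-generated answer (all plain text)."""
    return (
        "You are evaluating driving-scene answers. Scene description (text only):\n"
        f"{scene_text}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Candidate answer:\n{candidate}\n\n"
        "Rate the candidate from 1 (poor) to 10 (excellent) and explain briefly."
    )


def judge(scene_text: str, reference: str, candidate: str) -> str:
    """Query a text-only LLM judge; no image input is required."""
    prompt = build_judge_prompt(scene_text, reference, candidate)
    resp = client.chat.completions.create(
        model="gpt-4",  # any capable text-only LLM can play the judge role
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic scoring for reproducibility
    )
    return resp.choices[0].message.content
```

The key design point this sketch reflects is that the judge never sees pixels: the road image is replaced by a textual description, so any strong text-only LLM can serve as an automatic, quantitative evaluator.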