Large Vision-Language Models (LVLMs) have received widespread attention for advancing interpretable self-driving. Existing evaluations of LVLMs primarily focus on multi-faceted capabilities in natural circumstances, lacking automated, quantifiable assessment for self-driving, let alone for severe road corner cases. In this paper, we propose CODA-LM, the first benchmark for the automatic evaluation of LVLMs on self-driving corner cases. We adopt a hierarchical data structure to prompt powerful LVLMs to analyze complex driving scenes and generate high-quality pre-annotations for human annotators. For LVLM evaluation, we show that using text-only large language models (LLMs) as judges yields even better alignment with human preferences than LVLM judges. Moreover, with CODA-LM, we build CODA-VLM, a new driving LVLM that surpasses all open-source counterparts on CODA-LM. CODA-VLM performs comparably to GPT-4V, even surpassing GPT-4V by +21.42% on the regional perception task. We hope CODA-LM can become a catalyst promoting interpretable self-driving empowered by LVLMs.
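To make the evaluation finding concrete: the claim is that a text-only LLM judge, given only a reference annotation and a candidate answer (no image), aligns better with human preference than an LVLM judge. Below is a minimal sketch of such a judge, assuming the OpenAI Python SDK; the prompt wording, the 1-10 rubric, and the helper name `judge_score` are illustrative assumptions, not the actual CODA-LM judging protocol.

```python
# Minimal text-only LLM-as-judge sketch (hypothetical prompt and rubric;
# the actual CODA-LM judge configuration is defined by the authors).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are a judge for driving-scene descriptions.
Reference answer:
{reference}

Model answer:
{candidate}

Rate the model answer against the reference on a 1-10 scale for
accuracy and completeness. Reply with the integer score only."""

def judge_score(reference: str, candidate: str) -> int:
    """Ask a text-only LLM to grade a candidate answer against a reference."""
    response = client.chat.completions.create(
        model="gpt-4",  # any strong text-only LLM can serve as the judge
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(reference=reference,
                                           candidate=candidate),
        }],
        temperature=0,  # deterministic grading
    )
    return int(response.choices[0].message.content.strip())
```

Because the judge sees only text, the image itself never enters the comparison; the reference annotation stands in for the visual ground truth, which is what lets a text-only LLM act as the grader.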