We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient with human evaluations, surpassing existing techniques such as METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.
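As a minimal sketch of how a learned judge can be validated against human evaluations, one might compute the Spearman rank correlation between per-question metric scores and human correctness ratings. The scoring pipeline and all values below are illustrative assumptions, not taken from the LingoQA benchmark itself:

```python
from scipy.stats import spearmanr

# Hypothetical per-question scores from an automatic metric (e.g. a learned
# truthfulness judge) and the corresponding human correctness ratings.
# These values are placeholders for illustration only.
judge_scores = [0.92, 0.15, 0.78, 0.05, 0.88, 0.33]
human_ratings = [1, 0, 1, 0, 1, 0]

# Spearman's rho measures how well the metric ranks answers in the same
# order as human judgments; values near 1 indicate strong agreement.
rho, p_value = spearmanr(judge_scores, human_ratings)
print(f"Spearman correlation: {rho:.2f} (p={p_value:.3f})")
```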