We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance falls below human capability, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient with human evaluations, surpassing existing techniques such as METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark at https://github.com/wayveai/LingoQA as an evaluation platform for vision-language models in autonomous driving.