Eight years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accuracy has been effective so far in the independent and identically distributed (IID) evaluation setting. However, the community is shifting towards open-ended generative models and out-of-distribution (OOD) evaluation. In this new paradigm, the existing VQA Accuracy metric is overly stringent and underestimates the performance of VQA systems. Thus, there is a need for more robust automatic VQA metrics that serve as a proxy for human judgment. In this work, we propose to leverage the in-context learning capabilities of instruction-tuned large language models (LLMs) to build a better VQA metric. We formulate VQA evaluation as an answer-rating task in which the LLM is instructed to score the accuracy of a candidate answer given a set of reference answers. We demonstrate that the proposed metric correlates better with human judgment than existing metrics across several VQA models and benchmarks. We hope that wide adoption of our metric will contribute to better estimating research progress on the VQA task.
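The answer-rating formulation can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the 1 to 3 rating scale, and the names `build_rating_prompt`, `parse_rating`, and `rate_answer` are all hypothetical, and the LLM is passed in as a plain callable so that no particular API is assumed. For contrast, `vqa_accuracy` is a simplified form of the commonly used VQA Accuracy rule, min(1, matches / 3), without the answer normalization that real evaluation scripts apply.

```python
import re
from typing import Callable, List, Optional


def vqa_accuracy(candidate: str, references: List[str]) -> float:
    """Simplified exact-match baseline: min(1, number of matching references / 3)."""
    cand = candidate.strip().lower()
    matches = sum(ref.strip().lower() == cand for ref in references)
    return min(1.0, matches / 3)


def build_rating_prompt(question: str, references: List[str], candidate: str) -> str:
    """Assemble a rating instruction (hypothetical wording and scale)."""
    refs = ", ".join(references)
    return (
        "You are given a question, reference answers, and a candidate answer.\n"
        "Rate the candidate's correctness from 1 (incorrect) to 3 (correct).\n"
        f"Question: {question}\n"
        f"Reference answers: {refs}\n"
        f"Candidate answer: {candidate}\n"
        "Rating:"
    )


def parse_rating(response: str, lo: int = 1, hi: int = 3) -> Optional[float]:
    """Map the first integer in the response onto [0, 1]; None if absent or out of range."""
    match = re.search(r"\d+", response)
    if match is None:
        return None
    score = int(match.group())
    if not lo <= score <= hi:
        return None
    return (score - lo) / (hi - lo)


def rate_answer(
    question: str,
    references: List[str],
    candidate: str,
    llm: Callable[[str], str],
) -> Optional[float]:
    """Ask the supplied LLM callable for a rating and normalize it."""
    return parse_rating(llm(build_rating_prompt(question, references, candidate)))


if __name__ == "__main__":
    refs = ["cat"] * 10
    # A verbose but correct generative answer gets no credit under exact match.
    print(vqa_accuracy("a sleeping cat", refs))
    # Stub standing in for a real LLM call; it always answers "3".
    print(rate_answer("What animal is this?", refs, "a sleeping cat", lambda p: "3"))
```

The stub in the usage example only shows the data flow; in practice the callable would wrap whichever instruction-tuned model is being used, and the prompt would likely include in-context examples of rated answers.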