Vision Language Models (VLMs) extend the remarkable capabilities of text-only large language models and vision-only models, and are able to learn from and process multi-modal vision-text input. While modern VLMs perform well on a number of standard image classification and image-text matching tasks, they still struggle with crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. Moreover, while they can be very brittle to small variations in instructions and/or evaluation protocols, existing benchmarks fail to evaluate their robustness (or rather the lack of it). In order to couple challenging VL scenarios with comprehensive robustness evaluation, we introduce DARE, Diverse Visual Question Answering with Robustness Evaluation, a carefully created and curated multiple-choice VQA benchmark. DARE evaluates VLM performance on five diverse categories and includes four robustness-oriented evaluations based on variations of: the prompts, the subsets of answer options, the output format, and the number of correct answers. Among a spectrum of other findings, we report that state-of-the-art VLMs still struggle with questions in most categories and are unable to consistently deliver their peak performance across the tested robustness evaluations. The worst-case performance across the subsets of options is up to 34% below the performance in the standard case. The robustness of open-source VLMs such as LLaVA 1.6 and Idefics2 cannot match that of closed-source models such as GPT-4 and Gemini, but even the latter remain very brittle to different variations.
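To make the "worst-case performance across the subsets of options" comparison concrete, here is a minimal sketch of how such a metric could be computed. The function names and the per-question record layout are illustrative assumptions, not DARE's actual evaluation code: standard accuracy scores each question once, while worst-case accuracy counts a question as correct only if the model answers it correctly under every answer-option subset.

```python
# Hypothetical sketch (record layout and names are assumptions, not DARE's code).
# Each question record stores whether the model was correct in the standard
# setting and under each answer-option-subset variant.

def standard_accuracy(results):
    """Fraction of questions answered correctly in the standard setting."""
    return sum(r["standard"] for r in results) / len(results)

def worst_case_accuracy(results):
    """Fraction of questions answered correctly under *all* option subsets."""
    return sum(all(r["subset_variants"]) for r in results) / len(results)

# Toy example: one question fails under a single subset variant, which
# lowers worst-case accuracy but not standard accuracy.
toy = [
    {"standard": True,  "subset_variants": [True, True, True]},
    {"standard": True,  "subset_variants": [True, False, True]},
    {"standard": False, "subset_variants": [False, False, False]},
    {"standard": True,  "subset_variants": [True, True, True]},
]
print(standard_accuracy(toy))    # 0.75
print(worst_case_accuracy(toy))  # 0.5
```

The gap between the two numbers is the kind of robustness drop the benchmark reports (up to 34% for state-of-the-art models on the option-subset variation).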