Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.