Real-world visual question answering (VQA) is often context-dependent: an image-question pair may be under-specified, such that the correct answer depends on external information that is not observable in the image. In such cases, answering directly can lead to confident but incorrect predictions. We propose CoA (Clarify-or-Answer), an ask-or-answer agent that separately models the decision to ask or answer and, when asking, what to ask. CoA first determines whether clarification is necessary; if so, it asks a single focused question and then incorporates the response to produce the final answer. We introduce CONTEXTCLARIFY, a dataset of ambiguous VQA questions paired with a non-ambiguous contrast set. We further introduce GRPO-CR (Clarification Reasoning), a reinforcement learning approach that optimizes clarification question generation with multiple reward signals encouraging well-formed, focused, non-trivial questions that resolve ambiguity. Across three VLLMs and three datasets, CoA achieves consistent improvements at both the module and system levels, improving end-to-end VQA accuracy by an average of +15.3 points (an 83% relative improvement) over prompting-based baselines.