Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate the hallucinations and blind spots of individual models, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has proven successful in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussion while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; \textit{aligned visual evidence}, i.e., shared support from the image regions that agents rely on, is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (\textbf{E}vidence-\textbf{A}ligned \textbf{G}rounded mu\textbf{L}ti-agent r\textbf{E}asoning), a training-free, evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over that evidence, and uses evidence consistency to guide the final decision. Experiments on six VQA benchmarks show that EAGLE achieves the best average performance across domains while remaining lightweight, interpretable, and practical for deployment.