Recently, several methods have been proposed to augment large Vision Language Models (VLMs) for Visual Question Answering (VQA) by incorporating external knowledge from knowledge bases or visual clues derived from question decomposition. Although these methods achieve promising results, they still face a key challenge: VLMs cannot inherently understand the incorporated knowledge and may fail to generate optimal answers. In contrast, human cognition approaches visual questions through a top-down reasoning process, systematically exploring relevant issues to derive a comprehensive answer. This process not only facilitates an accurate answer but also provides a transparent rationale for the decision-making pathway. Motivated by this cognitive mechanism, we introduce a novel, explainable multi-agent collaboration framework that imitates human-like top-down reasoning by leveraging the expansive knowledge of Large Language Models (LLMs). Our framework comprises three agents, the Responder, the Seeker, and the Integrator, each contributing uniquely to the top-down reasoning process. The VLM-based Responder generates answer candidates for the question and responds to the other issues raised by the Seeker. The Seeker, primarily based on an LLM, identifies issues relevant to the question to inform the Responder and constructs a Multi-View Knowledge Base (MVKB) for the given visual scene by leveraging the understanding capabilities of the LLM. The Integrator combines information from the Seeker and the Responder to produce the final VQA answer. Through this collaboration mechanism, our framework explicitly constructs an MVKB for a specific visual scene and derives answers through top-down reasoning. Extensive evaluations on diverse VQA datasets and VLMs demonstrate the superior applicability and interpretability of our framework over the compared methods.
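The collaboration among the Responder, Seeker, and Integrator described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and method names, the prompt wording, and the representation of the MVKB as a plain issue-to-response dictionary are all assumptions made for illustration, and the abstract does not state which model backs the Integrator (an LLM is assumed here). The VLM and LLM are passed in as ordinary callables so the sketch runs without any model library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical model interfaces (assumptions, not from the paper):
# a VLM maps (image, question) to an answer, an LLM maps a prompt to text.
VLM = Callable[[str, str], str]
LLM = Callable[[str], str]


@dataclass
class Responder:
    """VLM-based agent: answers the main question and the Seeker's issues."""
    vlm: VLM

    def answer(self, image: str, question: str) -> str:
        return self.vlm(image, question)


@dataclass
class Seeker:
    """LLM-based agent: proposes relevant issues and assembles the MVKB."""
    llm: LLM

    def relevant_issues(self, question: str, candidate: str) -> List[str]:
        # Assumed prompt format; one issue is expected per output line.
        prompt = (
            f"Question: {question}\n"
            f"Candidate answer: {candidate}\n"
            "List relevant issues to check, one per line."
        )
        return [ln.strip() for ln in self.llm(prompt).splitlines() if ln.strip()]

    def build_mvkb(self, issues: List[str], responses: List[str]) -> Dict[str, str]:
        # Simplification: the MVKB is modelled as an issue -> response mapping.
        return dict(zip(issues, responses))


@dataclass
class Integrator:
    """Combines the candidate answer and the MVKB into a final answer."""
    llm: LLM  # assumption: the abstract does not name the Integrator's model

    def integrate(self, question: str, candidate: str, mvkb: Dict[str, str]) -> str:
        facts = "\n".join(f"- {issue} {resp}" for issue, resp in mvkb.items())
        prompt = (
            f"Question: {question}\n"
            f"Candidate answer: {candidate}\n"
            f"Knowledge:\n{facts}\n"
            "Final answer:"
        )
        return self.llm(prompt)


def answer_question(
    image: str,
    question: str,
    responder: Responder,
    seeker: Seeker,
    integrator: Integrator,
) -> Tuple[str, Dict[str, str]]:
    """One top-down pass: candidate, issues, responses, MVKB, final answer."""
    candidate = responder.answer(image, question)
    issues = seeker.relevant_issues(question, candidate)
    responses = [responder.answer(image, issue) for issue in issues]
    mvkb = seeker.build_mvkb(issues, responses)
    return integrator.integrate(question, candidate, mvkb), mvkb
```

Returning the MVKB alongside the final answer reflects the interpretability claim: the issues explored and the Responder's replies remain available as a rationale for the answer.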