Large Language Models (LLMs) have achieved impressive results in knowledge-based Visual Question Answering (VQA). However, existing methods still face two challenges: the inability to use external tools autonomously, and the inability to work in teams. Humans generally know whether they need external tools when encountering a new question: they can answer a familiar question directly, whereas they turn to tools such as search engines for unfamiliar ones. In addition, humans tend to collaborate and discuss with others to obtain better answers. Inspired by this, we propose a multi-agent voting framework. We design three LLM-based agents that simulate different levels of staff in a team, and assign the available tools according to these levels. Each agent provides a candidate answer, and the final answer is obtained by voting over all agents' answers. Experiments on OK-VQA and A-OKVQA show that our approach outperforms other baselines by 2.2 and 1.0 points, respectively.
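The voting step described above can be sketched minimally as follows; this is an illustrative assumption, not the paper's implementation, and the agent answers are hypothetical placeholders for the outputs of the three tool-equipped LLM agents.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the agents' candidates.

    Ties are broken by first occurrence, one plausible convention;
    the paper does not specify its tie-breaking rule here.
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Hypothetical outputs from three agents with different tool access
# (e.g., the senior agent may have consulted a search engine).
agent_answers = ["umbrella", "umbrella", "parasol"]
print(majority_vote(agent_answers))  # -> umbrella
```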