In this paper, we propose AVIS, an autonomous information-seeking visual question answering framework. Our method uses a Large Language Model (LLM) to dynamically plan the use of external tools and to examine their outputs, thereby acquiring the knowledge needed to answer the posed questions. Answering visual questions that require external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. It presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making on this task. This data is then used to design a system composed of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior guides our system in two key ways. First, we create a transition graph by analyzing the sequences of decisions made by users. This graph delineates distinct states and restricts the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual examples, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
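The interplay of the transition graph, planner, reasoner, and working memory described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: a state is taken to be simply the most recent action, the names `START`, `ANSWER`, `build_transition_graph`, and `run_episode` are hypothetical, and the planner, reasoner, and tools are plain callables standing in for LLM calls and external APIs.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Set, Tuple

START = "START"    # hypothetical initial state
ANSWER = "answer"  # hypothetical terminal action

Memory = List[Tuple[str, str]]  # (action, extracted finding) pairs


def build_transition_graph(traces: List[List[str]]) -> Dict[str, Set[str]]:
    """For each state, collect the actions that users took next.

    Assumption: a state is identified with the previously taken action.
    """
    graph: Dict[str, Set[str]] = defaultdict(set)
    for trace in traces:
        state = START
        for action in trace:
            graph[state].add(action)
            state = action
    return dict(graph)


def run_episode(
    question: str,
    graph: Dict[str, Set[str]],
    planner: Callable[[str, Memory, List[str]], str],
    reasoner: Callable[[str, str], Optional[str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> Memory:
    """Run a plan / act / reason loop restricted by the transition graph."""
    memory: Memory = []  # working memory kept across steps
    state = START
    for _ in range(max_steps):
        # The graph confines the planner to actions seen at this state.
        allowed = sorted(graph.get(state, set()))
        if not allowed:
            break
        action = planner(question, memory, allowed)
        if action not in allowed:
            raise ValueError(f"planner chose disallowed action: {action}")
        if action == ANSWER:
            break
        raw_output = tools[action](question)
        # The reasoner extracts key information, or None if uninformative.
        finding = reasoner(question, raw_output)
        if finding is not None:
            memory.append((action, finding))
        state = action
    return memory


if __name__ == "__main__":
    # Hypothetical user traces and stub tools, for demonstration only.
    traces = [
        ["object_detection", "image_search", ANSWER],
        ["object_detection", "web_search", ANSWER],
    ]
    graph = build_transition_graph(traces)
    tools = {
        name: (lambda q, n=name: f"stub result from {n}")
        for name in ("object_detection", "image_search", "web_search")
    }
    first_allowed = lambda q, mem, allowed: allowed[0]  # stand-in planner
    keep_all = lambda q, raw: raw                       # stand-in reasoner
    print(run_episode("What event does this building commemorate?",
                      graph, first_allowed, keep_all, tools))
```

In this sketch the graph only prunes the action space; the choice among the remaining actions is left to the planner, which in a real system would be prompted with the question, the working memory, and in-context examples drawn from the user study.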