AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

In this paper, we propose AVIS, an autonomous information-seeking framework for visual question answering. Our method uses a Large Language Model (LLM) to dynamically plan the use of external tools and to examine their outputs, thereby acquiring the knowledge needed to answer the posed questions. Answering visual questions that require external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. It presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making on this task. We then use this data to design a system composed of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior guides our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users. This graph delineates distinct states and restricts the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual examples, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
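
The interaction of the three components and the transition graph can be illustrated with a minimal sketch. This is not the authors' implementation: the tool names, the graph contents, the `plan` and `reason` functions, and the canned tool outputs are all hypothetical stand-ins, and the LLM calls are replaced by simple rules so that the example runs on its own. It only shows how a transition graph might restrict the planner's choices while a working memory accumulates what the reasoner extracts.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical transition graph: current state -> tools allowed next.
# In the paper this graph is derived from user-study decisions; the
# entries here are invented for illustration.
TRANSITIONS: Dict[str, List[str]] = {
    "start": ["object_detection", "image_search"],
    "object_detection": ["image_search", "web_search"],
    "image_search": ["web_search"],
    "web_search": [],
}


@dataclass
class WorkingMemory:
    """Retains the question and every fact gathered so far."""
    question: str
    facts: List[str] = field(default_factory=list)


def plan(state: str, used: Set[str]) -> Optional[str]:
    """Stand-in for the LLM planner: pick the first allowed, unused tool."""
    for tool in TRANSITIONS[state]:
        if tool not in used:
            return tool
    return None


def reason(output: str) -> Tuple[Optional[str], Optional[str]]:
    """Stand-in for the LLM reasoner: returns (fact, final_answer)."""
    if output.startswith("ANSWER:"):
        return None, output[len("ANSWER:"):].strip()
    if output:
        return output, None  # informative, keep as a fact
    return None, None        # uninformative, keep nothing


# Hypothetical tools returning canned outputs instead of calling real APIs.
TOOLS: Dict[str, Callable[[WorkingMemory], str]] = {
    "object_detection": lambda m: "object: memorial arch",
    "image_search": lambda m: "entity: Example Arch",
    "web_search": lambda m: "ANSWER: example event",
}


def answer(question: str, max_steps: int = 5) -> Tuple[Optional[str], WorkingMemory]:
    memory = WorkingMemory(question)
    state, used = "start", set()
    for _ in range(max_steps):
        tool = plan(state, used)       # planner chooses within the graph
        if tool is None:
            break                      # no allowed action remains
        used.add(tool)
        fact, final = reason(TOOLS[tool](memory))
        if fact is not None:
            memory.facts.append(fact)  # working memory retains the fact
        if final is not None:
            return final, memory
        state = tool                   # the invoked tool becomes the new state
    return None, memory


result, memory = answer("What event is commemorated by the building in this image?")
print(result, memory.facts)
```

In a real system, `plan` and `reason` would presumably be LLM prompts that include the working memory and examples of user decision-making, and the tools would call actual vision and search APIs; those details are outside what this sketch covers.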
