DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

The field of AI agents is advancing at an unprecedented rate due to the capabilities of large language models (LLMs). However, LLM-driven visual agents mainly focus on tasks in the image modality, which limits their ability to understand the dynamic nature of the real world and leaves them far from real-life applications, e.g., guiding students through laboratory experiments and identifying their mistakes. Because the video modality better reflects the ever-changing and perceptually intensive nature of real-world scenarios, we devise DoraemonGPT, a comprehensive and conceptually elegant LLM-driven system for handling dynamic video tasks. Given a video with a question/task, DoraemonGPT first converts the input video, with its massive content, into a symbolic memory that stores task-related attributes. This structured representation allows sub-task tools to perform spatial-temporal querying and reasoning, yielding concise and relevant intermediate results. Recognizing that LLMs have limited internal knowledge of specialized domains (e.g., analyzing the scientific principles underlying experiments), we incorporate plug-and-play tools to access external knowledge and address tasks across different domains. Moreover, we introduce a novel LLM-driven planner based on Monte Carlo Tree Search to efficiently explore the large planning space for scheduling the various tools. The planner iteratively finds feasible solutions by backpropagating the reward of each result, and multiple solutions can be summarized into an improved final answer. We extensively evaluate DoraemonGPT in dynamic scenes and provide in-the-wild showcases demonstrating its ability to handle more complex questions than previous studies.
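
To make the idea of a symbolic memory that supports spatial-temporal querying more concrete, the following is a minimal sketch, not the system's actual implementation. It assumes the memory can be modeled as a relational table in SQLite; the schema, the sample rows, and the helper names `build_memory` and `actions_at` are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical records that perception models might extract from a video.
# The schema and the values are invented for illustration only.
ROWS = [
    # (object_id, category, start_sec, end_sec, action)
    (1, "person", 0.0, 4.0, "pouring liquid"),
    (2, "beaker", 0.0, 9.0, "stationary"),
    (1, "person", 4.0, 9.0, "stirring"),
]


def build_memory(rows):
    """Store the extracted attributes in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memory (object_id INTEGER, category TEXT, "
        "start_sec REAL, end_sec REAL, action TEXT)"
    )
    conn.executemany("INSERT INTO memory VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def actions_at(conn, category, t):
    """Return actions of `category` objects whose interval [start, end) covers t."""
    cur = conn.execute(
        "SELECT action FROM memory "
        "WHERE category = ? AND start_sec <= ? AND ? < end_sec "
        "ORDER BY start_sec",
        (category, t, t),
    )
    return [row[0] for row in cur.fetchall()]


memory = build_memory(ROWS)
print(actions_at(memory, "person", 5.0))  # prints ['stirring']
```

In this sketch, a sub-task tool answers a question such as "what is the person doing at second 5?" by querying the table and returning a short result to the LLM, so the raw video never has to be passed to it.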
