Recent LLM-driven visual agents focus mainly on image-based tasks, which limits their ability to understand dynamic scenes and keeps them far from real-life applications such as guiding students through laboratory experiments and identifying their mistakes. Because the video modality better reflects the ever-changing nature of real-world scenarios, we devise DoraemonGPT, a comprehensive and conceptually elegant LLM-driven system for dynamic video tasks. Given a video and a question or task, DoraemonGPT first converts the input video into a symbolic memory that stores task-related attributes. This structured representation supports spatial-temporal querying and reasoning by well-designed sub-task tools, which yields concise intermediate results. Recognizing that LLMs have limited internal knowledge of specialized domains (e.g., analyzing the scientific principles underlying experiments), we incorporate plug-and-play tools to access external knowledge and address tasks across different domains. Moreover, we introduce a novel LLM-driven planner based on Monte Carlo Tree Search to explore the large planning space for scheduling the various tools. The planner iteratively finds feasible solutions by backpropagating the reward of each result, and multiple solutions can be summarized into an improved final answer. We extensively evaluate DoraemonGPT on three benchmarks and on challenging in-the-wild scenarios. Code will be released at: https://github.com/z-x-yang/DoraemonGPT.
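To make the idea of a queryable symbolic memory concrete, the following is a minimal sketch and not the DoraemonGPT implementation. It assumes the memory is a single relational table with one row per object observation. The schema, the column names, the sample rows, and the helper functions `build_memory`, `first_time_of_action`, and `objects_in_window` are all hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical schema: one row per (object, timestamp) observation.
# The columns are illustrative assumptions, not the attributes used by DoraemonGPT.
def build_memory(observations):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memory ("
        "obj_id INTEGER, category TEXT, t_sec REAL, x REAL, y REAL, action TEXT)"
    )
    conn.executemany("INSERT INTO memory VALUES (?, ?, ?, ?, ?, ?)", observations)
    conn.commit()
    return conn

def first_time_of_action(conn, category, action):
    """Temporal query: earliest timestamp at which `category` performs `action`."""
    row = conn.execute(
        "SELECT MIN(t_sec) FROM memory WHERE category = ? AND action = ?",
        (category, action),
    ).fetchone()
    return row[0]  # None when no matching row exists

def objects_in_window(conn, t_start, t_end):
    """Temporal query: distinct object categories observed in [t_start, t_end]."""
    rows = conn.execute(
        "SELECT DISTINCT category FROM memory "
        "WHERE t_sec BETWEEN ? AND ? ORDER BY category",
        (t_start, t_end),
    ).fetchall()
    return [r[0] for r in rows]

# Made-up observations standing in for the output of perception models.
observations = [
    (1, "person", 0.0, 0.20, 0.50, "standing"),
    (1, "person", 2.0, 0.35, 0.50, "pouring"),
    (2, "beaker", 2.0, 0.40, 0.55, "static"),
    (1, "person", 4.0, 0.50, 0.50, "stirring"),
]
conn = build_memory(observations)
```

Under these assumptions, a sub-task tool would translate a question such as "when does the person start pouring?" into a call like `first_time_of_action(conn, "person", "pouring")` and hand the short result back to the planner.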
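The planning idea in the abstract (explore a space of tool schedules, backpropagate a reward, keep several feasible solutions) can be sketched with a generic Monte Carlo Tree Search. This is a textbook UCB1-based variant on a toy problem, not the planner from the paper: the tool names, the depth limit, the reward rule, and the use of random choice in place of an LLM proposing the next action are all assumptions made for illustration.

```python
import math
import random

TOOLS = ("query_memory", "search_knowledge", "answer")  # hypothetical tool names
MAX_DEPTH = 3  # assumed limit on plan length

def is_terminal(plan):
    # A plan ends when it calls "answer" or reaches the depth limit.
    return len(plan) == MAX_DEPTH or (len(plan) > 0 and plan[-1] == "answer")

def reward(plan):
    # Toy reward: success only if the plan ends with "answer"
    # and queried the memory at some earlier step.
    ok = len(plan) > 0 and plan[-1] == "answer" and "query_memory" in plan
    return 1.0 if ok else 0.0

class Node:
    def __init__(self, plan, parent=None):
        self.plan = plan
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.value = 0.0

    def untried(self):
        return [t for t in TOOLS if t not in self.children]

def ucb1(child, parent_visits, c=1.4):
    # Mean reward plus an exploration bonus.
    mean = child.value / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)

def search(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(())
    solutions = set()
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while not is_terminal(node.plan) and not node.untried():
            parent = node
            node = max(parent.children.values(),
                       key=lambda ch: ucb1(ch, parent.visits))
        # 2. Expansion: add one untried tool call.
        if not is_terminal(node.plan):
            tool = rng.choice(node.untried())
            child = Node(node.plan + (tool,), parent=node)
            node.children[tool] = child
            node = child
        # 3. Simulation: finish the plan with random tool calls.
        plan = node.plan
        while not is_terminal(plan):
            plan = plan + (rng.choice(TOOLS),)
        r = reward(plan)
        if r > 0:
            solutions.add(plan)  # keep every feasible plan found
        # 4. Backpropagation: push the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return root, solutions

root, solutions = search()
```

In this sketch `solutions` plays the role of the set of feasible plans that could later be summarized into one answer. In a real system the random choices in the expansion and simulation steps would be replaced by LLM calls, and the reward would come from executing the tools.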