Existing LLM agent systems typically select actions from a fixed, predefined set at every step. While this approach is effective in closed, narrowly scoped environments, we argue that it presents two major challenges when deploying LLM agents in real-world scenarios: (1) selecting from a fixed set of actions significantly restricts the planning and acting capabilities of LLM agents, and (2) enumerating and implementing all possible actions requires substantial human effort, which becomes impractical in complex environments with a vast number of potential actions. In this work, we propose an LLM agent framework that enables the dynamic creation and composition of actions in an online manner. In this framework, the agent interacts with the environment by generating and executing programs written in a general-purpose programming language at each step. Furthermore, generated actions are accumulated over time for future reuse. Our extensive experiments on the GAIA benchmark demonstrate that this framework offers significantly greater flexibility and outperforms previous methods. Notably, it allows an LLM agent to recover when no relevant action exists in the predefined set or when existing actions fail due to unforeseen edge cases. At the time of writing, we hold the top position on the GAIA public leaderboard. Our code is available at \href{https://github.com/adobe-research/dynasaur}{https://github.com/adobe-research/dynasaur}.
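To make the idea of generating, executing, and accumulating actions concrete, the following is a minimal Python sketch and not the implementation from the linked repository. The names `ActionLibrary`, `execute`, and the convention that a program leaves its answer in a variable called `result` are assumptions made for illustration only. The programs that an LLM would generate are written by hand here, and a real system would presumably run them in a sandbox rather than with a bare `exec`.

```python
import types


class ActionLibrary:
    """Hypothetical store of accumulated actions (name -> callable)."""

    def __init__(self, initial_actions=None):
        # Optional starting set of actions; it grows as programs define more.
        self.actions = dict(initial_actions or {})

    def execute(self, program: str) -> dict:
        """Run one generated program and keep any new functions it defines.

        Assumed convention: the program stores its answer in `result`.
        """
        # Previously accumulated actions are visible to the new program.
        namespace = dict(self.actions)
        try:
            exec(program, namespace)  # assumption: no sandboxing in this sketch
        except Exception as exc:
            # The error text is returned as an observation so that the agent
            # can react, for example by writing the missing action itself.
            return {"ok": False, "observation": f"{type(exc).__name__}: {exc}"}
        # Accumulate every function defined by a successful program.
        for name, value in namespace.items():
            if isinstance(value, types.FunctionType) and not name.startswith("_"):
                self.actions[name] = value
        return {"ok": True, "observation": namespace.get("result")}


if __name__ == "__main__":
    library = ActionLibrary()
    # Step 1: the program defines a new action and uses it immediately.
    step1 = "def word_count(text):\n    return len(text.split())\nresult = word_count('a b c')"
    print(library.execute(step1))
    # Step 2: a later program reuses the accumulated action.
    print(library.execute("result = word_count('hello world')"))
```

In this sketch, a call to an action that does not exist yet surfaces as a `NameError` observation instead of ending the episode, which mirrors the recovery behavior described in the abstract: the next generated program can define the missing action and retry.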