Large language models can perform well on many isolated tasks, yet they continue to struggle on multi-turn, long-horizon agentic problems that require skills such as planning, state tracking, and long-context processing. In this work, we aim to better understand the relative importance of advancing these underlying capabilities for success on such tasks. We develop an oracle counterfactual framework for multi-turn problems that asks: how would an agent perform if it could leverage an oracle to perfectly perform a specific sub-task? The change in the agent's performance under this oracle assistance lets us measure how critical that skill is to the future advancement of AI agents. We introduce a suite of procedurally generated, game-like tasks with tunable complexity. These controlled environments allow us to apply precise oracle interventions, such as perfect planning or flawless state tracking, and make it possible to isolate the contribution of each oracle without the confounding effects present in real-world benchmarks. Our results show that while some interventions (e.g., planning) consistently improve performance across settings, the usefulness of other skills depends on the properties of the environment and the language model. Our work sheds light on the challenges of multi-turn agentic environments to guide future efforts in the development of AI agents and language models.
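The counterfactual framework described above can be summarized as a simple difference in measured performance. The sketch below is a minimal, hypothetical illustration (the function name and the per-episode scores are assumptions, not from the paper): an oracle's criticality is estimated as the mean performance gain when the agent is given perfect help with that one skill, holding everything else fixed.

```python
from statistics import mean

def oracle_criticality(base_scores, oracle_scores):
    """Estimate the criticality of an oracle skill as the mean
    performance gain when that skill is performed perfectly.

    base_scores:   per-episode scores for the unassisted agent
    oracle_scores: per-episode scores with the oracle intervention
    """
    return mean(oracle_scores) - mean(base_scores)

# Hypothetical per-episode success indicators (0 = fail, 1 = success)
# for the same agent without and with a perfect-planning oracle.
base = [0, 1, 0, 0, 1]
with_planning_oracle = [1, 1, 0, 1, 1]

print(round(oracle_criticality(base, with_planning_oracle), 3))
```

A large positive gap suggests the skill is a genuine bottleneck in that environment; a near-zero gap suggests the agent's failures lie elsewhere, which is how the framework separates interventions that help consistently (e.g., planning) from those whose value is environment-dependent.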