Autonomous agents have demonstrated significant potential in automating complex multi-step decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon tasks. To address these limitations, we present ExACT, an approach that combines test-time search and self-learning to build o1-like models for agentic applications. We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance AI agents' ability to explore the decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate for reliable state evaluation. Next, we introduce Exploratory Learning, a novel learning strategy that teaches agents to search at inference time without relying on any external search algorithm. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state of the art. Additionally, we show that the knowledge and experience gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. After Exploratory Learning, GPT-4o 1) demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success, and 2) matches 87% of R-MCTS's performance while using significantly less compute. Notably, our work demonstrates compute scaling properties at both training time (data collection with R-MCTS) and test time. These results suggest a promising research direction for enhancing VLMs' capabilities for agentic applications via test-time search and self-learning.
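The search loop R-MCTS builds on (UCT-style selection, expansion, state evaluation, and backpropagation) can be sketched in a few dozen lines. This is a minimal illustrative sketch, not the paper's implementation: the dict-based node structure, the `expand` callback, and the `debate_evaluate` stand-in (which simply averages several judges' scores, a crude proxy for multi-agent debate) are all assumptions, and the contrastive-reflection component is omitted entirely.

```python
import math

def uct_select(children, c=1.4):
    """Pick the child maximizing the UCT score (exploitation + exploration)."""
    total = sum(ch["visits"] for ch in children)
    return max(
        children,
        key=lambda ch: ch["value"] / (ch["visits"] + 1e-9)
        + c * math.sqrt(math.log(total + 1) / (ch["visits"] + 1e-9)),
    )

def debate_evaluate(state, judges):
    """Stand-in for multi-agent debate: average the judges' scores for a state."""
    return sum(j(state) for j in judges) / len(judges)

def mcts(root_state, expand, judges, iters=50):
    """Run `iters` rounds of select / expand / evaluate / backpropagate,
    then return the most-visited action (child state) at the root."""
    root = {"state": root_state, "visits": 0, "value": 0.0, "children": []}
    for _ in range(iters):
        node, path = root, [root]
        # Selection: descend via UCT until reaching a leaf.
        while node["children"]:
            node = uct_select(node["children"])
            path.append(node)
        # Expansion: add a child node for each successor state.
        for s in expand(node["state"]):
            node["children"].append(
                {"state": s, "visits": 0, "value": 0.0, "children": []}
            )
        # Evaluation: score the leaf with the debate-style value function.
        v = debate_evaluate(node["state"], judges)
        # Backpropagation: update statistics along the selected path.
        for n in path:
            n["visits"] += 1
            n["value"] += v
    return max(root["children"], key=lambda ch: ch["visits"])["state"]
```

As a toy usage, states can be integers with `expand` proposing `s + 1` and `s + 2` and judges scoring proximity to a target; the search then concentrates visits on the root action whose subtree reaches high-value states soonest.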