EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time; in novel environments they often behave like "clever but clueless interns." This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup in which an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods such as reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. The revised configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
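
To make the Actor/Evolver loop concrete, the following is a minimal sketch of the episode-by-episode structure described above. It is not the EvoTest implementation: the three-room game (`toy_env`), the rule-based `toy_actor` and `toy_evolver`, the `AgentConfig` fields, and the temperature-halving rule are all hypothetical stand-ins chosen for illustration. In the actual framework both roles would presumably be played by language models and the game would come from Jericho. Tool-use routines are omitted for brevity.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Tuple

ACTIONS = ["south", "open door", "north"]  # hypothetical action set
Transcript = List[Tuple[str, str, int]]    # (state, action, reward) per step


@dataclass(frozen=True)
class AgentConfig:
    """Everything the Evolver may rewrite between episodes (assumed fields)."""
    prompt: str
    memory: Dict[str, str] = field(default_factory=dict)  # state -> action that scored
    temperature: float = 1.0


def toy_env(state: str, action: str) -> Tuple[str, int, bool]:
    """Hypothetical 3-room game; returns (next_state, reward, done)."""
    rules = {
        ("start", "north"): ("hall", 1, False),
        ("hall", "open door"): ("vault", 5, True),
    }
    return rules.get((state, action), (state, 0, False))


def toy_actor(config: AgentConfig, state: str, step: int) -> str:
    """Stand-in for the Actor Agent: reuse memory, otherwise cycle through actions."""
    if state in config.memory:
        return config.memory[state]
    return ACTIONS[step % len(ACTIONS)]


def run_episode(config: AgentConfig, max_steps: int = 3) -> Transcript:
    """Play one episode with a fixed configuration and record the transcript."""
    state, transcript = "start", []
    for step in range(max_steps):
        action = toy_actor(config, state, step)
        next_state, reward, done = toy_env(state, action)
        transcript.append((state, action, reward))
        state = next_state
        if done:
            break
    return transcript


def toy_evolver(config: AgentConfig, transcript: Transcript) -> AgentConfig:
    """Stand-in for the Evolver Agent: propose a revised configuration."""
    memory = dict(config.memory)
    for state, action, reward in transcript:
        if reward > 0:                      # log effective state-action choices
            memory[state] = action
    hints = "; ".join(f"in {s}, try {a}" for s, a in sorted(memory.items()))
    return replace(
        config,
        prompt=f"Play the game. Known hints: {hints}",    # rewrite the prompt
        memory=memory,                                     # update memory
        temperature=max(0.1, config.temperature * 0.5),    # tune a hyperparameter
    )


def evolve_at_test_time(config: AgentConfig, episodes: int = 3):
    """Alternate acting and evolving; no gradients or fine-tuning involved."""
    scores = []
    for _ in range(episodes):
        transcript = run_episode(config)
        scores.append(sum(reward for _, _, reward in transcript))
        config = toy_evolver(config, transcript)
    return config, scores


final_config, scores = evolve_at_test_time(AgentConfig(prompt="Play the game."))
print(scores)
```

In this toy setting the per-episode scores are `[1, 6, 6]`: the first episode discovers one rewarding move, and the evolved configuration lets the next episode reach the goal. The point of the sketch is only the control flow, in which learning happens by replacing the configuration between episodes rather than by updating model weights.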
