Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm

Recent advances in large language models (LLMs) and agent system design have given agents unprecedented capabilities. However, newly developed agents are rapidly saturating existing agent benchmarks, making it difficult for these benchmarks to evaluate agent abilities. To address this problem, we propose the Trajectory-based Validated-by-Reproducing Agent-benchmark Complexity Evolution (TRACE) framework. TRACE takes an original task from an existing benchmark and encourages agents to freely explore and evolve it into a new, more difficult task while recording validatable agent trajectories. The framework proceeds in three stages: (1) evolutionary proposal mining, which generates task evolution proposals through preliminary exploration and divergent thinking; (2) problem formation and free exploration, in which proposals are turned into feasible problem candidates that the agents then explore freely while recording their execution trajectories; and (3) multi-level validation, which ensures that each evolved task is accompanied by a validatable and reproducible trajectory. Experiments on the GAIA benchmark show that TRACE consistently increases task complexity, while the validatable execution trajectories make the correctness of the evolved tasks more reliable. In addition, the framework adapts to and improves reasoning datasets such as AIME-2024. This work marks a paradigm shift from static, manually curated benchmarks to dynamic, self-evolving evaluation systems, providing a sustainable and challenging runway for agent development.
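The three stages can be pictured as a small control loop. The sketch below is only an illustration under stated assumptions, not the TRACE implementation: the names `Step`, `EvolvedTask`, `evolve_task`, `propose`, `explore`, and `reproduce` are hypothetical, the agent calls are replaced by toy stand-ins, and the multi-level validation is reduced to a single check that replaying the trajectory yields the recorded answer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    action: str       # what the agent did (hypothetical format)
    observation: str  # what the agent observed after acting


@dataclass
class EvolvedTask:
    question: str
    answer: str
    trajectory: List[Step] = field(default_factory=list)


def evolve_task(
    seed_question: str,
    propose: Callable[[str], List[str]],
    explore: Callable[[str, str], Optional[EvolvedTask]],
    reproduce: Callable[[EvolvedTask], str],
) -> Optional[EvolvedTask]:
    """Return the first evolved task whose trajectory reproduces its answer."""
    # Stage 1: evolutionary proposal mining.
    for proposal in propose(seed_question):
        # Stage 2: problem formation and free exploration.
        candidate = explore(seed_question, proposal)
        if candidate is None or not candidate.trajectory:
            continue  # no recorded trajectory, so nothing to validate
        # Stage 3 (simplified): accept only if a replay gives the same answer.
        if reproduce(candidate) == candidate.answer:
            return candidate
    return None


# Toy stand-ins for the agent calls (assumptions, for illustration only).
def toy_propose(question: str) -> List[str]:
    return ["convert the result to kilometres"]


def toy_explore(question: str, proposal: str) -> Optional[EvolvedTask]:
    return EvolvedTask(
        question=f"{question} Then {proposal}.",
        answer="5",
        trajectory=[
            Step("look up the trail length in metres", "5000"),
            Step("divide by 1000", "5"),
        ],
    )


def toy_reproduce(task: EvolvedTask) -> str:
    # Stand-in for re-executing the trajectory: report its final observation.
    return task.trajectory[-1].observation


evolved = evolve_task(
    "How long is the trail?", toy_propose, toy_explore, toy_reproduce
)
```

In this sketch a candidate without a trajectory, or one whose replay disagrees with its recorded answer, is simply discarded; a real pipeline would presumably apply further checks at this point.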
