Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, which obscures whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, a frozen update-validation set, held-out in-distribution (ID) and out-of-distribution (OOD) transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.
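To make the idea of complementary evaluation views concrete, the following is a minimal sketch of how per-snapshot records could be stored and queried. It is not SEAGym's implementation or API: the `SnapshotRecord` schema, the helper names, the `tolerance` threshold, and the numbers in `records` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SnapshotRecord:
    """Metrics for one harness snapshot (hypothetical schema, not SEAGym's)."""
    step: int          # index of the update that produced this snapshot
    train: float       # pass rate on the most recent training batch
    validation: float  # pass rate on the frozen update-validation set
    test_id: float     # held-out in-distribution pass rate
    test_ood: float    # held-out out-of-distribution pass rate
    replay: float      # pass rate on previously seen tasks
    cost: float        # cumulative cost, arbitrary units


def best_by_validation(records: List[SnapshotRecord]) -> SnapshotRecord:
    """Pick the snapshot with the highest validation score (earliest on ties)."""
    return max(records, key=lambda r: (r.validation, -r.step))


def late_collapse(records: List[SnapshotRecord], tolerance: float = 0.05) -> bool:
    """True if the final snapshot's held-out ID score fell more than
    `tolerance` below the best ID score seen at any snapshot."""
    peak = max(r.test_id for r in records)
    return peak - records[-1].test_id > tolerance


def replay_regressions(records: List[SnapshotRecord], tolerance: float = 0.05) -> List[int]:
    """Steps at which the replay score dropped by more than `tolerance`
    relative to the immediately preceding snapshot."""
    return [
        cur.step
        for prev, cur in zip(records, records[1:])
        if prev.replay - cur.replay > tolerance
    ]


# Made-up numbers for illustration only; these are not reported results.
records = [
    SnapshotRecord(0, 0.30, 0.30, 0.28, 0.20, 0.30, 1.0),
    SnapshotRecord(1, 0.45, 0.40, 0.36, 0.22, 0.42, 2.5),
    SnapshotRecord(2, 0.60, 0.38, 0.25, 0.18, 0.30, 4.0),
]

print("best snapshot by validation:", best_by_validation(records).step)
print("late collapse on held-out ID:", late_collapse(records))
print("replay regressions at steps:", replay_regressions(records))
```

In this toy trace the training score keeps rising while the held-out and replay scores fall after the second update, which is the kind of divergence a single sequential curve would hide.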