PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

Self-evolving agents improve by repeatedly proposing changes to their own prompts, skills, or workflows and keeping those that score higher on a small held-out set. Almost all effort has gone into the proposer that generates candidates; we argue the weak point is the acceptor, the rule that decides whether to commit a change. Applied hundreds of times against the same noisy dev estimate, the ubiquitous "keep it if the score went up" rule is uncontrolled adaptive multiple testing: the agent effectively p-hacks itself, accumulating false commits that make it churn and drift rather than improve. We recast committing as a sequential hypothesis test and propose PACE (Paired Anytime-valid Commit Evaluation), a training-free, anytime-valid commit gate. Each candidate is compared to the incumbent on identical instances and committed only when a testing-by-betting e-process accumulates decisive evidence, stopping early to save evaluations and controlling each candidate's false-commit probability at a user-set level even under optional stopping (a per-decision guarantee). On Qwen2.5 agents (0.5B-3B) self-evolving at the prompt level on GSM8K, SVAMP, and ARC-Challenge, greedy acceptance commits 30-42% false and 10-33% harmful edits when a genuine improvement is hidden among noisy proposals, while PACE commits the real one and essentially nothing else, matching greedy's held-out accuracy at sharply lower variance and about 18% lower evaluation cost. With no real gain available, greedy commits 13-21 spurious self-modifications per run (72-100% false) and degrades the most fragile agent by 4.9 points, while PACE holds at baseline. Reliability of self-evolution depends on the acceptor, not only on the proposer.
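
To make the commit gate concrete, the following is a minimal sketch of a paired testing-by-betting test, not the authors' implementation. It assumes each dev instance yields a paired difference d in [-1, 1] (candidate score minus incumbent score on the same instance) and tests the null hypothesis that the candidate is no better than the incumbent. The abstract does not specify PACE's betting strategy, so the fixed bet fraction used here is an assumption, and the names `paired_commit_gate`, `alpha`, and `bet` are hypothetical. With a bet fraction in [0, 1), the wealth process is a nonnegative supermartingale under the null, so by Ville's inequality the probability that it ever reaches 1/alpha is at most alpha, regardless of when evaluation stops.

```python
from typing import Iterable, Tuple


def paired_commit_gate(
    diffs: Iterable[float],
    alpha: float = 0.05,
    bet: float = 0.5,
) -> Tuple[bool, int, float]:
    """Sequentially test whether a candidate beats the incumbent.

    diffs: paired differences (candidate minus incumbent), each in [-1, 1].
    alpha: target false-commit probability for this single decision.
    bet:   fixed bet fraction in [0, 1); an assumption of this sketch.

    Returns (committed, instances_used, final_wealth).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0.0 <= bet < 1.0:
        raise ValueError("bet must lie in [0, 1) so wealth stays positive")

    wealth = 1.0             # initial capital of the bettor
    threshold = 1.0 / alpha  # Ville's inequality: P(sup wealth >= 1/alpha) <= alpha
    used = 0

    for d in diffs:
        if not -1.0 <= d <= 1.0:
            raise ValueError("paired differences must lie in [-1, 1]")
        used += 1
        # Bet on the candidate: wealth grows on wins, shrinks on losses,
        # and is unchanged on ties (d == 0).
        wealth *= 1.0 + bet * d
        if wealth >= threshold:
            # Decisive evidence: stop early and commit the candidate.
            return True, used, wealth

    # Dev instances exhausted without decisive evidence: keep the incumbent.
    return False, used, wealth
```

For example, with `alpha=0.05` and `bet=0.5`, a candidate that wins every paired comparison is committed after 8 instances, because 1.5^8 ≈ 25.6 is the first power of 1.5 to reach the threshold of 20. A candidate that only ties the incumbent, or that alternates wins and losses, is never committed.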
