Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but they often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5% and training-based methods such as Skill0 by about 5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge far larger counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash. These results point beyond memory-augmented agents toward genuinely qualified agent evolvers.
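To make the four memory operations named in the abstract (read, use, write, maintain) concrete, here is a minimal illustrative sketch. It is not the OPD-Evolver implementation. The abstract specifies only that the hierarchy has four levels, so the class names `MemoryItem` and `MemoryStore`, the word-overlap retrieval, the smoothed success-rate score, and the capacity-based pruning are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

NUM_LEVELS = 4  # the abstract states four levels; what each level stores is not specified


@dataclass
class MemoryItem:
    text: str
    level: int      # hypothetical index 0..3 into the hierarchy
    uses: int = 0   # how often the item was retrieved and acted on
    wins: int = 0   # how often the task then succeeded

    @property
    def value(self) -> float:
        # Laplace-smoothed success rate (an assumed scoring rule)
        return (self.wins + 1) / (self.uses + 2)


@dataclass
class MemoryStore:
    capacity: int = 8
    items: list = field(default_factory=list)

    def read(self, query: str, k: int = 2) -> list:
        """Read: return up to k items sharing words with the query."""
        q = set(query.lower().split())
        scored = [(len(q & set(m.text.lower().split())), m) for m in self.items]
        hits = [(s, m) for s, m in scored if s > 0]
        hits.sort(key=lambda sm: (sm[0], sm[1].value), reverse=True)
        return [m for _, m in hits[:k]]

    def credit(self, used: list, success: bool) -> None:
        """Use: after acting on retrieved items, record the task outcome."""
        for m in used:
            m.uses += 1
            m.wins += int(success)

    def write(self, text: str, level: int) -> None:
        """Write: store a new piece of reusable knowledge."""
        if not 0 <= level < NUM_LEVELS:
            raise ValueError(f"level must be in [0, {NUM_LEVELS})")
        self.items.append(MemoryItem(text, level))

    def maintain(self) -> None:
        """Maintain: drop duplicate texts, then keep the highest-value items."""
        seen, unique = set(), []
        for m in self.items:
            if m.text not in seen:
                seen.add(m.text)
                unique.append(m)
        unique.sort(key=lambda m: m.value, reverse=True)  # stable for ties
        self.items = unique[: self.capacity]
```

In this sketch the four operations are hand-written heuristics. The abstract's claim is different: the slow loop distills these abilities into the policy itself rather than leaving them as fixed rules.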