Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have shown significant potential in single-turn reasoning tasks. As the field shifts toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinders their deployment in resource-constrained environments. Outcome-based rewards are also inherently sparse, since agents typically receive feedback only upon task completion. To address these limitations, we introduce SEARL, a tool-memory-based self-evolving agentic framework. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This memory provides a novel state abstraction that facilitates generalization across analogous contexts, for example through tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.