Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinders their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards poses a further challenge, as agents typically receive feedback only upon task completion. To address these limitations, we introduce SEARL, a tool-memory-based self-evolving agentic framework. Unlike approaches that directly use interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This memory provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.