Agentic reinforcement learning (RL) lets LLM agents improve continuously from environment rewards, but the resulting policies do not systematically accumulate reusable strategies that generalize across tasks. Modular skills can supply such strategies, yet existing skill-augmented RL methods decouple skill creation from policy optimization and therefore risk adopting skills that conflict with the evolving policy. Inspired by Anthropic's Skill Creator, we introduce ReSkill, an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning. ReSkill exploits the group-wise structure of GRPO to embed three mechanisms with only marginal additional overhead: (1) an assertion-driven skill creator that diagnoses failures from past experience and proposes conditional, trigger-based skill revisions; (2) within-group rollout sampling that enables controlled comparison of skill versions, capturing which version best supports the policy's ongoing learning; and (3) Thompson Sampling with adaptive discounting to balance exploration and exploitation in skill version selection as the policy evolves. Across several domains, ReSkill consistently outperforms existing memory-based and skill-based RL methods, with the largest gains on unseen tasks. Analysis of the skill lifecycle shows skills being automatically created, tested, refined, and pruned as the policy improves, demonstrating reconciled skill-policy co-evolution.
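To make mechanism (3) concrete, the following is a minimal sketch of discounted Thompson Sampling over skill versions. It is an illustration under stated assumptions, not ReSkill's implementation: the abstract does not specify the reward model or the discounting rule, so the sketch assumes binary rollout outcomes, a Beta posterior per skill version, and a fixed `discount` factor standing in for the adaptive one. The names `SkillVersionStats` and `DiscountedThompsonSampler` are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SkillVersionStats:
    """Beta posterior over one skill version's success rate (binary reward assumed)."""
    alpha: float = 1.0  # prior pseudo-count of successes
    beta: float = 1.0   # prior pseudo-count of failures


@dataclass
class DiscountedThompsonSampler:
    """Hypothetical selector; a fixed discount stands in for adaptive discounting."""
    discount: float = 0.9
    stats: dict = field(default_factory=dict)
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def add_version(self, name):
        # New versions start from the uniform Beta(1, 1) prior.
        self.stats.setdefault(name, SkillVersionStats())

    def select(self):
        # Draw one sample per version from its posterior and pick the largest.
        draws = {
            name: self.rng.betavariate(s.alpha, s.beta)
            for name, s in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, name, successes, failures):
        # Shrink all accumulated evidence toward the prior, so rollouts
        # collected under an older policy count for less over time.
        for s in self.stats.values():
            s.alpha = 1.0 + self.discount * (s.alpha - 1.0)
            s.beta = 1.0 + self.discount * (s.beta - 1.0)
        # Then add the fresh within-group outcomes for the chosen version.
        chosen = self.stats[name]
        chosen.alpha += successes
        chosen.beta += failures
```

In a training loop, one would call `select()` to choose the skill version for a rollout group, count successful and failed rollouts in that group, and pass the counts to `update()`. Discounting toward the prior keeps uncertainty from collapsing, which lets a previously weak version be retried once the policy has changed.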