Agentic reinforcement learning (RL) can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills for high-level guidance and step skills for fine-grained decision support and error correction. D2Skill jointly trains the policy and the skill bank through paired baseline and skill-injected rollouts under the same policy, using their performance gap to derive hindsight utility signals for both skill updating and policy optimization. Built entirely from training-time experience, the skill bank is continuously expanded through reflection and maintained with utility-aware retrieval and pruning. Experiments on ALFWorld and WebShop with Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 show that D2Skill consistently improves success rates over skill-free baselines by 10 to 20 percentage points. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, and that the learned skills exhibit higher utility, transfer across evaluation settings, and introduce only modest training overhead.
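The paired-rollout mechanism described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how utility is computed or how pruning is triggered, so the class names (`Skill`, `SkillBank`), the exponential-moving-average update, the fixed `capacity`, and ranking by utility alone are all assumptions made for illustration. A real system would presumably also score skills by relevance to the current task or step.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    text: str
    granularity: str      # "task" (high-level) or "step" (fine-grained)
    utility: float = 0.0  # running estimate of how much the skill helps
    uses: int = 0


@dataclass
class SkillBank:
    capacity: int = 4      # hypothetical size limit that triggers pruning
    momentum: float = 0.5  # hypothetical EMA weight for utility updates
    skills: list = field(default_factory=list)

    def add(self, text, granularity):
        """Store a new skill, e.g. one produced by reflecting on a rollout."""
        skill = Skill(text, granularity)
        self.skills.append(skill)
        return skill

    def update(self, skill, baseline_return, skill_return):
        """Use the gap between paired rollouts as a hindsight utility signal."""
        gap = skill_return - baseline_return
        skill.utility = (1 - self.momentum) * skill.utility + self.momentum * gap
        skill.uses += 1
        return gap

    def retrieve(self, granularity, k=1):
        """Return the k highest-utility skills of the requested granularity."""
        pool = [s for s in self.skills if s.granularity == granularity]
        return sorted(pool, key=lambda s: s.utility, reverse=True)[:k]

    def prune(self):
        """Drop the lowest-utility skills once the bank exceeds capacity."""
        self.skills.sort(key=lambda s: s.utility, reverse=True)
        del self.skills[self.capacity:]


if __name__ == "__main__":
    bank = SkillBank(capacity=2)
    helpful = bank.add("avoid repeating an action that just failed", "step")
    harmful = bank.add("always pick the first search result", "step")
    # Paired rollouts under the same policy: one without the skill, one with it.
    bank.update(helpful, baseline_return=0.0, skill_return=1.0)
    bank.update(harmful, baseline_return=1.0, skill_return=0.0)
    print(bank.retrieve("step")[0].text)
```

In this sketch the same gap that updates a skill's utility is returned by `update`, so a caller could also feed it into the policy objective. How D2Skill actually combines the two is not stated in the abstract.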