Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun

From arXiv, 26 pages, 11 figures

Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability caused by multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance guided by the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, in which a replay buffer stores self-generated promising trajectories for off-policy updates, by gradually steering the policy's evolution within a well-balanced entropy range across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, using intrinsic rewards to foster skill-level exploration and SIL to facilitate action-level exploration. At first, the auxiliary tool-call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of environment feedback with an upward entropy trend. As training progresses, self-imitation is strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address potential policy drift. Regularizations, such as clipping tokens with high covariance between probability and advantage, are introduced into the trajectory-level entropy control to curb over-confidence.

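The abstract names three mechanisms without giving formulas: a curriculum that shifts weight from the tool-call reward to self-imitation, a replay buffer whose stored advantages are recalibrated, and clipping of tokens with high covariance between probability and advantage. The Python sketch below is only an illustration of how such pieces might fit together. Every name, schedule shape, and threshold in it (`curriculum_weights`, `warmup_frac`, `ReplayBuffer`, `covariance_clip_mask`, `clip_frac`, the use of log-probabilities, the baseline-subtraction rule) is an assumption made for this example and is not taken from the paper.

```python
import math
from collections import deque
from statistics import mean


def curriculum_weights(step, total_steps, warmup_frac=0.2):
    """Hypothetical schedule: the tool-call reward weight decays from 1 to 0
    (cosine), while the self-imitation weight ramps linearly from 0 to 1."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    tool_reward_weight = 0.5 * (1.0 + math.cos(math.pi * progress))
    sil_weight = min(progress / warmup_frac, 1.0)
    return tool_reward_weight, sil_weight


class ReplayBuffer:
    """Keeps self-generated trajectories that looked promising when sampled."""

    def __init__(self, capacity=4):
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first

    def add_if_promising(self, trajectory, reward, group_rewards):
        # Assumed criterion: keep a trajectory only if it beats its group's mean.
        if reward > mean(group_rewards):
            self.items.append((trajectory, reward))

    def recalibrated(self, current_baseline):
        # Assumed recalibration: recompute each advantage against a baseline
        # from the current policy and drop entries that are no longer positive.
        replay = []
        for trajectory, reward in self.items:
            advantage = reward - current_baseline
            if advantage > 0:
                replay.append((trajectory, advantage))
        return replay


def covariance_clip_mask(logprobs, advantages, clip_frac=0.2):
    """Return a per-token mask that is False for the tokens with the largest
    centered product of log-probability and advantage (assumed proxy for
    'high covariance'); masked tokens would be excluded from the update."""
    mean_lp, mean_adv = mean(logprobs), mean(advantages)
    cov = [(lp - mean_lp) * (a - mean_adv) for lp, a in zip(logprobs, advantages)]
    n_clip = int(len(cov) * clip_frac)
    ranked = sorted(range(len(cov)), key=lambda i: cov[i], reverse=True)
    clipped = set(ranked[:n_clip])
    return [i not in clipped for i in range(len(cov))]
```

In this sketch, a training loop would call `curriculum_weights` at every step to scale the auxiliary reward and the self-imitation loss, sample from `ReplayBuffer.recalibrated` for the off-policy update, and apply `covariance_clip_mask` to the token-level loss. How the actual method combines these pieces is not specified in the abstract.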