Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continued training often leads to policy entropy collapse: a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these entropy dynamics, yet existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first verify, both theoretically and empirically, the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism that uses dynamic clipping thresholds to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory-decay schedules. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
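To make the idea of a dynamic clipping threshold concrete, the following is a minimal sketch, not the paper's actual method: a PPO-style clipped surrogate loss in which the clip range is set by a training-step-dependent schedule (here, a simple increase-then-decrease shape). All names (`dynamic_clip_loss`, `eps_schedule`) and the specific schedule values are hypothetical illustrations.

```python
# Hedged sketch: a PPO-style clipped surrogate whose clip threshold follows a
# step-dependent schedule, illustrating how a dynamic clip range could be used
# to steer policy entropy. Not the paper's implementation; names are hypothetical.

import torch


def eps_schedule(step: int, total_steps: int,
                 eps_min: float = 0.1, eps_max: float = 0.3) -> float:
    """Hypothetical increase-then-decrease schedule for the clip threshold."""
    frac = step / max(total_steps, 1)
    # Rise linearly to eps_max at mid-training, then decay back toward eps_min.
    return eps_min + (eps_max - eps_min) * (1.0 - abs(2.0 * frac - 1.0))


def dynamic_clip_loss(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      step: int, total_steps: int) -> torch.Tensor:
    """Clipped policy-gradient surrogate with a step-dependent clip range."""
    eps = eps_schedule(step, total_steps)
    ratio = torch.exp(logp_new - logp_old)                 # importance sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()           # negate to maximize surrogate
```

Other schedules described in the abstract (decrease-increase-decrease, oscillatory decay) would slot in by swapping the body of `eps_schedule`, leaving the loss computation unchanged.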