Reinforcement Learning with Verifiable Rewards (RLVR) is a cornerstone technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, its training is often plagued by \emph{entropy collapse}, a rapid decline in policy entropy that limits exploration and undermines training effectiveness. While recent works attempt to mitigate this issue through heuristic entropy interventions, the underlying mechanisms remain poorly understood. In this work, we conduct comprehensive theoretical and empirical analyses of entropy dynamics in RLVR, offering two main insights: (1) we derive a tight analytical approximation for the token-level entropy change at each update step, which exposes four governing factors and provides a unified theoretical framework for explaining how existing methods influence entropy; (2) we identify a fundamental limitation of recent approaches: they rely on heuristic adjustments to only one or two of these factors and leave the others unconsidered, which inherently limits their effectiveness. Motivated by these findings, we propose STEER, a principled entropy-modulation method that adaptively reweights tokens based on theoretically estimated entropy variations. Extensive experiments across six mathematical reasoning and three coding benchmarks demonstrate that STEER effectively mitigates entropy collapse and consistently outperforms state-of-the-art baselines.