Entropy serves as a critical metric for measuring the diversity of outputs generated by large language models (LLMs), offering valuable insight into their exploration capabilities. While recent studies increasingly monitor and adjust entropy to better balance exploration and exploitation in reinforcement fine-tuning (RFT), a principled understanding of entropy dynamics during this process has yet to be established. In this paper, we develop a theoretical framework for analyzing entropy dynamics during RFT, starting from a discriminant expression that quantifies the entropy change under a single logit update. This foundation enables the derivation of a first-order expression for the entropy change, which we further extend to the update formula of Group Relative Policy Optimization (GRPO). The corollaries and insights drawn from this analysis inspire the design of entropy control methods and offer a unified lens for interpreting entropy-based methods in existing studies. We provide empirical evidence supporting the main conclusions of our analysis and demonstrate the effectiveness of the derived entropy-discriminator clipping methods. This study yields novel insights into RFT training dynamics, providing theoretical support and practical strategies for optimizing the exploration-exploitation balance during LLM fine-tuning.
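To make the kind of first-order analysis described above concrete, the sketch below numerically checks the standard first-order entropy-change formula for a softmax policy under a small logit update. This uses only the textbook gradient of softmax entropy, not the paper's specific discriminant expression or GRPO extension, and the helper names (`softmax`, `entropy`) and toy sizes are illustrative assumptions.

```python
import numpy as np

# A minimal sketch (not the paper's exact derivation): for a softmax policy
# pi = softmax(z), the entropy H(z) = -sum_i pi_i log pi_i has gradient
#   dH/dz_k = -pi_k (log pi_k + H),
# so a small logit update z -> z + dz changes entropy to first order by
#   dH ≈ -sum_k pi_k (log pi_k + H) dz_k = -Cov_{k~pi}(log pi_k, dz_k).
# Intuitively, raising the logits of already-likely tokens (positive
# covariance) decreases entropy; raising unlikely tokens increases it.

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def entropy(z):
    p = softmax(z)
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
z = rng.normal(size=8)          # logits over a toy vocabulary
dz = 1e-4 * rng.normal(size=8)  # a small logit update

p = softmax(z)
H = entropy(z)

# First-order prediction vs. exact finite difference.
dH_first_order = -np.sum(p * (np.log(p) + H) * dz)
dH_exact = entropy(z + dz) - entropy(z)
print(dH_first_order, dH_exact)  # agree up to O(||dz||^2)
```

The covariance form makes the sign of the entropy change easy to read off from any given update direction, which is the kind of diagnostic a discriminant expression for entropy change can provide.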