Reinforcement learning with verifiable rewards (RLVR) has proven highly effective at enhancing the reasoning capability of large language models (LLMs). However, this accuracy-oriented learning paradigm often suffers from entropy collapse, which curtails policy exploration and limits reasoning capability. To address this challenge, we propose an efficient reinforcement learning framework that leverages entropy signals at both the semantic and token levels to improve reasoning. On the data side, we introduce semantic entropy-guided curriculum learning, which orders training data from low to high semantic entropy so that optimization progresses from easier to more challenging tasks. On the algorithmic side, we treat tokens non-uniformly: we impose KL regularization on the low-entropy tokens that critically affect policy exploration, and apply stronger constraints to the high-covariance portions of these tokens. By jointly optimizing data organization and algorithm design, our method effectively mitigates entropy collapse and enhances LLM reasoning. Experiments on 6 benchmarks with base models at 3 different parameter scales show that our method outperforms other entropy-based approaches in improving reasoning.