Reinforcement learning (RL) has become a central component of post-training for large language models (LLMs), particularly for complex reasoning tasks that require stable optimization over long generation horizons. However, achieving performance at scale often introduces a fundamental trade-off between training stability and training efficiency. Token-level optimization applies fine-grained updates to individual tokens but is prone to high variance in its gradient estimates, which can destabilize training. Sequence-level optimization, in contrast, often relies on aggressive clipping mechanisms to keep updates stable; such designs may discard a large fraction of valid training samples, leading to inefficient gradient utilization and reduced training efficiency. We refer to this phenomenon as gradient underutilization. In this work, we propose Entropy Importance Sampling Policy Optimization (ESPO), a novel framework that combines fine-grained updates with stable training. ESPO partitions the tokens of each sequence into groups by predictive entropy, enabling (1) Entropy Grouping Importance Sampling, which captures intra-sequence heterogeneity, and (2) Entropy Adaptive Clipping, which dynamically allocates trust regions according to model uncertainty. Extensive experiments on mathematical reasoning benchmarks demonstrate that ESPO not only accelerates convergence but also achieves state-of-the-art performance, notably improving accuracy on challenging mathematical benchmarks.
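The grouping-and-clipping idea in the abstract can be illustrated with a minimal sketch: tokens of one sequence are bucketed into groups by predictive entropy, each group contributes a single importance ratio, and the clip range widens with the group's mean entropy. All names, the quantile bucketing, the geometric-mean ratio, and the specific clip schedule below are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def espo_surrogate(logp_new, logp_old, entropies, advantage,
                   n_groups=3, eps_base=0.2, eps_scale=0.1):
    """Illustrative ESPO-style surrogate for one sequence.

    logp_new / logp_old: per-token log-probs under the new and old policies.
    entropies: per-token predictive entropies of the old policy.
    advantage: a sequence-level advantage (GRPO-style, an assumption here).
    All hyperparameter names and values are hypothetical.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    entropies = np.asarray(entropies, dtype=float)

    # Assign each token to an entropy group via quantile bucketing.
    edges = np.quantile(entropies, np.linspace(0, 1, n_groups + 1)[1:-1])
    groups = np.searchsorted(edges, entropies)  # group index 0..n_groups-1

    losses = []
    for g in range(n_groups):
        mask = groups == g
        if not mask.any():
            continue
        # Group-level importance ratio: geometric mean of token ratios,
        # which averages log-ratios and so damps token-level variance.
        ratio = np.exp((logp_new[mask] - logp_old[mask]).mean())
        # Entropy-adaptive trust region: a wider clip range where the
        # model is more uncertain (higher mean entropy in this group).
        eps = eps_base + eps_scale * entropies[mask].mean() / (entropies.mean() + 1e-8)
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        # PPO-style pessimistic surrogate for this group.
        losses.append(min(ratio * advantage, clipped * advantage))
    return float(np.mean(losses))
```

When the new and old policies coincide, every group ratio is 1 and the surrogate reduces to the advantage itself; as the policies diverge, high-entropy groups are allowed a larger update before clipping, while low-entropy groups are constrained more tightly.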