Reinforcement learning (RL) has unlocked complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, which prevents further gains as RL training scales. This problem can be characterized by the collapse of entropy, a key diagnostic for exploration in RL. Existing methods try to prevent entropy collapse through regularization or clipping, but the resulting entropy curves are often unstable over long training runs, which hinders performance gains. In this paper, we introduce Entrocraft, a simple rejection-sampling approach that realizes any user-customized entropy schedule by biasing the advantage distribution. Entrocraft requires no objective regularization and is advantage-estimator-agnostic. Theoretically, we relate per-step entropy change to the advantage distribution under minimal assumptions, which explains the behavior of existing RL and entropy-preserving methods. Entrocraft also enables a systematic study of entropy schedules, where we find that linear annealing, which starts high and decays to a slightly lower target, performs best. Empirically, Entrocraft addresses performance saturation, significantly improving generalization, output diversity, and long-term training. It enables a 4B model to outperform an 8B baseline, sustains improvement for up to 4x longer before plateauing, and raises pass@K by 50% over the baseline.
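To make the notion of a linearly annealed entropy schedule concrete, the following is a minimal sketch and not the paper's implementation. The function names `linear_anneal_target` and `token_entropy`, and the example values 0.6 and 0.4, are hypothetical and chosen only for illustration. The abstract does not specify how Entrocraft's rejection sampling uses the target, so that step is deliberately omitted here.

```python
import math


def linear_anneal_target(step, total_steps, start_entropy, end_entropy):
    """Target entropy at a given training step under linear annealing.

    Assumption: the target moves in a straight line from start_entropy
    (high) to end_entropy (slightly lower) and is held constant once
    step reaches total_steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to the interval [0, 1].
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_entropy + frac * (end_entropy - start_entropy)


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    # Hypothetical schedule: start at 0.6 nats, anneal to 0.4 nats.
    for step in (0, 250, 500, 750, 1000):
        target = linear_anneal_target(step, 1000, 0.6, 0.4)
        print(step, round(target, 3))
```

In a training loop, the measured policy entropy (for example, `token_entropy` averaged over sampled tokens) would be compared against this target at each step. How the gap is then closed by biasing the advantage distribution is the part the paper itself defines.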