Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models (LLMs). However, it faces a fundamental limitation termed \textit{restricted exploration}, in which the policy rapidly converges to a narrow set of solutions. Entropy regularization is a popular way to sustain exploration, but it often proves unreliable for LLMs: it is highly sensitive to hyperparameters and yields only marginal performance gains. Motivated by these shortcomings, we rethink the relationship between policy entropy and exploration. By deriving a parametric formulation of group-relative advantage estimation and analyzing entropy dynamics, we conceptually decompose policy entropy into \textit{informative entropy}, which preserves diverse solution paths, and \textit{spurious entropy}, which erodes reasoning patterns. Our analysis reveals that effective exploration requires not blind entropy maximization but \textit{entropy refinement}, a mechanism implicitly embedded in group-relative advantage estimation that sustains informative entropy on positive rollouts while suppressing spurious entropy on negative ones. Guided by this insight, we propose \textbf{AsymGRPO}, an exploratory framework that explicitly decouples the modulation of positive and negative rollouts, allowing independent control over the preservation of informative entropy and the suppression of spurious entropy. Extensive experiments demonstrate that AsymGRPO outperforms strong baselines and shows potential to synergize with existing entropy regularization methods.
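To make the idea of decoupling positive and negative rollouts concrete, the following is a minimal sketch and not the paper's formulation. It assumes the commonly used group-relative advantage (each reward standardized by its group's mean and standard deviation) and then scales positive and negative advantages by separate coefficients. The function names and the `pos_scale` and `neg_scale` parameters are hypothetical, introduced only for illustration; the abstract does not specify how AsymGRPO parameterizes this modulation.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumption: the usual mean/std normalization; the paper's exact
    parametric form is not given in the abstract.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def asymmetric_advantages(rewards, pos_scale=1.0, neg_scale=1.0, eps=1e-6):
    """Scale positive and negative advantages independently.

    pos_scale and neg_scale are hypothetical knobs standing in for
    independent control of positive and negative rollouts.
    """
    adv = group_relative_advantages(rewards, eps)
    return np.where(adv > 0, pos_scale * adv, neg_scale * adv)


if __name__ == "__main__":
    # One correct rollout and three incorrect ones (binary verifiable reward).
    rewards = [1.0, 0.0, 0.0, 0.0]
    print(group_relative_advantages(rewards))
    print(asymmetric_advantages(rewards, pos_scale=1.0, neg_scale=2.0))
```

With `pos_scale == neg_scale == 1.0` the sketch reduces to the plain standardized advantage, so any asymmetry comes only from choosing the two coefficients differently.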