Reinforcement Learning with Verifiable Rewards (RLVR), particularly GRPO, has become the standard for eliciting LLM reasoning. However, its efficiency in exploration and difficulty adaptation remains an open challenge. In this work, we argue that these bottlenecks stem from an implicit advantage symmetry inherent in Group Relative Advantage Estimation (GRAE). This symmetry induces two critical limitations: (i) at the group level, the strict symmetry of weights between correct and incorrect trajectories leaves the logits of unsampled actions unchanged, thereby hindering exploration of novel correct solutions; (ii) at the sample level, the algorithm implicitly prioritizes medium-difficulty samples and remains agnostic to the non-stationary demands of difficulty focus. Through controlled experiments, we reveal that this symmetric property is sub-optimal, yielding two pivotal insights: (i) asymmetrically suppressing the advantages of correct trajectories encourages essential exploration; (ii) learning efficiency is maximized by a curriculum-like transition that prioritizes simpler samples initially before gradually shifting to complex ones. Motivated by these findings, we propose Asymmetric GRAE (A-GRAE), which dynamically modulates exploration incentives and sample-difficulty focus. Experiments across seven benchmarks demonstrate that A-GRAE consistently improves GRPO and its variants on both LLMs and MLLMs.
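To make the notion of advantage symmetry concrete, the following is a minimal sketch of standard group-relative advantage normalization as used in GRPO, together with an illustrative asymmetric variant that down-weights the advantages of correct (positive-advantage) trajectories. The function names and the `pos_scale` knob are hypothetical placeholders for exposition, not the paper's actual A-GRAE schedule.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRAE: normalize each trajectory's reward by the mean and
    std of its sampling group, so correct and incorrect rollouts receive
    symmetric positive/negative weights."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def asymmetric_advantages(rewards, pos_scale=0.5, eps=1e-6):
    """Illustrative asymmetric variant: scale positive (correct-trajectory)
    advantages by `pos_scale` while leaving negative advantages unchanged.
    `pos_scale` is an assumed fixed constant here; a dynamic schedule
    could modulate it over training."""
    adv = group_relative_advantages(rewards, eps)
    return np.where(adv > 0, pos_scale * adv, adv)

# Example: a group of 4 rollouts with binary verifiable rewards (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # symmetric weights for correct vs. incorrect
print(asymmetric_advantages(rewards))      # correct-trajectory advantages suppressed
```

Under this sketch, suppressing positive advantages reduces how strongly sampled correct trajectories are reinforced, leaving relatively more probability mass available for unsampled actions, which is the exploration effect the abstract describes.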