A prevailing view in Reinforcement Learning with Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this view and propose that the perceived trade-off may not be a fundamental constraint but an artifact of the measurement level. To investigate, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and introducing its first- and second-order derivatives, Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis shows that, at the hidden-state level, exploration and exploitation can be decoupled (Sec. 4), which opens an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is to use the theoretically stable ERA as a predictive meta-controller that creates a dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
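To make the three quantities concrete, the following is a minimal sketch, not the implementation used in this work. It assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) applied to a tokens-by-dimensions hidden-state matrix, and it approximates ERV and ERA by plain first- and second-order finite differences over a sequence of ER values. The function names `effective_rank` and `finite_differences`, the `eps` threshold, and the finite-difference form are illustrative assumptions; the exact definitions of ERV and ERA may differ.

```python
import numpy as np


def effective_rank(hidden_states, eps=1e-12):
    """Entropy-based effective rank of a (tokens x dim) matrix.

    Assumed definition: exp(H(p)), where p is the singular-value
    distribution normalized to sum to 1 and H is Shannon entropy.
    """
    matrix = np.asarray(hidden_states, dtype=np.float64)
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    total = singular_values.sum()
    if total <= eps:
        # Convention chosen for this sketch: an all-zero matrix has rank 0.
        return 0.0
    p = singular_values / total
    p = p[p > eps]  # drop numerically zero entries so log() is finite
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


def finite_differences(er_values):
    """First- and second-order differences of an ER sequence.

    Illustrative stand-ins for ERV (velocity) and ERA (acceleration).
    """
    er = np.asarray(er_values, dtype=np.float64)
    velocity = np.diff(er, n=1)       # ER[t] - ER[t-1]
    acceleration = np.diff(er, n=2)   # velocity[t] - velocity[t-1]
    return velocity, acceleration


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: ER measured on growing prefixes of one response.
    states = rng.normal(size=(64, 16))
    er_series = [effective_rank(states[:t]) for t in (16, 32, 48, 64)]
    erv, era = finite_differences(er_series)
    print("ER :", er_series)
    print("ERV:", erv)
    print("ERA:", era)
```

Under this definition, a matrix whose singular values are all equal has an effective rank equal to its full rank, while a rank-one matrix has an effective rank of 1, so larger values indicate that the hidden states spread over more directions.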