Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

A prevailing view in Reinforcement Learning with Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective and propose that the perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and introducing its first- and second-order derivatives, Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis shows that, at the hidden-state level, exploration and exploitation can be decoupled (Sec. 4), which opens an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is to use the theoretically stable ERA as a predictive meta-controller that creates a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to a 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
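To make the three quantities concrete, the following is a minimal sketch, not the paper's implementation. It assumes the common definition of effective rank as the exponential of the Shannon entropy of the normalized singular values of a hidden-state matrix, and it approximates ERV and ERA with plain first- and second-order finite differences over a sequence of ER values. The function names `effective_rank` and `rank_dynamics` are hypothetical, and the exact formulas used by VERL may differ.

```python
import numpy as np


def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a (tokens x hidden_dim) matrix.

    Assumed definition: exp of the Shannon entropy of the singular values
    after normalizing them to sum to one.
    """
    singular_values = np.linalg.svd(hidden_states, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions before taking the log
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


def rank_dynamics(er_series):
    """Finite-difference stand-ins for ERV and ERA (an assumption, see lead-in).

    Returns (erv, era): first and second differences of the ER sequence.
    """
    er = np.asarray(er_series, dtype=float)
    erv = np.diff(er, n=1)  # velocity: change in ER between consecutive steps
    era = np.diff(er, n=2)  # acceleration: change in that velocity
    return erv, era


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical hidden states: growing prefixes of a 32-token, 16-dim response.
    states = rng.normal(size=(32, 16))
    er_values = [effective_rank(states[:t]) for t in range(4, 33, 4)]
    erv, era = rank_dynamics(er_values)
    print("ER :", np.round(er_values, 3))
    print("ERV:", np.round(erv, 3))
    print("ERA:", np.round(era, 3))
```

Under this definition a matrix with equal singular values has an ER equal to its rank, while a rank-one matrix has an ER of about 1, so ER behaves like a soft count of the directions the hidden states actually occupy.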
