Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree-pruning mechanism. However, we identify a systemic pathology we term Recursive Space Contraction (RSC): an irreversible collapse, driven by the combined dynamics of positive sharpening and negative squeezing, in which the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this collapse, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), which shifts the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We show theoretically that APO serves as a gradient-aligned mechanism for maximizing support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy-gradient methods.
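To make the Shape Matching versus Support Coverage contrast concrete, a minimal sketch of the two objectives follows; the hinge penalty, the probability floor $\tau$, and the support threshold $\epsilon$ are illustrative assumptions, not the exact APO loss defined in the paper. Standard KL regularization penalizes any deviation of the policy's full density from the reference (Shape Matching):

\[
\mathcal{L}_{\mathrm{KL}}(\theta) \;=\; -\,\mathbb{E}_{y \sim \pi_\theta}\!\big[r(y)\big] \;+\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
\]

whereas a Support Coverage anchor of the kind described above would act only on the Safe Manifold $\mathcal{S} = \{\, y : \pi_{\mathrm{ref}}(y) \ge \epsilon \,\}$, and only when a supported branch drops below a probability floor:

\[
\mathcal{L}_{\mathrm{APO}}(\theta) \;\approx\; -\,\mathbb{E}_{y \sim \pi_\theta}\!\big[r(y)\big] \;+\; \beta \sum_{y \in \mathcal{S}} \max\!\big(0,\; \tau - \pi_\theta(y)\big).
\]

Under this illustrative form, probability mass is free to sharpen onto correct branches for efficiency, while the restorative term activates only when a valid alternative in $\mathcal{S}$ is squeezed below $\tau$, preventing the collapse described above.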