From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

Reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution $P(y|x)$, but its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution $P(y)$ in the Pre-train Space addresses this bottleneck: it encodes reasoning ability into the model while preserving broad exploration capacity. Conventional pre-training, however, learns passively from static corpora, and the resulting distribution shift hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to $P(y)$. We theoretically and empirically validate the strong gradient alignment between $\log P(y)$ and $\log P(y|x)$, establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL is a highly effective driver of reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89× and 6.54×, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that first trains the model with NSR-PreRL to expand the reasoning horizon and then transitions to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, indicating that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.

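
To make the idea of Negative Sample Reinforcement on a marginal distribution concrete, the following toy sketch may help. It is not the paper's objective, which the abstract does not state. It assumes $P(y)$ is a softmax over a handful of candidate outputs with no prompt, and that an NSR-style step is plain gradient descent on $\log P(y)$ for sampled outputs that a verifier marks incorrect. The names `nsr_step`, `correct_mass`, and `run_demo`, the learning rate, and the batch size are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nsr_step(logits, sampled, is_correct, lr=0.5):
    """One toy negative-sample update on a marginal distribution P(y).

    Assumption: only incorrect samples contribute, and each one pushes
    log P(y) down. The gradient of log softmax(logits)[y] with respect to
    the logits is onehot(y) - probs, so we step against that direction.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    n_neg = 0
    for y in sampled:
        if is_correct[y]:
            continue  # correct samples are ignored in this sketch
        n_neg += 1
        for k in range(len(logits)):
            grad[k] += (1.0 if k == y else 0.0) - probs[k]
    if n_neg == 0:
        return list(logits)  # nothing to penalize in this batch
    return [l - lr * g / n_neg for l, g in zip(logits, grad)]


def correct_mass(logits, is_correct):
    """Total probability that P(y) assigns to verifier-approved outputs."""
    return sum(p for p, ok in zip(softmax(logits), is_correct) if ok)


def run_demo(steps=50, batch=8, seed=0):
    """Sample from the current P(y), apply NSR, and report correct mass."""
    rng = random.Random(seed)
    is_correct = [True, False, False, False, False, False]
    logits = [0.0] * len(is_correct)
    for _ in range(steps):
        probs = softmax(logits)
        sampled = rng.choices(range(len(logits)), weights=probs, k=batch)
        logits = nsr_step(logits, sampled, is_correct)
    return correct_mass(logits, is_correct)


if __name__ == "__main__":
    print(f"correct mass after NSR: {run_demo():.3f}")
```

In this toy setting, penalizing a sampled incorrect output lowers its logit and raises the logits of every output that was not penalized in that batch, including incorrect ones that happened not to be sampled. Pruning therefore depends on repeated online sampling from the current distribution, which is loosely in the spirit of the reward-driven online updates described above.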