Foundation models exhibit broad knowledge but limited task-specific reasoning, motivating post-training strategies such as RL with verifiable rewards (RLVR) and test-time scaling (TTS). While recent work highlights the role of exploration in improving pass@K, empirical evidence points to a paradox: RLVR and outcome/process reward models (ORM/PRM) typically reinforce existing reasoning paths rather than expanding the reasoning scope, raising the question of why exploration helps if no new patterns emerge. To reconcile this paradox, we adopt the perspective of Kim et al. (2025), viewing easy (e.g., simplifying a fraction) versus hard (e.g., discovering a symmetry) reasoning steps as high- versus low-probability Markov transitions. In this tractable model, pretraining corresponds to tree-graph discovery, while post-training corresponds to chain-of-thought (CoT) reweighting. We prove that both RLVR and ORM/PRM concentrate heavily on a few high-probability paths and thereby forget rare-but-crucial CoTs. Building on this, we further prove that exploration strategies such as rejecting easy instances and KL regularization help preserve rare CoTs. Empirical simulations corroborate our theoretical results.
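The reweighting effect described in the abstract can be illustrated with a toy sketch. This is not the paper's Markov-chain model: it assumes a softmax policy over just three whole reasoning paths (a common correct one, a rare correct one, and a wrong one), a binary verifiable reward, and exact expected policy-gradient updates. The path probabilities, `beta`, `lr`, and `steps` are all hypothetical values chosen for illustration.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-probabilities of a softmax policy.
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def train(ref_probs, rewards, beta=0.0, lr=0.5, steps=2000):
    """Exact gradient ascent on E_p[reward] - beta * KL(p || ref).

    The policy is a softmax over whole paths, initialised at the
    reference (pretrained) distribution. beta=0 gives plain reward
    maximisation; beta>0 adds a KL penalty towards the reference.
    """
    ref_log = np.log(ref_probs)
    logits = ref_log.copy()
    for _ in range(steps):
        log_p = log_softmax(logits)
        p = np.exp(log_p)
        # Per-path learning signal: reward minus the KL penalty term.
        signal = rewards - beta * (log_p - ref_log)
        # Gradient of the objective w.r.t. logit k is p_k * (signal_k - E_p[signal]).
        logits += lr * p * (signal - p @ signal)
    return np.exp(log_softmax(logits))

# Hypothetical toy setup: [common correct path, rare correct path, wrong path].
ref = np.array([0.60, 0.05, 0.35])
rewards = np.array([1.0, 1.0, 0.0])

plain = train(ref, rewards, beta=0.0)    # reward only
with_kl = train(ref, rewards, beta=0.5)  # reward with KL regularization

print("reference    :", ref)
print("reward only  :", plain)
print("reward + KL  :", with_kl)
```

In this toy setting, both correct paths receive the same reward, yet the reward-only update pushes the logit of the common path up faster because the gradient is scaled by the path's current probability, so the rare path loses mass. With the KL term, the maximiser of the objective is proportional to `ref * exp(rewards / beta)`, which keeps the ratio between equally rewarded paths at its reference value. Rejecting easy instances, the other strategy named in the abstract, is not modelled here.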