Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.