Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectory demonstrations is a common approach for enabling reasoning in large language models. Standard practice typically retains only trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Surprisingly, we find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits these distinctive training dynamics by rescaling each sample's loss according to its inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and, when used as an RL initialization, boosting MMLU from 72.82% to 76.47%.
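To make the rescaling idea concrete, the sketch below shows one plausible way to turn per-sample inter-epoch progress into loss weights. It is not the paper's implementation: the gain definition, the softmax weighting form, the temperature, and the function names (glow_weights, weighted_sft_loss) are assumptions made here for exposition.

```python
import torch

def glow_weights(loss_prev_epoch, loss_curr_epoch, temperature=1.0):
    # Per-sample "gain": how much each trajectory's loss dropped since the
    # previous epoch (a hypothetical definition, used only for illustration).
    gain = loss_prev_epoch - loss_curr_epoch
    # Assumed weighting form: samples whose loss is falling fastest (easy to
    # memorize) are down-weighted, moderating their loss descent.
    weights = torch.softmax(-gain / temperature, dim=0)
    # Rescale so the weights average to 1, keeping the overall loss scale.
    return weights * weights.numel()

def weighted_sft_loss(per_sample_loss, per_sample_loss_prev_epoch):
    # per_sample_loss: mean token-level cross-entropy per trajectory, shape [B].
    w = glow_weights(per_sample_loss_prev_epoch.detach(),
                     per_sample_loss.detach())
    return (w * per_sample_loss).mean()

# Example: a batch of 4 trajectories with their previous-epoch losses.
prev = torch.tensor([2.0, 1.5, 1.2, 0.9])
curr = torch.tensor([1.0, 1.3, 1.1, 0.85], requires_grad=True)
loss = weighted_sft_loss(curr, prev)
```

Under these assumptions, the weights depend only on how fast each sample's loss is descending, so negative trajectories that are harder to fit keep more weight while quickly memorized positives are damped, which is one way such a scheme could moderate loss descent.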