Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in the reward estimate of Direct Preference Optimization (DPO). This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving previously learned knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and out-of-distribution (OOD) tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: https://github.com/Visual-AI/SPoT
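To make the contrast between DPO's relative ranking and a reward-based binary cross-entropy objective concrete, the following is a minimal sketch, not the SPoT implementation. It assumes the reward is the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), and that the binary objective applies a sigmoid cross-entropy to that reward with label 1 for a correct response and 0 for an incorrect one. The abstract does not state the exact form of the objective, so the function names, the dictionary keys, and the value of `beta` are all hypothetical.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy for a single logit and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def dpo_loss(pairs, beta=0.1):
    """Relative ranking: -log sigmoid(r_correct - r_incorrect), averaged over pairs."""
    total = 0.0
    for p in pairs:
        r_pos = implicit_reward(p["logp_correct"], p["ref_logp_correct"], beta)
        r_neg = implicit_reward(p["logp_incorrect"], p["ref_logp_incorrect"], beta)
        # -log sigmoid(m) equals the cross-entropy of logit m against label 1.
        total += bce_with_logits(r_pos - r_neg, 1.0)
    return total / len(pairs)


def reward_bce_loss(pairs, beta=0.1):
    """Assumed decoupled form: each response is classified on its own reward."""
    total = 0.0
    for p in pairs:
        r_pos = implicit_reward(p["logp_correct"], p["ref_logp_correct"], beta)
        r_neg = implicit_reward(p["logp_incorrect"], p["ref_logp_incorrect"], beta)
        total += bce_with_logits(r_pos, 1.0) + bce_with_logits(r_neg, 0.0)
    return total / len(pairs)


# Toy sequence log-probabilities (illustrative numbers, not from the paper).
at_reference = [{
    "logp_correct": -10.0, "ref_logp_correct": -10.0,
    "logp_incorrect": -12.0, "ref_logp_incorrect": -12.0,
}]
# Both responses gain the same log-probability relative to the reference.
shifted = [{
    "logp_correct": -5.0, "ref_logp_correct": -10.0,
    "logp_incorrect": -7.0, "ref_logp_incorrect": -12.0,
}]

print(dpo_loss(at_reference), dpo_loss(shifted))
print(reward_bce_loss(at_reference), reward_bce_loss(shifted))
```

In this sketch the DPO loss depends only on the reward margin, so raising both rewards by the same amount leaves it unchanged, whereas the binary objective penalizes the rising reward of the incorrect response. That is one way to read "decoupled supervision signals", under the assumptions stated above.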