Knowledge distillation from powerful reasoning models is widely used to improve Small Language Models (SLMs) on mathematical reasoning, often under the assumption that traces with higher reward-model scores provide more useful supervision. We identify a counterintuitive \textbf{Quality-Utility Paradox} in mathematical reasoning distillation. Data refined or synthesized by a stronger Oracle receives higher reward-model scores, and thus appears to be of higher quality, yet as training data it consistently underperforms traces generated by the SLM itself and selected through rejection sampling. This holds across the Qwen2.5, LLaMA-3, and DeepSeek model families. Our analysis shows that Oracle refinement couples logical repair with distributional drift away from the SLM's native reasoning distribution. This drift increases the learner's adaptation cost and can outweigh the benefit of the improved reasoning logic. To test this mechanism, we introduce \textbf{Style-Aligned Refinement}, which preserves the SLM's native trajectory style while retaining the Oracle's logical repair. This intervention lowers adaptation cost and restores downstream utility. These findings suggest that effective mathematical reasoning distillation should jointly optimize perceived solution quality and learner-data compatibility, rather than relying solely on reward-model scores. The datasets and code are available at https://github.com/Dracoqhl/Quality-Utility-Paradox.