Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization, the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression. This enables rapid in-domain optimization with limited task coverage but harms out-of-distribution (OOD) performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.