Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant; it is actively harmful, compounding errors with every unnecessary token.
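The core objective described above can be sketched in a few lines: take the model's distribution over each token of its own rollout (the student), take the same model's distribution conditioned on a "be concise" instruction (the teacher), and minimize per-token reverse KL between them. The sketch below, in NumPy for clarity, assumes both distributions are available as per-token logit arrays; the function name `per_token_reverse_kl` and the array shapes are illustrative, not part of the paper's actual implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def per_token_reverse_kl(student_logits, teacher_logits):
    """Mean per-token reverse KL, KL(student || teacher).

    student_logits, teacher_logits: [seq_len, vocab] arrays scored on the
    SAME student-generated rollout (on-policy). In OPSDC the "teacher" is
    the same model, just conditioned on a "be concise" instruction.
    Reverse KL is mode-seeking: the student is pushed toward tokens the
    concise teacher assigns high probability, rather than spreading mass
    over everything the teacher allows.
    """
    p = softmax(student_logits)                      # student distribution
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(teacher_logits) + 1e-12)  # teacher log-probs
    kl_per_token = (p * (log_p - log_q)).sum(axis=-1)  # [seq_len]
    return kl_per_token.mean()
```

In a training loop, this scalar would be backpropagated through the student logits only (the teacher pass is treated as a fixed target), which is what makes the procedure self-distillation rather than ordinary fine-tuning.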