Constraint-Rectified Training for Efficient Chain-of-Thought

Chain-of-Thought (CoT) reasoning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), especially when combined with reinforcement learning (RL) based post-training methods. While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, a phenomenon known as overthinking. Recent research seeks efficient reasoning strategies that balance reasoning length and accuracy, either through length-aware reward design or prompt-based calibration. However, these heuristic approaches can suffer severe accuracy drops and are highly sensitive to hyperparameters. To address these problems, we introduce CRT (Constraint-Rectified Training), a principled post-training framework based on reference-guarded constrained optimization, which yields a more stable and interpretable formulation for efficient reasoning. CRT alternates between minimizing reasoning length and rectifying accuracy, applying the latter only when performance falls below the reference, which enables stable and effective pruning of redundant reasoning. We further extend CRT with a two-stage training scheme that first discovers the shortest reliable reasoning patterns and then refines accuracy under a learnt length budget, preventing the re-emergence of verbose CoT. Our comprehensive evaluation shows that this framework consistently reduces token usage while maintaining answer quality. Further analysis reveals that CRT improves reasoning efficiency not only by shortening responses but also by reducing internal language redundancy, which motivates a new evaluation metric. Moreover, CRT-based training naturally yields a sequence of intermediate checkpoints that span a spectrum of explanation lengths while preserving correctness, enabling fine-grained control over reasoning verbosity without retraining.
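The alternation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how accuracy is estimated, how the switch is triggered, or what the rewards look like. The names `Rollout`, `choose_objective`, and `rewards`, the use of batch accuracy as the performance estimate, and the linear length reward capped at `max_tokens` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled response (hypothetical container, not from the paper)."""
    correct: bool    # whether the final answer matched the ground truth
    num_tokens: int  # length of the reasoning trace in tokens


def choose_objective(rollouts: List[Rollout], reference_accuracy: float) -> str:
    """Select the objective for the next update.

    Assumption: performance is estimated as batch accuracy. Accuracy is
    rectified only when it falls below the reference; otherwise the update
    minimizes reasoning length.
    """
    accuracy = sum(r.correct for r in rollouts) / len(rollouts)
    return "accuracy" if accuracy < reference_accuracy else "length"


def rewards(rollouts: List[Rollout], objective: str, max_tokens: int) -> List[float]:
    """Per-rollout rewards for the selected objective.

    Assumption: a 0/1 correctness reward when rectifying accuracy, and a
    linear reward in [0, 1] that favors shorter traces when minimizing length.
    """
    if objective == "accuracy":
        return [1.0 if r.correct else 0.0 for r in rollouts]
    return [1.0 - min(r.num_tokens, max_tokens) / max_tokens for r in rollouts]


if __name__ == "__main__":
    batch = [Rollout(correct=True, num_tokens=100),
             Rollout(correct=False, num_tokens=400)]
    objective = choose_objective(batch, reference_accuracy=0.6)
    print(objective, rewards(batch, objective, max_tokens=400))
```

In this sketch the reference accuracy acts as a guard: the length objective is active only while the constraint is satisfied, so shortening pressure is suspended whenever accuracy dips below the reference.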
