The reasoning capabilities of large language models (LLMs) have improved substantially through increased test-time computation, typically in the form of intermediate tokens known as chain-of-thought (CoT). However, CoT often becomes unnecessarily long, increasing computational cost without corresponding accuracy gains and sometimes even degrading performance, a phenomenon known as ``overthinking''. We propose a multi-stage efficient reasoning method that combines supervised fine-tuning (via rejection sampling or reasoning-trace reformatting) with reinforcement learning using an adaptive length penalty. We introduce a lightweight reward function that penalizes tokens generated after the first correct answer while encouraging self-verification only when it is beneficial. We conduct a holistic evaluation across seven diverse reasoning tasks, analyzing the accuracy--response length trade-off. Our approach reduces response length by an average of 28\% for 8B models and 40\% for 32B models, while incurring only minor performance drops of 1.6 and 2.5 points, respectively. Despite its conceptual simplicity, it achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods, scoring 76.6 in terms of the area under the Overthinking-Adjusted Accuracy curve ($\text{AUC}_{\text{OAA}}$): 5 points above the base model and 2.5 points above the second-best approach.
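As an illustrative sketch only (the exact reward used in this work is not reproduced here, and the symbols $t^{*}$, $T$, and $\lambda$ are our own notation), one way to instantiate a penalty on tokens generated after the first correct answer is to let $t^{*}$ denote the position of the first token at which the extracted answer is already correct, $T$ the total response length, and to subtract from the correctness reward a term proportional to the fraction of post-answer tokens:
\[
R(y) \;=\; \mathrm{acc}(y) \;-\; \lambda \,\frac{\max\bigl(0,\, T - t^{*}\bigr)}{T},
\]
where $\mathrm{acc}(y) \in \{0, 1\}$ indicates whether the final answer is correct and $\lambda$ is a hypothetical penalty weight. An adaptive variant could, for example, relax the penalty when the post-answer tokens correspond to self-verification that turns an incorrect answer into a correct one.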