Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test-time scaling. However, they often generate over-long chains of thought driven by excessive reflection, such as repetitive self-questioning and circular reasoning, which leads to high token consumption, substantial computational overhead, and increased latency without improving accuracy, particularly in smaller models. Our observations reveal that increasing problem complexity induces more excessive and unnecessary reflection, which in turn reduces accuracy and increases token overhead. To address this challenge, we propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a novel reinforcement learning framework designed to dynamically balance reasoning efficiency and solution accuracy. ARLCP introduces two key innovations: (1) a reflection penalty that adaptively curtails unnecessary reflective steps while preserving essential reasoning, and (2) a length penalty calibrated to the estimated complexity of the problem. By coordinating these penalties, ARLCP encourages the model to generate concise yet effective reasoning paths. We evaluate our method on five mathematical reasoning benchmarks using the DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B models. Experimental results show that ARLCP achieves a superior efficiency-accuracy trade-off compared to existing approaches: for the 1.5B model, it reduces average response length by 53.1% while simultaneously improving accuracy by 5.8%; for the 7B model, it achieves a 35.0% length reduction with a 2.7% accuracy gain. The code is released at https://github.com/ZeweiYu1/ARLCP .
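To make the coordination of the two penalties concrete, the shaped reward can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function name `arlcp_reward`, the penalty coefficients `alpha` and `beta`, and the way the reflection allowance and length budget scale with an estimated complexity score are all assumptions for exposition.

```python
# Hypothetical sketch of an ARLCP-style shaped reward for RL fine-tuning.
# All names, coefficients, and scaling rules below are illustrative assumptions,
# not taken from the paper.

def arlcp_reward(correct: bool,
                 n_reflections: int,
                 response_length: int,
                 est_complexity: float,
                 alpha: float = 0.1,        # assumed reflection-penalty weight
                 beta: float = 0.001,       # assumed length-penalty weight
                 tokens_per_unit: float = 200.0) -> float:
    """Combine an accuracy reward with coordinated reflection and length penalties.

    Harder problems (higher est_complexity) are granted a larger reflection
    allowance and a longer token budget, so only *excess* reflection and
    *excess* length beyond the complexity-calibrated budget are penalized.
    """
    accuracy_reward = 1.0 if correct else 0.0

    # Reflection penalty: charge only for reflective steps beyond the allowance.
    allowed_reflections = max(1, round(est_complexity))
    reflection_penalty = alpha * max(0, n_reflections - allowed_reflections)

    # Length penalty: charge only for tokens beyond the complexity-scaled budget.
    length_budget = tokens_per_unit * est_complexity
    length_penalty = beta * max(0.0, response_length - length_budget)

    return accuracy_reward - reflection_penalty - length_penalty
```

Under this sketch, a correct answer that stays within both the reflection allowance and the length budget receives the full reward, while excess reflection and excess length each subtract from it, giving the policy an incentive toward concise reasoning without punishing the deliberation that harder problems genuinely require.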