Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from overthinking, often generating excessively long and redundant answers. For problems that exceed the model's capabilities, LRMs tend to be overconfident, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to lie as close as possible to the optimal length and to be as concentrated as possible. Based on these conditions, we propose using the difficulty-level average length as a well-founded reference for length optimization. Extensive experiments on both in-domain and out-of-domain benchmarks validate the effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade-off between accuracy and length. The code is available at https://github.com/Yinan-Xia/DDPO.
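To make the idea of difficulty-differentiated length shaping concrete, the following is a minimal sketch and not the released DDPO implementation. Everything in it is an assumption made for illustration: the function name `shaped_rewards`, the use of per-prompt rollout accuracy as a difficulty proxy, the `easy_threshold` and `alpha` parameters, and the linear, clipped form of the length term. It only shows the general pattern the abstract describes: the average length at each difficulty level serves as the reference, correct answers on simple prompts are nudged to be shorter than that reference, and rollouts on hard prompts are nudged to be longer.

```python
from statistics import mean


def shaped_rewards(groups, easy_threshold=0.5, alpha=0.2):
    """Hypothetical length-shaped rewards (illustrative only).

    groups: one list of rollouts per prompt; each rollout is a
            (length_in_tokens, is_correct) pair with length > 0.
    Returns a list of reward lists with the same shape as `groups`.
    """
    # Assumed difficulty proxy: a prompt is "easy" when its rollout
    # accuracy reaches easy_threshold, otherwise "hard".
    level_of = []
    for rollouts in groups:
        accuracy = mean(1.0 if correct else 0.0 for _, correct in rollouts)
        level_of.append("easy" if accuracy >= easy_threshold else "hard")

    # Reference length: average rollout length at each difficulty level.
    reference = {}
    for level in set(level_of):
        lengths = [
            length
            for rollouts, lv in zip(groups, level_of)
            if lv == level
            for length, _ in rollouts
        ]
        reference[level] = mean(lengths)

    rewards = []
    for rollouts, level in zip(groups, level_of):
        ref = reference[level]
        group_rewards = []
        for length, correct in rollouts:
            base = 1.0 if correct else 0.0
            # Relative deviation from the reference, clipped to [-1, 1].
            deviation = max(-1.0, min(1.0, (length - ref) / ref))
            if level == "easy":
                # Easy prompts: shorter correct answers earn a small bonus;
                # incorrect answers get no length term.
                bonus = -alpha * deviation if correct else 0.0
            else:
                # Hard prompts: longer attempts earn a small bonus, to
                # discourage overly short, overconfident answers.
                bonus = alpha * deviation
            group_rewards.append(base + bonus)
        rewards.append(group_rewards)
    return rewards
```

In a GRPO-style loop, rewards of this kind would typically be normalized within each prompt group before being used as advantages; that step is omitted here, and the thresholds above would need tuning rather than being taken as recommended values.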