Large language models (LLMs) show strong reasoning abilities but often produce unnecessarily long explanations that reduce efficiency. Although reinforcement learning (RL) has been used to improve reasoning, most methods focus on accuracy and rely on uniform length-based rewards that ignore the differing contributions of individual tokens, which often harms correctness. We revisit length optimization in RL from the perspective of token significance. Observing that many chain-of-thought (CoT) tokens contribute little to the final answer, we introduce a significance-aware length reward that selectively penalizes insignificant tokens, reducing redundancy while preserving essential reasoning. We also propose a dynamic length reward that encourages more detailed reasoning early in training and gradually shifts toward conciseness as learning progresses. Integrating these components into standard policy optimization yields a framework that improves both reasoning efficiency and accuracy. Experiments across multiple benchmarks demonstrate substantial reductions in response length while preserving or improving correctness, highlighting the importance of modeling token significance for efficient LLM reasoning.
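The two reward components described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the formulation used in this work: the per-token significance scores are taken as given inputs, and the `threshold` cutoff, the linear schedule, the binary accuracy reward, and all function names are hypothetical. In particular, the dynamic reward is approximated here by simply relaxing the length penalty early in training and ramping it up later.

```python
from typing import Sequence


def significance_aware_length_penalty(
    significance: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff for "insignificant"
) -> float:
    """Return the fraction of tokens scoring below `threshold`.

    Only low-significance tokens count toward the penalty, so a long
    response made of significant tokens is not penalized.
    """
    if not significance:
        return 0.0
    insignificant = sum(1 for s in significance if s < threshold)
    return insignificant / len(significance)


def dynamic_length_weight(
    step: int, total_steps: int, max_weight: float = 1.0
) -> float:
    """Linearly ramp the penalty weight from 0 to `max_weight`.

    Assumed schedule: no length pressure at the start of training,
    full pressure at the end.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max_weight * progress


def total_reward(
    correct: bool,
    significance: Sequence[float],
    step: int,
    total_steps: int,
    threshold: float = 0.5,
    max_weight: float = 1.0,
) -> float:
    """Combine a binary accuracy reward with the weighted length penalty."""
    accuracy_reward = 1.0 if correct else 0.0
    penalty = significance_aware_length_penalty(significance, threshold)
    weight = dynamic_length_weight(step, total_steps, max_weight)
    return accuracy_reward - weight * penalty


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.2, 0.8]  # made-up significance scores
    for step in (0, 50, 100):
        print(step, total_reward(True, scores, step, total_steps=100))
```

Under these assumptions, the same correct response earns a lower reward late in training than early, and only because of its low-significance tokens. A scalar reward of this kind could then be passed to a standard policy optimization loop.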