Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs, increasing computational cost during both training and inference. Although length control methods have been proposed, it remains unclear what output length best balances efficiency and performance. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition during training, while properly tuned length control can improve efficiency for models with strong prior reasoning ability. By extending prior analyses to RL-trained policies, we identify two failure modes: (1) overly long outputs increase dispersion, and (2) overly short outputs lead to under-thinking.
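For concreteness, one common way a length penalty enters such RL setups is as a subtractive term in the scalar reward. The sketch below is a minimal illustration under assumed names (`target_len`, `alpha`) and an assumed excess-length clipping scheme; it is not the specific formulation evaluated in this work.

```python
def length_penalized_reward(correct: bool, num_tokens: int,
                            target_len: int = 2048,
                            alpha: float = 0.001) -> float:
    """Correctness reward minus a penalty on tokens beyond a target length.

    Illustrative only: `target_len` and `alpha` are hypothetical
    hyperparameters, not values from the paper.
    """
    base = 1.0 if correct else 0.0
    overflow = max(0, num_tokens - target_len)  # penalize only the excess
    return base - alpha * overflow

# Example: a correct 3,000-token answer earns 1.0 - 0.001 * 952 = 0.048,
# so the policy is pushed toward shorter chains of thought.
```

Under this kind of shaping, the gradient signal trades correctness against verbosity, which is consistent with the abstract's observation that penalties applied too early can suppress the long exploratory traces needed to acquire reasoning in the first place.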