Chain-of-thought reasoning in large language models can trigger an "overthinking trap": longer rollouts raise cost and latency yet often fail to deliver reliable accuracy gains. Existing methods impose global, static length controls that can suppress reasoning the model still needs. We propose a mastery-gated, sample-level, soft reinforcement-learning compression method that penalizes a long rollout only when the model already solves the problem and has produced a shorter rollout. Across benchmarks, it cuts response length by 20-40% at comparable or higher accuracy and generalizes across domains: a model trained only on math spontaneously shortens its outputs on unseen tasks (code, instruction following, general-knowledge QA) without hurting accuracy. We further show two-way transfer between non-agent CoT and tool-use agents: non-agent training reduces SWE-Bench Verified interaction rounds by 13%, while compressing a thinking agent cuts SWE trajectory tokens by 67% and rounds by 52%, and shortens non-agent outputs by up to 44%. Compression is thus not cosmetic brevity but an inherent computation policy: what to keep, and what to forget.
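To make the gating concrete, the sketch below shows one way such a reward could be shaped, assuming a GRPO-style group of rollouts sampled per problem. The `Rollout` fields, the relative-excess penalty form, and the `alpha` weight are illustrative assumptions for this sketch, not the paper's implementation.

```python
# Hypothetical sketch of mastery-gated, sample-level, soft length shaping.
# Names, penalty form, and constants are assumptions, not the paper's code.
from dataclasses import dataclass


@dataclass
class Rollout:
    n_tokens: int   # rollout length in tokens
    correct: bool   # did this rollout solve the problem?


def shaped_rewards(group: list[Rollout], alpha: float = 0.1) -> list[float]:
    """Reward each rollout in a group sampled for one problem.

    The length penalty is gated twice:
      * mastery gate -- the group must contain a correct rollout, so an
        unsolved problem is never pushed toward shorter reasoning;
      * sample gate  -- only rollouts longer than the shortest correct
        one are penalized, and only softly (scaled, capped excess).
    """
    correct_lengths = [r.n_tokens for r in group if r.correct]
    if not correct_lengths:
        # Mastery gate: the problem is unsolved, so reward correctness
        # only and apply no length pressure at all.
        return [1.0 if r.correct else 0.0 for r in group]

    shortest = min(correct_lengths)
    rewards = []
    for r in group:
        base = 1.0 if r.correct else 0.0
        # Soft penalty: relative excess over the shortest correct rollout,
        # clipped to [0, 1] so accuracy always dominates the signal.
        excess = max(0, r.n_tokens - shortest) / max(shortest, 1)
        rewards.append(base - alpha * min(excess, 1.0))
    return rewards
```

Under this shaping, a group in which every rollout fails receives no length pressure at all, so the policy gradient never trades accuracy for brevity on problems the model has not yet mastered.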