The inference overhead induced by redundant reasoning undermines the interactive experience and severely bottlenecks the deployment of Large Reasoning Models (LRMs). Existing reinforcement learning (RL)-based solutions tackle this problem by coupling a length penalty with outcome-based rewards. This simplistic reward weighting struggles to reconcile brevity with accuracy, as enforcing brevity may compromise critical reasoning logic. In this work, we address this limitation by proposing a multi-agent RL framework that selectively penalizes redundant chunks while preserving essential reasoning logic. Our framework, Self-Compression via MARL (SCMA), instantiates redundancy detection and evaluation through two specialized agents: \textbf{a Segmentation Agent} that decomposes the reasoning process into logical chunks, and \textbf{a Scoring Agent} that quantifies the significance of each chunk. The Segmentation and Scoring agents collaboratively define an importance-weighted length penalty during training, incentivizing \textbf{a Reasoning Agent} to prioritize essential logic without introducing inference overhead during deployment. Empirical evaluations across model scales demonstrate that SCMA reduces response length by 11.1\% to 39.0\% while boosting accuracy by 4.33\% to 10.02\%. Furthermore, ablation studies and qualitative analysis validate that the synergistic optimization within the MARL framework fosters emergent behaviors, yielding more powerful LRMs compared to vanilla RL paradigms.
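One plausible form of the importance-weighted length penalty described above can be sketched as follows. This is a minimal illustration, not the paper's actual reward function: the functional form, the weighting `(1 - score)`, and the coefficient `lam` are all assumptions introduced for clarity. The idea is that chunks the Scoring Agent rates as unimportant contribute more to the penalty, so the Reasoning Agent is pushed to drop them while essential chunks remain nearly free.

```python
def importance_weighted_length_penalty(chunk_lengths, chunk_scores, lam=0.01):
    """Hypothetical penalty: each chunk's token length is scaled by how
    unimportant the Scoring Agent judges it (score in [0, 1], higher = more
    essential). Essential chunks (score near 1) incur almost no penalty."""
    assert len(chunk_lengths) == len(chunk_scores)
    return lam * sum(length * (1.0 - score)
                     for length, score in zip(chunk_lengths, chunk_scores))


def reward(outcome_correct, chunk_lengths, chunk_scores, lam=0.01):
    """Outcome-based reward minus the selective length penalty (sketch)."""
    r_outcome = 1.0 if outcome_correct else 0.0
    return r_outcome - importance_weighted_length_penalty(
        chunk_lengths, chunk_scores, lam)


# Two chunks of 10 tokens each: one essential (score 1.0), one redundant
# (score 0.0). Only the redundant chunk is penalized, unlike a uniform
# length penalty, which would charge both equally.
r = reward(True, chunk_lengths=[10, 10], chunk_scores=[1.0, 0.0])
```

Under a conventional uniform length penalty, both chunks would be penalized identically; here the essential chunk contributes nothing, which is the selectivity the abstract contrasts against vanilla length-penalized RL.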