Through reinforcement learning (RL) with outcome-correctness rewards, large reasoning models (LRMs) that scale inference computation have demonstrated substantial success on complex reasoning tasks. However, this one-sided reward, focused solely on final correctness, provides little supervision over the internal reasoning process. The deficiency leads to suboptimal internal reasoning quality, manifesting as issues such as over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by recent progress in LRM self-rewarding, we introduce a self-rewriting framework in which a model rewrites its own reasoning texts and then learns from the rewritten reasoning to improve the quality of its internal thought process. For the algorithm design, we propose a selective rewriting approach in which only "simple" samples, defined by the model's consistent correctness on them, are rewritten, thereby preserving all original reward signals of GRPO. For the practical implementation, we pack rewriting and vanilla generation into a single batch, maintaining the scalability of the RL algorithm while introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. On the accuracy-length tradeoff, self-rewriting achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%), even though the rewriting prompts contain no explicit instruction to reduce reasoning length, outperforming existing strong baselines. On internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating the internal reasoning flaws above.
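To make the selective-rewriting idea concrete, below is a minimal sketch of how "simple" samples (groups where every GRPO rollout is correct) could be detected and have their reasoning replaced by a self-rewritten version while reusing the original rewards. The names `Rollout`, `is_simple`, `build_training_batch`, and the `policy.rewrite_reasoning` call are hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
# Sketch: selective self-rewriting inside a GRPO-style training loop.
# Assumes each prompt has a group of rollouts scored by an outcome-correctness reward.

from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    prompt: str
    reasoning: str   # the model's chain-of-thought text
    answer: str
    reward: float    # outcome-correctness reward (1.0 = correct, 0.0 = wrong)


def is_simple(group: List[Rollout]) -> bool:
    """A prompt is 'simple' when the model is consistently correct on it,
    i.e. every rollout in its GRPO group earns the correctness reward."""
    return all(r.reward == 1.0 for r in group)


def build_training_batch(groups: List[List[Rollout]], policy) -> List[Rollout]:
    """Assemble one batch mixing vanilla rollouts with self-rewritten ones.

    Only 'simple' groups are rewritten; their original rewards are reused,
    so every sample keeps its outcome-reward signal for GRPO.
    """
    batch: List[Rollout] = []
    for group in groups:
        if is_simple(group):
            # Ask the model to rewrite its own reasoning for these easy prompts
            # (hypothetical call; the rewrite prompt need not mention length).
            batch.extend(
                Rollout(
                    prompt=r.prompt,
                    reasoning=policy.rewrite_reasoning(r.prompt, r.reasoning),
                    answer=r.answer,
                    reward=r.reward,  # keep the original correctness reward
                )
                for r in group
            )
        else:
            # Hard or inconsistently-solved prompts keep their vanilla rollouts.
            batch.extend(group)
    return batch


if __name__ == "__main__":
    class StubPolicy:
        def rewrite_reasoning(self, prompt: str, reasoning: str) -> str:
            return reasoning.strip()  # stand-in for an LLM rewrite call

    group = [Rollout("2+2?", " 2 plus 2 is 4. ", "4", 1.0) for _ in range(2)]
    print(build_training_batch([group], StubPolicy()))
```

Because rewritten and vanilla rollouts sit in the same batch, the sketch reflects the paper's claim that the RL loop itself is unchanged and only a modest generation overhead is added.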

