Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

Source: arXiv. Preprint, 25 pages, 9 figures. Code: https://github.com/WooooDyy/Critique-RL

Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
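The two reward signals described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Episode` fields, the 0/1 reward values, and the weight `beta` are all hypothetical, and the abstract does not specify the form of the stage II regularization, so a weighted discriminability term is used here purely as a stand-in.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One actor -> critic -> actor round (hypothetical structure)."""
    initial_correct: bool      # actor's first response passes a rule-based check
    critic_says_correct: bool  # critic's verdict on that first response
    refined_correct: bool      # actor's refined response passes the same check


def discriminability_reward(ep: Episode) -> float:
    """Direct, rule-based signal: 1 if the critic's verdict matches ground truth."""
    return 1.0 if ep.critic_says_correct == ep.initial_correct else 0.0


def helpfulness_reward(ep: Episode) -> float:
    """Indirect signal: 1 if the actor's refinement ends up correct."""
    return 1.0 if ep.refined_correct else 0.0


def stage_one_reward(ep: Episode) -> float:
    """Stage I: reinforce discriminability only."""
    return discriminability_reward(ep)


def stage_two_reward(ep: Episode, beta: float = 0.5) -> float:
    """Stage II: helpfulness plus an assumed term that preserves discriminability.

    The weighted sum is an illustrative stand-in for the paper's
    regularization, whose exact form is not given in the abstract.
    """
    return helpfulness_reward(ep) + beta * discriminability_reward(ep)


if __name__ == "__main__":
    # The critic correctly flags a wrong answer, and the refinement fixes it.
    ep = Episode(initial_correct=False, critic_says_correct=False, refined_correct=True)
    print(stage_one_reward(ep), stage_two_reward(ep))
```

Under these assumptions, a critic that wrongly rejects a correct response still earns helpfulness reward in stage II if the refinement stays correct, but it forgoes the discriminability term. This mirrors the abstract's point that indirect rewards alone can leave discriminability poor.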
