Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors to annotate critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates in a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, their discriminability (i.e., determining whether a response is high-quality) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
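To make the two-stage reward design concrete, the sketch below illustrates one plausible way the critic's reward could be computed under this scheme, assuming the critic emits an explicit correct/incorrect verdict alongside its feedback. The function names, the verdict format, and the regularization weight `alpha` are illustrative assumptions for exposition, not the paper's reference implementation.

```python
# Minimal sketch of the two-stage reward design described above.
# Assumes a rule-based checker (e.g., exact match against the gold answer)
# can label whether the actor's original and refined responses are correct.

def verdict_reward(critic_verdict: str, response_is_correct: bool) -> float:
    """Stage I: direct rule-based reward for discriminability.

    The critic must judge whether the actor's response is correct; the verdict
    is scored against the rule-based check, so no stronger supervisor is needed.
    """
    predicted_correct = critic_verdict.strip().lower() == "correct"
    return 1.0 if predicted_correct == response_is_correct else 0.0


def refinement_reward(refined_is_correct: bool) -> float:
    """Indirect reward: did the actor's critique-guided refinement succeed?"""
    return 1.0 if refined_is_correct else 0.0


def stage2_reward(critic_verdict: str,
                  response_is_correct: bool,
                  refined_is_correct: bool,
                  alpha: float = 0.5) -> float:
    """Stage II: helpfulness via the refinement outcome, regularized by the
    stage-I discriminability term so the critic's judgment does not drift."""
    return (refinement_reward(refined_is_correct)
            + alpha * verdict_reward(critic_verdict, response_is_correct))
```

In this sketch, stage I optimizes the critic with `verdict_reward` alone, while stage II switches to `stage2_reward`, whose second term plays the role of the regularization that preserves discriminability while the indirect refinement signal improves helpfulness.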