Most reinforcement learning (RL) methods for training large language models (LLMs) require ground-truth labels or task-specific verifiers, limiting scalability when correctness is ambiguous or expensive to obtain. We introduce Reinforcement Learning from Meta-Evaluation (RLME), which optimizes a generator using a reward derived from an evaluator's answers to natural-language meta-questions (e.g., "Is the answer correct?" or "Is the reasoning logically consistent?"). RLME treats the evaluator's probability of a positive judgment as a reward and updates the generator via group-relative policy optimization, enabling learning without labels. Across a suite of experiments, we show that RLME achieves accuracy and sample efficiency comparable to label-based training, enables controllable trade-offs among multiple objectives, steers models toward reliable reasoning patterns rather than post-hoc rationalization, and generalizes to open-domain settings where ground-truth labels are unavailable, broadening the domains in which LLMs can be trained with RL.
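As a minimal sketch of the training signal described above (the notation here is our own assumption, not taken from the abstract): for a prompt $x$, the generator samples a group of $G$ responses $y_1, \dots, y_G$; the evaluator $p_\phi$ is asked a meta-question $q$ about each response, its probability of a positive answer serves as the reward, and rewards are normalized group-relatively in the style of group-relative policy optimization:

\[
r_i \;=\; p_\phi\bigl(\text{``yes''} \mid x,\, y_i,\, q\bigr),
\qquad
A_i \;=\; \frac{r_i - \operatorname{mean}\bigl(\{r_j\}_{j=1}^{G}\bigr)}{\operatorname{std}\bigl(\{r_j\}_{j=1}^{G}\bigr)}.
\]

Under this reading, the advantage $A_i$ drives a standard clipped policy-gradient update of the generator, so no ground-truth label ever enters the reward.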