Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, offer a promising path for extending the success of reasoning models to non-verifiable domains, where output correctness/quality cannot be directly checked. However, while reasoning judges have shown stronger performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. We therefore conduct a rigorous study of the impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, in which a "gold-standard" judge (gpt-oss-120b) provides preference annotations to train smaller judges, reveals key differences between non-reasoning and reasoning judges: non-reasoning judges are easily reward-hacked, whereas reasoning judges can yield policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that the reasoning-judge-trained policies achieve this strong performance by learning to generate highly effective adversarial outputs that also score well on popular benchmarks such as Arena-Hard by deceiving other LLM judges. Together with our further analysis, our study highlights both important findings and room for improvement in applying (reasoning) LLM judges to non-verifiable LLM post-training.