LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.
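To make the evaluation setup concrete, the sketch below shows one way a pairwise judge could be scored against response pairs carrying objective correctness labels, which is the measurement JudgeBench reports. This is a minimal illustration, not the official JudgeBench harness: the `ResponsePair` fields, the `judge` callable, and the `judge_accuracy` helper are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the official JudgeBench code): score a
# pairwise judge by the fraction of verdicts that match objective labels.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ResponsePair:
    question: str
    response_a: str
    response_b: str
    label: str  # "A" or "B": which response is objectively correct


def judge_accuracy(judge: Callable[[str, str, str], str],
                   items: Iterable[ResponsePair]) -> float:
    """Fraction of pairs where the judge's verdict matches the objective label."""
    items = list(items)
    if not items:
        return 0.0
    correct = sum(
        1 for it in items
        if judge(it.question, it.response_a, it.response_b) == it.label
    )
    return correct / len(items)


if __name__ == "__main__":
    # Tiny hand-made examples; a real judge would prompt an LLM with the
    # question and both responses, then parse its verdict into "A" or "B".
    data = [
        ResponsePair("What is 2 + 2?", "4", "5", "A"),
        ResponsePair("What is the capital of France?", "Berlin", "Paris", "B"),
    ]
    always_a = lambda q, a, b: "A"  # trivial stand-in judge
    print(f"accuracy = {judge_accuracy(always_a, data):.2f}")  # 0.50, i.e. chance level
```

Under this metric, a judge that cannot distinguish correct from incorrect responses hovers near 0.5, which is the baseline against which the abstract's "slightly better than random guessing" observation is made.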