LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/ScalerLab/JudgeBench.
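To make the evaluation setup concrete, the sketch below shows one way a pairwise judge could be scored against response pairs carrying objective correctness labels, which is the measurement JudgeBench reports. This is a minimal illustration, not the official JudgeBench harness: the `ResponsePair` fields, the `judge` callable, and the `judge_accuracy` helper are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the official JudgeBench code): score a
# pairwise judge by the fraction of verdicts that match objective labels.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ResponsePair:
    question: str
    response_a: str
    response_b: str
    label: str  # "A" or "B": which response is objectively correct


def judge_accuracy(judge: Callable[[str, str, str], str],
                   items: Iterable[ResponsePair]) -> float:
    """Fraction of pairs where the judge's verdict matches the objective label."""
    items = list(items)
    if not items:
        return 0.0
    correct = sum(
        1 for it in items
        if judge(it.question, it.response_a, it.response_b) == it.label
    )
    return correct / len(items)


if __name__ == "__main__":
    # Tiny hand-made examples; a real judge would prompt an LLM with the
    # question and both responses, then parse its verdict into "A" or "B".
    data = [
        ResponsePair("What is 2 + 2?", "4", "5", "A"),
        ResponsePair("What is the capital of France?", "Berlin", "Paris", "B"),
    ]
    always_a = lambda q, a, b: "A"  # trivial stand-in judge
    print(f"accuracy = {judge_accuracy(always_a, data):.2f}")  # 0.50, i.e. chance level
```

Under this metric, a judge that cannot distinguish correct from incorrect responses hovers near 0.5, which is the baseline against which the abstract's "slightly better than random guessing" observation is made.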