Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and the potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing the objective knowledge reasoning of LLMs and evaluate them alongside human annotations, which we found to have high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models -- both base and instruction-tuned. We assess the judge models' alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment, as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo have excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.
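To make the percent-agreement-versus-kappa point concrete, the following is a minimal illustrative sketch (the data are hypothetical, not taken from the paper's experiments): a judge that marks every answer correct can reach 90% agreement with human labels on a skewed dataset, yet its chance-corrected Cohen's kappa is zero, and the score it assigns to the exam taker is inflated relative to the human score.

```python
# Illustrative sketch: high percent agreement can mask judge leniency,
# while Cohen's kappa exposes it. Hypothetical data, not from the paper.
from sklearn.metrics import cohen_kappa_score

# Binary verdicts over 100 answers (1 = correct, 0 = incorrect).
human = [1] * 90 + [0] * 10   # human annotators mark 90% of answers correct
judge = [1] * 100             # a maximally lenient judge marks all correct

percent_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)

print(f"percent agreement: {percent_agreement:.2f}")  # 0.90 -- looks strong
print(f"Cohen's kappa:     {kappa:.2f}")              # 0.00 -- no better than chance
# The judge also inflates the exam taker's score from 90% to 100%,
# so rankings derived from this judge can diverge from human rankings.
```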