Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and the potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing the objective knowledge reasoning of LLMs and evaluate them alongside human annotations, which we found to have high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models -- both base and instruction-tuned. We assess the judge models' alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment, as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo have excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.
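To make the percent-agreement-versus-kappa point concrete, the following is a minimal illustrative sketch (the data are hypothetical, not taken from the paper's experiments): a judge that marks every answer correct can reach 90% agreement with human labels on a skewed dataset, yet its chance-corrected Cohen's kappa is zero, and the score it assigns to the exam taker is inflated relative to the human score.

```python
# Illustrative sketch: high percent agreement can mask judge leniency,
# while Cohen's kappa exposes it. Hypothetical data, not from the paper.
from sklearn.metrics import cohen_kappa_score

# Binary verdicts over 100 answers (1 = correct, 0 = incorrect).
human = [1] * 90 + [0] * 10   # human annotators mark 90% of answers correct
judge = [1] * 100             # a maximally lenient judge marks all correct

percent_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)

print(f"percent agreement: {percent_agreement:.2f}")  # 0.90 -- looks strong
print(f"Cohen's kappa:     {kappa:.2f}")              # 0.00 -- no better than chance
# The judge also inflates the exam taker's score from 90% to 100%,
# so rankings derived from this judge can diverge from human rankings.
```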