Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose fine-tuning LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively on open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at three scales (7B, 13B, and 33B parameters) and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases that arise when fine-tuning an LLM as a judge and categorize them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques, including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM achieves state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM is also efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, exceeding 90%, which even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in judging single answers, multimodal models, multiple answers, and multi-turn chat.
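The abstract names swap augmentation, reference support, and reference drop without spelling out how they operate. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that swap augmentation exchanges the positions of the two candidate answers together with their scores (to counter position bias), that reference support means attaching an optional reference answer to a sample, and that reference drop randomly removes that reference (so the judge sees both formats). The `JudgeSample` type, its field names, the drop probability `p`, and the `agreement` helper are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class JudgeSample:
    """Hypothetical training sample for a pairwise judge (names are assumptions)."""
    question: str
    answer_1: str
    answer_2: str
    score_1: float
    score_2: float
    reference: Optional[str] = None  # assumed form of "reference support"


def swap_augment(sample: JudgeSample) -> JudgeSample:
    """Assumed swap augmentation: exchange answer positions and their scores."""
    return replace(
        sample,
        answer_1=sample.answer_2,
        answer_2=sample.answer_1,
        score_1=sample.score_2,
        score_2=sample.score_1,
    )


def reference_drop(sample: JudgeSample, rng: random.Random, p: float = 0.5) -> JudgeSample:
    """Assumed reference drop: remove the reference with probability p."""
    if sample.reference is not None and rng.random() < p:
        return replace(sample, reference=None)
    return sample


def verdict(sample: JudgeSample) -> str:
    """Turn a pair of scores into a three-way verdict."""
    if sample.score_1 > sample.score_2:
        return "answer_1"
    if sample.score_2 > sample.score_1:
        return "answer_2"
    return "tie"


def agreement(predicted: List[str], teacher: List[str]) -> float:
    """Fraction of samples where the judge's verdict matches the teacher's."""
    if not teacher or len(predicted) != len(teacher):
        raise ValueError("verdict lists must be non-empty and of equal length")
    matches = sum(p == t for p, t in zip(predicted, teacher))
    return matches / len(teacher)
```

Under these assumptions, a swapped sample carries the opposite positional verdict for the same underlying preference, so a judge trained on both orderings cannot score well by favoring a fixed position. The `agreement` helper illustrates one plausible way to compute the agreement figure the abstract reports; the exact metric definition is not given here.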