A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation of the analysis scales up the complex evaluation of the victim models' free-form text outputs by faster and more consistent judgments compared to human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparably new technique, LLMs as judges lack a thorough investigation for their reliability and agreement to human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As assessment objective, we curate datasets for eight different categories of judgment tasks and the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments, when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geqslant$ 32B parameters, and a few smaller models like Qwen2.5 14B.
翻译:大型语言模型(LLM)作为评判者,通过分析受害者机器学习(ML)模型(尤其是LLM)的输出,评估其质量。LLM作为评判者的实现包含一个模型以及一个专门设计的评判提示,该提示包含分析标准。由此实现的自动化分析,相较人工评审员,能以更快速且更一致的评判方式,扩展对受害者模型自由文本输出的复杂评估规模。因此,对LLM的质量与安全性评估可覆盖受害者模型的大量用例。由于LLM作为评判者是一项相对较新的技术,其在可靠性与人类评判一致性方面尚缺乏深入研究。本工作评估了LLM作为受害者LLM自动化质量评估者的适用性。我们测试了37个不同规模的对话式LLM与5种不同评判提示的组合效果、二级评判者的概念,以及5个针对该任务微调的评估模型。作为评估目标,我们为八类不同的评判任务整理了数据集,并基于人类评估提供了对应的真实标签。我们的实证结果表明,当结合合适的提示时,LLM作为评判者与人类评估具有高度相关性,尤其对于GPT-4o、多个参数≥32B的开源模型以及少数较小模型(如Qwen2.5 14B)表现显著。