The "LLM-as-a-Judge" paradigm, in which Large Language Models (LLMs) serve as automated evaluators, is pivotal to LLM development, offering scalable feedback on complex tasks. However, the reliability of these judges is compromised by various biases. Existing research has concentrated heavily on biases in comparative evaluations. In contrast, scoring-based evaluations, which assign an absolute score and are often more practical in industrial settings, remain under-investigated. To address this gap, we undertake the first dedicated examination of scoring bias in LLM judges. We shift the focus from biases tied to the evaluation targets to those originating from the scoring prompt itself. We formally define scoring bias and identify three novel, previously unstudied types: rubric order bias, score ID bias, and reference answer score bias. We propose a comprehensive framework to quantify these biases, featuring a suite of multi-faceted metrics and an automatic data synthesis pipeline that creates a tailored evaluation corpus. Our experiments empirically demonstrate that even the most advanced LLMs suffer from substantial scoring biases. Our analysis yields actionable insights for designing more robust scoring prompts and mitigating these newly identified biases.
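One of the identified biases, rubric order bias, can be probed with a simple perturbation test: present the judge with the same question, answer, and rubric, varying only the order in which the rubric's score levels are listed, and measure how often the assigned score changes. The sketch below illustrates this idea; the `judge` callable is a hypothetical stand-in for an LLM scoring call, and the prompt template and flip-rate metric are illustrative assumptions, not the paper's exact framework.

```python
from typing import Callable, Dict, List

# Illustrative 1-5 rubric (hypothetical, not from the paper's corpus).
RUBRIC = {
    5: "Fully correct, clear, and complete.",
    3: "Partially correct with notable gaps.",
    1: "Largely incorrect or off-topic.",
}

def build_prompt(question: str, answer: str, ascending: bool) -> str:
    """Render a scoring prompt with rubric levels in a chosen order."""
    levels = sorted(RUBRIC) if ascending else sorted(RUBRIC, reverse=True)
    rubric_text = "\n".join(f"{s}: {RUBRIC[s]}" for s in levels)
    return (
        "Score the answer on a 1-5 scale using this rubric:\n"
        f"{rubric_text}\n\nQuestion: {question}\nAnswer: {answer}\nScore:"
    )

def flip_rate(judge: Callable[[str], int],
              items: List[Dict[str, str]]) -> float:
    """Fraction of items whose score changes when only the rubric order changes."""
    flips = 0
    for it in items:
        s_up = judge(build_prompt(it["q"], it["a"], ascending=True))
        s_down = judge(build_prompt(it["q"], it["a"], ascending=False))
        flips += s_up != s_down
    return flips / len(items)
```

An unbiased judge yields a flip rate of 0; any positive rate indicates sensitivity to rubric order alone, since the question, answer, and rubric content are held fixed across the two prompt variants.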