The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work has mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators, as they (1) exhibit familiarity bias, a preference for text with lower perplexity; (2) show skewed and biased distributions of ratings; and (3) experience anchoring effects in multi-attribute judgments. We also find that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over state-of-the-art LLM evaluators.
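The familiarity bias in (1) is stated in terms of perplexity. As a minimal sketch, assuming per-token natural-log probabilities are available from some language model, perplexity is the exponential of the average negative log-likelihood. The function name and the sample values below are illustrative assumptions, not taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a text from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical log-probabilities for two summaries (illustrative values only).
familiar = [-0.1, -0.2, -0.1, -0.3]    # tokens the model finds likely
unfamiliar = [-1.5, -2.0, -0.9, -1.8]  # tokens the model finds unlikely

print(perplexity(familiar) < perplexity(unfamiliar))
```

Under this definition, a text whose tokens the model finds more predictable has lower perplexity, which is the kind of text the abstract reports LLM evaluators as preferring.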