Large language model (LLM) judges are often used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, reason more effectively, and are more robust to paraphrasing. However, LLM judges exhibit biases, including length and position biases, and are vulnerable to various adversarial input prompts. While recent studies have examined these biases, few have analyzed them at a granular level in relation to a well-defined overlap metric. In this work we analyze LLM judge bias as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarity between the judged summaries (measured by ROUGE and BLEU) decreases; this pattern holds for all but one model tested and persists regardless of the models' own position biases. Additionally, we find that models struggle to judge even summaries with limited overlap, suggesting that LLM-as-a-judge in the summarization domain should rely on techniques beyond simple comparison.
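The overlap measures named above can be illustrated with a minimal sketch. This is not the paper's evaluation code; it is a self-contained, pure-Python approximation of the core terms of ROUGE-N (n-gram recall) and BLEU's clipped n-gram precision (without BLEU's brevity penalty or multi-n averaging), using simple whitespace tokenization as a stand-in for the real tokenizers those metrics use.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams recovered by the candidate."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)

def bleu_n(candidate, reference, n=1):
    """Clipped n-gram precision, the core term of BLEU (no brevity penalty)."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())
    return overlap / max(sum(cand.values()), 1)

# Hypothetical example pair: a human reference vs. a model summary
human = "the cat sat on the mat"
model = "the cat lay on the mat"
print(rouge_n(model, human))  # 5 of 6 reference unigrams recovered
print(bleu_n(model, human))   # 5 of 6 candidate unigrams matched
```

Under this sketch, "low overlap" between a pair of judged summaries simply means scores like these fall toward zero when one summary is scored against the other.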