Human relevance assessment is time-consuming and cognitively demanding, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to substitute for human assessors. In this work, we conduct a systematic study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores -- often with high confidence -- to passages that do not genuinely satisfy the underlying information need, revealing a systematic bias rather than random fluctuations in judgment. Controlled experiments further show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about the use of LLMs as drop-in replacements for human relevance assessors and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs to relevance assessment. Our code and results are publicly available.