Human relevance assessment is time-consuming and cognitively demanding, which limits the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match human judgments. In this work, we conduct a study of \textit{overrating behavior} in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores, often with high confidence, to passages that do not genuinely satisfy the underlying information need, revealing a systematic bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about using LLMs as drop-in replacements for human relevance assessors and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs to relevance assessment. Our code and results are publicly available.