LLM-based relevance assessment still can't replace human relevance assessment

The use of large language models (LLMs) for relevance assessment in information retrieval has gained significant attention, with recent studies suggesting that LLM-based judgments yield evaluations comparable to those based on human judgments. Notably, based on TREC 2024 data, Upadhyay et al. make the bold claim that LLM-based relevance assessments, such as those generated by the Umbrela system, can fully replace traditional human relevance assessments in TREC-style evaluations. This paper critically examines this claim, highlighting practical and theoretical limitations that undermine its validity. First, we question whether the evidence provided by Upadhyay et al. genuinely supports their claim, particularly when the test collection is intended to serve as a benchmark for future research innovations. Second, we submit a system deliberately crafted to exploit automatic evaluation metrics, demonstrating that it can achieve artificially inflated scores without truly improving retrieval quality. Third, we simulate the consequences of circularity by analyzing Kendall's tau correlations under the hypothetical scenario in which all systems adopt Umbrela as a final-stage re-ranker, illustrating how reliance on LLM-based assessments can distort system rankings. Finally, we discuss theoretical challenges (the inherent narcissism of LLMs, the risk of overfitting to LLM-based metrics, and the potential degradation of future LLM performance) that must be addressed before LLM-based relevance assessments can be considered a viable replacement for human judgments.
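
As an illustration of the Kendall's tau comparison mentioned above, the following is a minimal sketch of how agreement between two system rankings could be measured. It is not the paper's code: the function name `kendall_tau`, the tau-a variant (no tie correction), and the score values are all assumptions chosen for illustration only.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems.

    Assumption: simple tau-a, which applies no correction for ties.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same systems")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two systems to compare rankings")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product says whether both evaluations
        # order systems i and j the same way.
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical per-system scores (not data from the paper):
# one list from human judgments, one from LLM-based judgments.
human_scores = [0.62, 0.55, 0.48, 0.41, 0.30]
llm_scores = [0.70, 0.66, 0.52, 0.58, 0.35]

tau = kendall_tau(human_scores, llm_scores)
print(f"Kendall's tau: {tau:.2f}")
```

In this made-up example the two evaluations disagree on one of the ten system pairs (the third and fourth systems swap places), so tau is 0.8. A value of 1 would mean identical rankings and -1 fully reversed rankings.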
