Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. Although large language models (LLMs) are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an \emph{LLM effect}: recent systems incorporating LLM components achieve 8.8\% higher nDCG@10 on DL20 than the best result from TREC 2020, and approximately 20\% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks. Although excluding contaminated topics reduces effectiveness, confidence intervals remain wide, making it difficult to determine whether the LLM effect reflects genuine methodological advances or memorization from pretraining data.