Evaluation in Information Retrieval relies on post-hoc empirical procedures, which are time-consuming and expensive. To alleviate this, Query Performance Prediction (QPP) models have been developed to estimate the performance of a system without the need for human-made relevance judgements. Such models, which usually rely on lexical features from queries and corpora, have been applied to traditional sparse IR methods with varying degrees of success. With the advent of neural IR and large Pre-trained Language Models, the retrieval paradigm has shifted significantly towards more semantic signals. In this work, we study to what extent current QPP models can predict the performance of such systems. Our experiments consider seven traditional bag-of-words and seven BERT-based IR approaches, as well as nineteen state-of-the-art QPPs, evaluated on two collections, Deep Learning '19 and Robust '04. Our findings show that QPPs perform statistically significantly worse on neural IR systems. In settings where semantic signals are prominent (e.g., passage retrieval), QPP performance on neural models drops by as much as 10% compared to bag-of-words approaches. Moreover, in lexical-oriented scenarios, QPPs fail to predict the performance of neural IR systems on the queries where those systems differ most from traditional approaches.