The standard practice in query performance prediction (QPP) evaluation is to measure a set-level correlation between the estimated retrieval qualities and the true ones. However, this correlation-based evaluation neither quantifies QPP effectiveness at the level of individual queries nor connects to a downstream application, which means that QPP methods yielding high correlation values may not prove useful for query-specific decisions in an IR pipeline. In this paper, we propose a downstream-focused evaluation framework in which the distribution of QPP estimates across the lists of top-ranked documents retrieved by several rankers is used as a set of priors for IR fusion. On the one hand, a distribution of these estimates that closely matches that of the true retrieval qualities indicates the quality of the predictor; on the other hand, their use as priors indicates a predictor's ability to make informed choices in an IR pipeline. Our experiments, first, establish the importance of QPP estimates in weighted IR fusion, yielding substantial improvements of over 4.5% relative to unweighted CombSUM and RRF fusion strategies, and, second, reveal the new insight that the downstream effectiveness of QPP does not correlate well with standard correlation-based QPP evaluation.
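The weighted fusion setting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: each ranker contributes a run (a mapping from document ids to scores), and a hypothetical per-ranker QPP estimate for the current query is used as the fusion weight in CombSUM (over min-max normalised scores) and in reciprocal rank fusion (RRF).

```python
from collections import defaultdict

def weighted_combsum(runs, qpp_weights):
    """CombSUM over min-max normalised scores, weighted by QPP estimates.

    runs: list of {doc_id: score} dicts, one per ranker.
    qpp_weights: one (assumed) QPP estimate per ranker for this query.
    Returns doc ids sorted by fused score, best first.
    """
    fused = defaultdict(float)
    for run, w in zip(runs, qpp_weights):
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0  # avoid division by zero for constant runs
        for doc, score in run.items():
            fused[doc] += w * (score - lo) / span
    return sorted(fused, key=fused.get, reverse=True)

def weighted_rrf(runs, qpp_weights, k=60):
    """Reciprocal rank fusion with per-ranker QPP weights (k=60 is the
    commonly used RRF constant)."""
    fused = defaultdict(float)
    for run, w in zip(runs, qpp_weights):
        ranking = sorted(run, key=run.get, reverse=True)
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] += w / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)
```

Setting all weights to 1 recovers the unweighted CombSUM and RRF baselines that the reported 4.5% improvement is measured against.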