Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available. Query performance prediction (QPP) addresses this need, but most existing methods rely on external predictors after retrieval or reranking. In this paper, we study \textit{reranker-internal QPP}: can an LLM reranker estimate the quality of the ranking it has just produced? We investigate both training-free and training-based approaches. For training-free estimation, we examine metric-specific self-consistency across sampled rankings and verbalized confidence produced directly by the reranker. Experiments on TREC Deep Learning 2019--2022 with four LLMs show that self-consistency is competitive with the state-of-the-art (SOTA) approach and better calibrated in almost all settings, while direct verbalized confidence is severely overconfident. To improve verbalized confidence, we propose two supervised methods, Verb-Num and Verb-List, which enable LLM rerankers to produce calibrated ranking-quality estimates with only a few additional output tokens.