Evaluating the quality of search, ranking, and RAG systems traditionally requires a large number of human relevance annotations. Recently, several deployed systems have explored the use of Large Language Models (LLMs) as automated judges for this task, but their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines a small number of human annotations with LLM judgments to produce reliable estimates of metrics that require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift in an LLM-based query reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduce the computational complexity from O(2^|C|) to O(2^K), where |C| is the corpus size (on the order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric, while effectively correcting for LLM bias in low-resource settings.
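To illustrate the idea that PRECISE builds on, the following is a minimal sketch of the standard PPI estimator for a mean, applied here to per-query Precision@K scores. It is not the PRECISE method itself, which works with sub-instance (query-document) annotations. The function name `ppi_mean` and the toy inputs are assumptions made for illustration. The estimate is the average LLM-judged score on the large unlabeled set plus a correction term (the "rectifier") measured on the small human-annotated set.

```python
import math
import statistics


def ppi_mean(human_scores, llm_scores_labeled, llm_scores_unlabeled):
    """Standard PPI point estimate and standard error for a mean.

    human_scores:         human-judged metric per query (small set, size n)
    llm_scores_labeled:   LLM-judged metric on the same n queries
    llm_scores_unlabeled: LLM-judged metric on N queries with no human labels
    """
    if len(human_scores) != len(llm_scores_labeled):
        raise ValueError("labeled human and LLM scores must be paired")

    # Rectifier: how far the LLM judge is off, on average, where truth is known.
    diffs = [h - f for h, f in zip(human_scores, llm_scores_labeled)]
    rectifier = statistics.fmean(diffs)

    # Point estimate: LLM mean on the unlabeled set, corrected by the rectifier.
    estimate = statistics.fmean(llm_scores_unlabeled) + rectifier

    # Standard error: the two terms come from independent samples, so add variances.
    n, big_n = len(diffs), len(llm_scores_unlabeled)
    variance = (statistics.variance(llm_scores_unlabeled) / big_n
                + statistics.variance(diffs) / n)
    return estimate, math.sqrt(variance)


if __name__ == "__main__":
    # Toy data (hypothetical): the LLM judge overestimates Precision@K by about 0.1.
    human = [0.4, 0.6, 0.8]
    llm_labeled = [0.5, 0.7, 0.9]
    llm_unlabeled = [0.5, 0.7, 0.9, 0.7]
    est, se = ppi_mean(human, llm_labeled, llm_unlabeled)
    print(f"PPI estimate: {est:.3f} (standard error {se:.3f})")
```

In this toy case the LLM-only mean on the unlabeled set is 0.7, while the rectifier shifts the estimate down by the bias observed on the human-labeled queries. When the LLM judge agrees closely with humans, the rectifier has low variance, which is where the variance reduction relative to using human labels alone comes from.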