Dealing with unjudged documents ("holes") in relevance assessments is a perennial problem when evaluating search systems with offline experiments. Holes can reduce the apparent effectiveness of retrieval systems during evaluation and introduce biases in models trained with incomplete data. In this work, we explore whether large language models can help us fill such holes to improve offline evaluations. We examine an extreme, albeit common, evaluation setting wherein only a single known relevant document per query is available for evaluation. We then explore various approaches for predicting the relevance of unjudged documents with respect to a query and the known relevant document, including nearest neighbor, supervised, and prompting techniques. We find that although the predictions of these One-Shot Labelers (1SLs) frequently disagree with human assessments, the labels they produce yield a far more reliable ranking of systems than the single labels do alone. Specifically, the strongest approaches can consistently reach system ranking correlations of over 0.85 with the full rankings over a variety of measures. Meanwhile, the approach substantially reduces the false positive rate of t-tests due to holes in relevance assessments (from 15-30% down to under 5%), giving researchers more confidence in results they find to be significant.
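As an illustration of the nearest-neighbor idea, the following is a minimal sketch of a one-shot labeler that fills holes by comparing each unjudged document to the single known relevant document. It is not the method used in this work: the term-frequency cosine similarity, the 0.5 threshold, the toy documents, and all function names (`cosine`, `one_shot_label`, `precision_at_k`) are assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between whitespace-tokenized term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def one_shot_label(known_relevant: str, candidate: str, threshold: float = 0.5) -> int:
    """Hypothetical nearest-neighbor labeler: 1 if the candidate is close to
    the known relevant document, else 0. The threshold is an assumption."""
    return int(cosine(known_relevant, candidate) >= threshold)


def precision_at_k(ranking, qrels, docs, known_rel_id, k, fill_holes):
    """Precision@k for one query. Unjudged documents (holes) count as
    non-relevant unless fill_holes is True, in which case they are labeled
    by the one-shot labeler."""
    hits = 0
    for doc_id in ranking[:k]:
        if doc_id in qrels:
            hits += qrels[doc_id]  # human judgment is available
        elif fill_holes:
            hits += one_shot_label(docs[known_rel_id], docs[doc_id])
        # otherwise the hole is treated as non-relevant
    return hits / k


# Toy collection (illustrative only): one judged relevant document, two holes.
docs = {
    "d1": "neural ranking models for passage retrieval",
    "d2": "neural models for passage ranking and retrieval",
    "d3": "recipes for chocolate cake",
}
qrels = {"d1": 1}               # the single known relevant document
ranking = ["d2", "d1", "d3"]    # a system's output for the query

print(precision_at_k(ranking, qrels, docs, "d1", k=2, fill_holes=False))
print(precision_at_k(ranking, qrels, docs, "d1", k=2, fill_holes=True))
```

In this toy case the system that ranks the unjudged near-duplicate `d2` first is penalized when holes are treated as non-relevant, and is credited once the hole is filled. A supervised or prompting-based labeler would replace `one_shot_label` while leaving the evaluation loop unchanged.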