Test collections are an integral part of Information Retrieval (IR) research. They allow researchers to evaluate and compare ranking algorithms in a quick, easy and reproducible way. However, constructing these datasets requires substantial effort in manual labelling and logistics, and having only a few human relevance judgements can introduce biases into the comparison. Recent research has explored the use of Large Language Models (LLMs) for labelling the relevance of documents when building new retrieval test collections. Their strong text-understanding capabilities and low cost compared to human-made judgements make them an appealing tool for gathering relevance judgements. Results suggest that LLM-generated labels are promising for IR evaluation in terms of ranking correlation, but little has been said about their implications for statistical significance. In this work, we examine whether LLM-generated judgements preserve the same pairwise significance outcomes as human judgements. Our results show that LLM judgements detect most of the significant differences while keeping the number of false positives at an acceptable level. However, we also show that some systems are treated differently under LLM-generated labels, suggesting that evaluation with LLM judgements might not be entirely fair. Our work represents a step forward in the evaluation of the statistical testing results produced with LLM judgements. We hope that it will serve as a basis for other researchers to develop reliable models for automatic relevance assessment.
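The pairwise significance evaluation mentioned above is commonly carried out with a paired t-test over per-topic effectiveness scores (e.g. nDCG) of two ranking systems. The following is a minimal sketch of that idea; the scores and the helper `paired_t_statistic` are illustrative assumptions, not data or code from this work.

```python
# Hypothetical sketch: paired t-test between two ranking systems,
# computed from per-topic effectiveness scores under one set of
# relevance judgements (human- or LLM-generated).
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over per-topic score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(len(diffs)))

# Illustrative per-topic scores (e.g. nDCG@10) for two systems.
system_a = [0.41, 0.55, 0.32, 0.60, 0.48, 0.37, 0.52, 0.44]
system_b = [0.35, 0.50, 0.30, 0.58, 0.40, 0.33, 0.47, 0.41]

t = paired_t_statistic(system_a, system_b)
# Two-tailed critical value for df = 7 at alpha = 0.05 is about 2.365.
print(f"t = {t:.2f}, significant = {abs(t) > 2.365}")
```

Running the same test on the same system pair under human and under LLM judgements, and comparing which pairs come out significant, yields the agreement/false-positive analysis the abstract refers to.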