The unjudged document problem, where systems that did not contribute to the original judgment pool may retrieve documents without a relevance judgment, is a key obstacle to the reusability of test collections in information retrieval. While the de facto standard for dealing with the problem is to treat unjudged documents as non-relevant, many alternatives have been proposed, such as the use of large language models (LLMs) as a relevance judge (LLM-as-a-judge). However, this approach has been criticized as circular, among other things, since the same LLM can be used as both the ranker and the judge. We propose to train topic-specific relevance classifiers instead: by fine-tuning monoT5 with independent LoRA weight adaptation on the judgments of a single assessor for a single topic's pool, we align it to that assessor's notion of relevance for the topic. The system rankings obtained through our classifier's relevance judgments achieve a Spearman's $\rho$ correlation of $>0.94$ with ground truth system rankings. As few as 128 initial human judgments per topic suffice to improve the comparability of models, compared to treating unjudged documents as non-relevant, while achieving more reliability than existing LLM-as-a-judge approaches. Topic-specific relevance classifiers are thus a lightweight and straightforward way to tackle the unjudged document problem, while maintaining human judgments as the gold standard for retrieval evaluation. Code, models, and data are made openly available.
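
To make the evaluation criterion concrete, the following is a minimal sketch of how Spearman's $\rho$ between two system rankings could be computed: one ranking derived from human judgments and one from classifier judgments. The system names and effectiveness scores are invented purely for illustration and are not results from this work; the helper names `average_ranks` and `spearman_rho` are likewise hypothetical.

```python
from statistics import mean


def average_ranks(scores):
    """Rank scores from 1 (highest) to n; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all systems tied with the one at `pos`.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        tied_rank = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = tied_rank
        pos = end + 1
    return ranks


def spearman_rho(a, b):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical per-system effectiveness scores (illustrative values only).
systems = ["bm25", "monoT5", "dense", "hybrid"]
human_scores = [0.42, 0.61, 0.55, 0.58]       # scored with human judgments
classifier_scores = [0.40, 0.63, 0.57, 0.56]  # scored with classifier judgments

rho = spearman_rho(human_scores, classifier_scores)
print(f"Spearman's rho = {rho:.2f}")
```

In this toy case two of the four systems swap places, so the correlation falls well below 1. A value above 0.94, as reported in the abstract, indicates that the classifier-based judgments reproduce the ground truth ordering of systems almost exactly.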