Using large language models (LLMs) to predict relevance judgments has shown promising results. Most studies treat this task as a distinct line of research, e.g., focusing on prompt design for predicting relevance labels given a query and a passage. However, predicting relevance judgments is essentially a form of relevance prediction, a problem extensively studied in tasks such as re-ranking. Despite this overlap, little research has explored reusing or adapting established re-ranking methods to predict relevance judgments, which risks wasted resources and redundant development. To bridge this gap, we reproduce re-rankers in a re-ranker-as-relevance-judge setup. We design two adaptation strategies: (i) using binary tokens (e.g., "true" and "false") generated by a re-ranker as direct judgments, and (ii) converting continuous re-ranking scores into binary labels via thresholding. We perform extensive experiments on TREC-DL 2019 to 2023 with 8 re-rankers from 3 model families, ranging from 220M to 32B parameters, and analyse the evaluation bias exhibited by re-ranker-based judges. Results show that re-ranker-based relevance judges, under both strategies, can outperform UMBRELA, a state-of-the-art LLM-based relevance judge, in around 40% to 50% of cases; they also exhibit strong self-preference towards their own and same-family re-rankers, as well as cross-family bias.
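To make the two adaptation strategies concrete, the sketch below assumes a monoT5-style pointwise re-ranker loaded via Hugging Face transformers; the checkpoint name, prompt template, and threshold value are illustrative assumptions, not the exact configuration evaluated in the paper.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Illustrative model choice: a monoT5-style pointwise re-ranker that scores a
# query-passage pair by generating a "true"/"false" token. The checkpoint is an
# assumption, not necessarily one of the 8 re-rankers studied in the paper.
MODEL_NAME = "castorini/monot5-base-msmarco"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).eval()


def judge_by_token(query: str, passage: str) -> int:
    """Strategy (i): take the binary token the re-ranker generates as the judgment."""
    prompt = f"Query: {query} Document: {passage} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=1)
    token = tokenizer.decode(output[0], skip_special_tokens=True).strip().lower()
    return 1 if token == "true" else 0


def rerank_score(query: str, passage: str) -> float:
    """Continuous re-ranking score: probability mass on "true" vs. "false"."""
    prompt = f"Query: {query} Document: {passage} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    decoder_input = torch.full((1, 1), model.config.decoder_start_token_id)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input).logits[0, -1]
    true_id = tokenizer.encode("true", add_special_tokens=False)[0]
    false_id = tokenizer.encode("false", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()


def judge_by_threshold(query: str, passage: str, threshold: float = 0.5) -> int:
    """Strategy (ii): binarize the continuous score; the 0.5 cut-off is a placeholder."""
    return 1 if rerank_score(query, passage) >= threshold else 0
```

In practice the threshold in strategy (ii) would presumably be tuned (e.g., on held-out judged data) rather than fixed at 0.5; the fixed value above is only a placeholder for illustration.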