Re-Rankers as Relevance Judges

Using large language models (LLMs) to predict relevance judgments has shown promising results. Most studies treat this task as a distinct research line, e.g., focusing on prompt design for predicting relevance labels given a query and passage. However, predicting relevance judgments is essentially a form of relevance prediction, a problem extensively studied in tasks such as re-ranking. Despite this overlap, little research has explored reusing or adapting established re-ranking methods to predict relevance judgments, which risks wasted resources and redundant development. To bridge this gap, we reproduce re-rankers in a re-ranker-as-relevance-judge setup. We design two adaptation strategies: (i) using binary tokens (e.g., "true" and "false") generated by a re-ranker as direct judgments, and (ii) converting continuous re-ranking scores into binary labels via thresholding. We perform extensive experiments on TREC-DL 2019 to 2023 with 8 re-rankers from 3 families, ranging from 220M to 32B parameters, and analyse the evaluation bias exhibited by re-ranker-based judges. Results show that re-ranker-based relevance judges, under both strategies, can outperform UMBRELA, a state-of-the-art LLM-based relevance judge, in around 40% to 50% of the cases; they also exhibit strong self-preference towards their own and same-family re-rankers, as well as cross-family bias.
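
The two adaptation strategies can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the use of raw "true"/"false" token logits as input, the default threshold of 0.5, and the choice of Cohen's kappa as the agreement measure against human labels.

```python
import math


def prob_true(true_logit: float, false_logit: float) -> float:
    """Softmax over the two token logits; returns P("true")."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)


def judge_from_tokens(true_logit: float, false_logit: float) -> int:
    """Strategy (i), sketched: label 1 (relevant) if "true" outscores "false"."""
    return 1 if true_logit > false_logit else 0


def judge_from_score(score: float, threshold: float = 0.5) -> int:
    """Strategy (ii), sketched: binarise a continuous score by thresholding.

    The threshold value is hypothetical; in practice it would need tuning.
    """
    return 1 if score >= threshold else 0


def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa between two binary label lists (illustrative metric)."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)  # chance agreement
    if expected == 1.0:  # both raters use a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up scores and labels, purely to show how the pieces connect.
    scores = [0.91, 0.12, 0.55, 0.40]
    human = [1, 0, 1, 0]
    predicted = [judge_from_score(s) for s in scores]
    print(predicted, cohen_kappa(predicted, human))
```

Under these assumptions the two strategies differ only in where the decision boundary comes from: strategy (i) uses the model's own token preference, while strategy (ii) needs an externally chosen threshold on the score.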
