Ranking documents using Large Language Models (LLMs) by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, success has been limited so far: researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets. We analyze the pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these ranking formulations, possibly because of how LLMs are trained. In this paper, we propose to significantly reduce the burden on LLMs with a new technique called Pairwise Ranking Prompting (PRP). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-source LLMs. On TREC-DL2020, PRP based on the Flan-UL2 model with 20B parameters outperforms the previous best approach in the literature, which is based on the black-box commercial GPT-4 with an estimated 50x larger model size, by over 5% at NDCG@1. On TREC-DL2019, PRP is inferior to the GPT-4 solution only on the NDCG@5 and NDCG@10 metrics, while outperforming other existing solutions, such as InstructGPT with 175B parameters, by over 10% on nearly all ranking metrics. Furthermore, we propose several variants of PRP to improve efficiency and show that competitive results are achievable even with linear complexity. We also discuss other benefits of PRP, such as supporting both generation and scoring LLM APIs and being insensitive to input ordering.
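To make the pairwise formulation concrete, the following is a minimal Python sketch of how pairwise preferences could be turned into a ranking. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the names `pairwise_score`, `rank_all_pairs`, `sliding_pass`, and `toy_prefer` are hypothetical, the LLM call is replaced by a word-overlap stand-in, and the choice to query each pair in both orders and give half credit on disagreement is one plausible way to reduce sensitivity to input ordering. `rank_all_pairs` uses a quadratic number of comparisons, while `sliding_pass` shows how a single pass keeps the number of comparisons linear in the number of documents.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical comparator: given (query, passage_a, passage_b), return "A" or "B".
# In a real system this would wrap an LLM call with a pairwise prompt.
Prefer = Callable[[str, str, str], str]


def pairwise_score(query: str, a: str, b: str, prefer: Prefer) -> float:
    """Credit for `a` against `b`: 1.0 if `a` wins in both presentation
    orders, 0.0 if it loses in both, 0.5 if the two orders disagree."""
    a_wins_first = prefer(query, a, b) == "A"   # `a` shown as Passage A
    a_wins_second = prefer(query, b, a) == "B"  # `a` shown as Passage B
    if a_wins_first and a_wins_second:
        return 1.0
    if not a_wins_first and not a_wins_second:
        return 0.0
    return 0.5


def rank_all_pairs(query: str, docs: List[str], prefer: Prefer) -> List[str]:
    """Score every unordered pair and sort by total credit (quadratic cost)."""
    scores = [0.0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        s = pairwise_score(query, docs[i], docs[j], prefer)
        scores[i] += s
        scores[j] += 1.0 - s
    # Break score ties by the original position to keep the sort deterministic.
    order = sorted(range(len(docs)), key=lambda k: (-scores[k], k))
    return [docs[k] for k in order]


def sliding_pass(query: str, docs: List[str], prefer: Prefer) -> List[str]:
    """One bottom-to-top pass of adjacent comparisons (linear cost).
    A document that beats everything it meets ends up in first place."""
    out = list(docs)
    for i in range(len(out) - 1, 0, -1):
        if pairwise_score(query, out[i], out[i - 1], prefer) > 0.5:
            out[i], out[i - 1] = out[i - 1], out[i]
    return out


def toy_prefer(query: str, a: str, b: str) -> str:
    """Stand-in for an LLM: prefer the passage sharing more words with the query."""
    q = set(query.lower().split())
    overlap_a = len(q & set(a.lower().split()))
    overlap_b = len(q & set(b.lower().split()))
    return "A" if overlap_a >= overlap_b else "B"


if __name__ == "__main__":
    query = "pairwise ranking prompting"
    docs = [
        "cooking pasta at home",
        "ranking documents",
        "pairwise ranking prompting with LLMs",
    ]
    print(rank_all_pairs(query, docs, toy_prefer))
    print(sliding_pass(query, docs, toy_prefer))
```

Replacing `toy_prefer` with a function that prompts an LLM to choose between two passages is all that is needed to try the same aggregation logic with a real model. Because `pairwise_score` issues two calls per pair, the call count is twice the number of pairs compared.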