Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks, including search. However, existing work uses the generative ability of LLMs for Information Retrieval (IR) rather than for direct passage ranking. The discrepancy between the pre-training objectives of LLMs and the ranking objective poses a further challenge. In this paper, we first investigate generative LLMs such as ChatGPT and GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal that properly instructed LLMs can deliver results competitive with, and even superior to, state-of-the-art supervised methods on popular IR benchmarks. Furthermore, to address concerns about data contamination of LLMs, we collect a new test set called NovelEval, which is based on the latest knowledge and aims to verify the model's ability to rank unknown knowledge. Finally, to improve efficiency in real-world applications, we explore the potential for distilling the ranking capabilities of ChatGPT into small specialized models using a permutation distillation scheme. Our evaluation shows that a distilled 440M model outperforms a 3B supervised model on the BEIR benchmark. The code to reproduce our results is available at www.github.com/sunnweiwei/RankGPT.
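To make the idea of instructing an LLM to rank passages directly more concrete, the following is a minimal sketch of the surrounding plumbing: formatting a query and candidate passages into one instruction, and turning the model's answer back into an ordering. The abstract does not specify the prompt wording or the output format, so the `[2] > [1]` identifier convention and the helper names `build_ranking_prompt`, `parse_permutation`, and `rerank` are illustrative assumptions, not the authors' implementation. No model is called here; the response is a stand-in string.

```python
import re
from typing import List


def build_ranking_prompt(query: str, passages: List[str]) -> str:
    """Format a query and numbered passages into one ranking instruction.

    The wording below is an assumption made for illustration only.
    """
    numbered = [f"[{i}] {text}" for i, text in enumerate(passages, start=1)]
    return (
        f"Rank the {len(passages)} passages below by relevance to the query.\n"
        f"Query: {query}\n"
        + "\n".join(numbered)
        + "\nAnswer with identifiers only, most relevant first, e.g. [2] > [1]."
    )


def parse_permutation(response: str, num_passages: int) -> List[int]:
    """Convert text such as '[2] > [3] > [1]' into zero-based indices.

    Out-of-range and repeated identifiers are dropped. Passages the model
    left out are appended in their original order, so the result is always
    a full permutation of range(num_passages).
    """
    order: List[int] = []
    for token in re.findall(r"\[(\d+)\]", response):
        idx = int(token) - 1
        if 0 <= idx < num_passages and idx not in order:
            order.append(idx)
    missing = [i for i in range(num_passages) if i not in order]
    return order + missing


def rerank(passages: List[str], response: str) -> List[str]:
    """Reorder passages according to the permutation found in the response."""
    return [passages[i] for i in parse_permutation(response, len(passages))]


if __name__ == "__main__":
    docs = ["passage about cats", "passage about ranking", "passage about LLMs"]
    prompt = build_ranking_prompt("LLM passage ranking", docs)
    fake_response = "[2] > [3] > [1]"  # stand-in for a model's answer
    print(rerank(docs, fake_response))
```

Padding the parsed order with any omitted passages is a design choice in this sketch: it keeps the output a valid permutation even when the model's answer is incomplete or malformed, which matters if the ordering is later used as a training signal for a smaller model.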