Information retrieval (IR) systems have played a vital role in modern digital life and have cemented their continued usefulness in this new era of generative AI via retrieval-augmented generation. With strong language processing capabilities and remarkable versatility, large language models (LLMs) have become popular choices for zero-shot re-ranking in IR systems. To date, LLM-based re-ranking methods have relied on strong generative capabilities, restricting their use to either specialized or powerful proprietary models. Given these restrictions, we ask: is autoregressive generation necessary and optimal for LLMs to perform re-ranking? We hypothesize that there are abundant signals relevant to re-ranking within LLMs that may not be used to their full potential via generation. To leverage such signals more directly, we propose in-context re-ranking (ICR), a novel method that exploits the change in attention patterns caused by the search query for accurate and efficient re-ranking. To mitigate the intrinsic biases in LLMs, we propose a calibration method using a content-free query. Because no generation is involved, ICR requires only two ($O(1)$) forward passes to re-rank $N$ documents, making it substantially more efficient than generative re-ranking methods that require at least $O(N)$ forward passes. Our novel design also enables ICR to be applied to any LLM without specialized training while guaranteeing a well-formed ranking. Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting latency by more than 60% in practice. Through detailed analyses, we show that ICR's performance is especially strong on tasks that require more complex re-ranking signals. Our findings call for further exploration of novel ways to utilize open-weight LLMs beyond text generation.
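To make the idea concrete, the following is a minimal, hypothetical sketch of attention-based scoring with content-free calibration. It is not the authors' implementation: it assumes an attention matrix has already been extracted from a single forward pass over the concatenated input `[doc_1, ..., doc_N, query]`, and it uses toy token-span bookkeeping (`doc_spans`, `query_span`) invented for illustration.

```python
# Hypothetical illustration of ICR-style scoring (not the paper's code).
# `attention[i][j]` is assumed to be the aggregated attention weight from
# token position i to token position j in the concatenated input
# [doc_1 ... doc_N, query].

def score_documents(attention, doc_spans, query_span):
    """Score each document by the total attention its tokens receive
    from the query tokens."""
    scores = []
    for start, end in doc_spans:
        s = sum(attention[q][t]
                for q in range(*query_span)
                for t in range(start, end))
        scores.append(s)
    return scores

def rank_with_calibration(attn_query, attn_content_free, doc_spans, query_span):
    """Subtract scores obtained with a content-free query (e.g. "N/A")
    to offset intrinsic biases, then rank documents by the calibrated
    score in descending order. This mirrors the two-forward-pass budget:
    one pass with the real query, one with the content-free query."""
    raw = score_documents(attn_query, doc_spans, query_span)
    base = score_documents(attn_content_free, doc_spans, query_span)
    calibrated = [r - b for r, b in zip(raw, base)]
    order = sorted(range(len(doc_spans)), key=lambda i: -calibrated[i])
    return order, calibrated
```

In practice the attention matrices would come from the model's attention outputs (e.g. averaged over heads and layers), but the ranking logic above shows why the cost stays at two forward passes regardless of $N$: all documents are scored from the same pair of passes.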