Zero-shot document re-ranking with Large Language Models (LLMs) has evolved from Pointwise methods to Listwise and Setwise approaches that improve computational efficiency. Despite their success, these methods rely predominantly on generative scoring or output logits, which creates bottlenecks in inference latency and result consistency. In-Context Re-ranking (ICR) has recently been proposed as an $O(1)$ alternative: it extracts internal attention signals directly, avoiding the overhead of text generation. However, existing ICR methods simply aggregate signals across all layers; layer-wise contributions and their consistency across architectures remain unexplored. Furthermore, no unified study has compared internal attention with traditional generative and likelihood-based mechanisms across diverse ranking frameworks under consistent conditions. In this paper, we conduct an orthogonal evaluation of generation, likelihood, and internal attention mechanisms across multiple ranking frameworks. We further identify a universal "bell-curve" distribution of relevance signals across transformer layers, which motivates our proposed Selective-ICR strategy, reducing inference latency by 30%-50% without compromising effectiveness. Finally, evaluation on the reasoning-intensive BRIGHT benchmark shows that precisely capturing high-quality in-context attention signals fundamentally reduces the need for model scaling and reinforcement learning: a zero-shot 8B model matches the performance of 14B reinforcement-learned re-rankers, while even a 0.6B model outperforms state-of-the-art generation-based approaches. These findings redefine the efficiency-effectiveness frontier for LLM-based re-ranking and highlight the latent potential of internal signals for complex reasoning-intensive ranking tasks. Our code and results are publicly available at https://github.com/ielab/Selective-ICR.
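To make the attention-based scoring idea concrete, the following is a minimal sketch of layer-selective, attention-based re-ranking. It is not the authors' implementation: the model name, prompt layout (documents followed by the query), layer band, and head/layer averaging are all illustrative assumptions; only the general mechanism (scoring documents by query-to-document attention mass from a subset of layers) follows the description above.

```python
# Illustrative sketch only: scores each document by the attention mass that
# query tokens place on its span, averaged over heads and a selected band of
# middle layers (assumed here; the paper's exact aggregation may differ).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder model; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

def rank_documents(query: str, docs: list[str], layer_band=(8, 16)):
    """Rank docs by attention flowing from query tokens to each doc's tokens."""
    # Build the prompt: documents first, query last, tracking each doc's span.
    doc_spans, parts, offset = [], [], 0
    for d in docs:
        ids = tok(d + "\n", add_special_tokens=False)["input_ids"]
        doc_spans.append((offset, offset + len(ids)))
        parts.extend(ids)
        offset += len(ids)
    query_ids = tok("Query: " + query, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor([parts + query_ids])
    q_start = offset  # query tokens occupy positions [q_start, end)

    with torch.no_grad():
        out = model(input_ids, output_attentions=True)

    # out.attentions: tuple over layers of (batch, heads, seq, seq).
    lo, hi = layer_band                              # assumed "informative" band
    selected = torch.stack(out.attentions[lo:hi])    # (L, 1, H, S, S)
    attn = selected.mean(dim=(0, 2))[0]              # average layers & heads -> (S, S)
    query_rows = attn[q_start:]                      # attention from query tokens

    scores = [query_rows[:, s:e].sum().item() for s, e in doc_spans]
    order = sorted(range(len(docs)), key=lambda i: -scores[i])
    return order, scores
```

Restricting the aggregation to a band of layers (rather than all layers) is what a "selective" variant would change relative to plain all-layer aggregation; the band boundaries here are arbitrary placeholders.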