Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever. This setup couples ranking behavior with retrieval quality, so differences in output cannot be attributed to the ranking policy alone. We introduce a controlled diagnostic that isolates reranking by using Multi-News clusters as fixed evidence pools: each pool is limited to exactly eight documents, and identical inputs are passed to all rankers. Within this setup, BM25 and MMR serve as interpretable reference points for lexical matching and diversity optimization, respectively. Across 345 clusters, we find that redundancy patterns vary by model: one LLM implicitly diversifies at larger selection budgets, while another grows more redundant. Across models, however, LLMs underperform on lexical coverage at small selection budgets. As a result, LLM rankings diverge substantially from both baselines rather than consistently approximating either strategy. Because retrieval variance is eliminated, these differences can be attributed directly to the ranking policy. The diagnostic is model-agnostic and applies to any ranker, including open-source systems and proprietary APIs.
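To make the diversity baseline concrete, the following is a minimal sketch of greedy Maximal Marginal Relevance (MMR) selection over a fixed evidence pool, in the spirit of the setup above. The similarity function (bag-of-words cosine), the trade-off parameter `lambda_ = 0.7`, and the toy pool and query are all illustrative assumptions, not details taken from the paper.

```python
# Sketch of greedy MMR over a fixed document pool (assumed parameters).
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def mmr_rank(query, pool, budget, lambda_=0.7):
    """Greedily pick `budget` documents, trading off relevance to the
    query against redundancy with already-selected documents."""
    q = Counter(tokenize(query))
    vecs = [Counter(tokenize(d)) for d in pool]
    selected, remaining = [], list(range(len(pool)))
    while remaining and len(selected) < budget:
        def score(i):
            rel = cosine(q, vecs[i])
            red = max((cosine(vecs[i], vecs[j]) for j in selected),
                      default=0.0)
            return lambda_ * rel - (1 - lambda_) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy pool: documents 0 and 1 are near-duplicates, so MMR avoids
# selecting both at a budget of two.
pool = [
    "storm damages coastal towns",
    "storm damages coastal towns and homes",
    "officials announce recovery funding",
    "schools reopen after the storm",
]
print(mmr_rank("storm damage recovery", pool, budget=2))
```

Because the pool is fixed and identical for every ranker, any difference between this baseline's output and an LLM's ordering reflects the ranking policy rather than retrieval noise.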