Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever. This setup couples ranking behavior with retrieval quality, so differences in output cannot be attributed to the ranking policy alone. We introduce a controlled diagnostic that isolates reranking by using Multi-News clusters as fixed evidence pools: each pool is limited to exactly eight documents, and identical inputs are passed to all rankers. Within this setup, BM25 and MMR serve as interpretable reference points for lexical matching and diversity optimization, respectively. Across 345 clusters, we find that redundancy patterns vary by model: one LLM implicitly diversifies at larger selection budgets, while another grows more redundant. Across models, however, LLMs underperform on lexical coverage at small selection budgets. As a result, LLM rankings diverge substantially from both baselines rather than consistently approximating either strategy. Because retrieval variance is eliminated, these differences can be attributed directly to the ranking policy. The diagnostic is model-agnostic and applies to any ranker, including open-source systems and proprietary APIs.
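To make the diversity baseline concrete, the following is a minimal sketch of greedy Maximal Marginal Relevance (MMR) selection over a fixed evidence pool, in the spirit of the setup above. The similarity function (bag-of-words cosine), the trade-off parameter `lambda_ = 0.7`, and the toy pool and query are all illustrative assumptions, not details taken from the paper.

```python
# Sketch of greedy MMR over a fixed document pool (assumed parameters).
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def mmr_rank(query, pool, budget, lambda_=0.7):
    """Greedily pick `budget` documents, trading off relevance to the
    query against redundancy with already-selected documents."""
    q = Counter(tokenize(query))
    vecs = [Counter(tokenize(d)) for d in pool]
    selected, remaining = [], list(range(len(pool)))
    while remaining and len(selected) < budget:
        def score(i):
            rel = cosine(q, vecs[i])
            red = max((cosine(vecs[i], vecs[j]) for j in selected),
                      default=0.0)
            return lambda_ * rel - (1 - lambda_) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy pool: documents 0 and 1 are near-duplicates, so MMR avoids
# selecting both at a budget of two.
pool = [
    "storm damages coastal towns",
    "storm damages coastal towns and homes",
    "officials announce recovery funding",
    "schools reopen after the storm",
]
print(mmr_rank("storm damage recovery", pool, budget=2))
```

Because the pool is fixed and identical for every ranker, any difference between this baseline's output and an LLM's ordering reflects the ranking policy rather than retrieval noise.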