Multivector Reranking in the Era of Strong First-Stage Retrievers

Learned multivector representations deliver strong retrieval effectiveness in modern search systems, but their real-world use is limited by the high cost of exhaustive token-level retrieval. Most systems therefore adopt a \emph{gather-and-refine} strategy, in which a lightweight gather phase selects candidates for full scoring. However, this approach still requires expensive searches over large token-level indexes and often misses the documents that would rank highest under full similarity. In this paper, we reproduce several state-of-the-art multivector retrieval methods on two publicly available datasets, providing a clear picture of the current multivector retrieval field and highlighting the inefficiency of token-level gathering. Building on these findings, we show that replacing the token-level gather phase with a single-vector document retriever -- specifically, a learned sparse retriever (LSR) -- produces a smaller and more semantically coherent candidate set. This recasts the gather-and-refine pipeline as the well-established two-stage retrieval architecture. As retrieval latency decreases, query encoding with two neural encoders becomes the dominant computational bottleneck. To mitigate this, we integrate recent inference-free LSR methods and demonstrate that they preserve the retrieval effectiveness of the dual-encoder pipeline while substantially reducing query encoding time. Finally, we investigate multiple reranking configurations that balance efficiency, memory, and effectiveness, and we introduce two optimization techniques that prune low-quality candidates early. Empirical results show that these techniques improve retrieval efficiency by up to $1.8\times$ with no loss in quality. Overall, our two-stage approach achieves a speedup of over $24\times$ relative to state-of-the-art multivector retrieval systems, while maintaining comparable or superior retrieval quality.
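To make the two-stage architecture concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that the first stage scores documents by the inner product of sparse term-weight vectors and that "full similarity" is the MaxSim late-interaction score commonly used for multivector models. All names (`sparse_score`, `maxsim`, `two_stage_search`) and the toy data are hypothetical, and the brute-force loop over documents stands in for a real inverted index.

```python
import numpy as np

def sparse_score(query: dict[int, float], doc: dict[int, float]) -> float:
    """Inner product of two sparse term-weight vectors (term id -> weight)."""
    return sum(w * doc.get(t, 0.0) for t, w in query.items())

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Assumed full similarity: for each query token, take the best-matching
    document token, then sum over query tokens."""
    sim = query_tokens @ doc_tokens.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

def two_stage_search(q_sparse, q_tokens, sparse_index, token_index,
                     k_first=2, k_final=1):
    """Stage 1: gather k_first candidates with the sparse retriever.
    Stage 2: rerank only those candidates with the multivector score."""
    candidates = sorted(
        sparse_index,
        key=lambda d: sparse_score(q_sparse, sparse_index[d]),
        reverse=True,
    )[:k_first]
    reranked = sorted(
        candidates,
        key=lambda d: maxsim(q_tokens, token_index[d]),
        reverse=True,
    )
    return reranked[:k_final]

# Toy collection (hypothetical data, for illustration only).
sparse_index = {
    "d1": {1: 1.0, 2: 0.5},
    "d2": {1: 0.8, 3: 1.0},
    "d3": {4: 2.0},
}
token_index = {
    "d1": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "d2": np.array([[1.0, 0.0]]),
    "d3": np.array([[0.0, 1.0]]),
}
q_sparse = {1: 1.0, 3: 0.5}
q_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])

top = two_stage_search(q_sparse, q_tokens, sparse_index, token_index)
```

In this toy run the sparse stage ranks `d2` above `d1`, and the multivector rerank reverses that order, so `top` is `["d1"]`. The expensive token-level scoring touches only the `k_first` candidates, which is the property the two-stage design relies on.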
