Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at https://github.com/dukesun99/ZipRerank.
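To make the two efficiency ideas and the distillation objective more concrete, the following is a minimal sketch and not the released ZipRerank implementation. It assumes that early interaction can be approximated by keeping the visual tokens most similar to a query embedding, that the reranker emits one scalar score per candidate in a single pass, and that soft-ranking supervision is a KL divergence between teacher and student score distributions. The names `select_tokens`, `listwise_kl`, and `rank`, as well as the dot-product similarity and the temperature parameter, are hypothetical choices made for illustration.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def select_tokens(query_vec, token_vecs, keep):
    """Assumed stand-in for early interaction: keep the `keep` visual tokens
    with the highest dot-product similarity to the query embedding.
    Returns the kept token indices in their original order."""
    sims = [sum(q * t for q, t in zip(query_vec, tok)) for tok in token_vecs]
    top = sorted(range(len(token_vecs)), key=lambda i: sims[i], reverse=True)[:keep]
    return sorted(top)


def listwise_kl(teacher_scores, student_scores, temperature=1.0):
    """Assumed soft-ranking loss: KL(teacher || student) over the candidate list."""
    log_p = log_softmax(teacher_scores, temperature)
    log_q = log_softmax(student_scores, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def rank(scores):
    """Turn one score per candidate (from a single forward pass) into a ranking,
    best candidate first. No autoregressive decoding is involved."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy data: a 2-d query embedding and four visual-token embeddings.
    query = [1.0, 0.0]
    tokens = [[0.1, 5.0], [0.9, 0.0], [0.5, 0.0], [-1.0, 0.0]]
    print("kept tokens:", select_tokens(query, tokens, keep=2))

    teacher = [2.0, 1.0, 0.0]   # hypothetical VLM-teacher scores
    student = [1.5, 1.2, -0.3]  # hypothetical student scores
    print("ranking:", rank(student))
    print("distillation loss:", listwise_kl(teacher, student))
```

In this sketch the loss is zero when the student reproduces the teacher's score distribution and grows as the two rankings diverge, which is the behavior a soft-ranking objective is expected to have. The actual token-reduction mechanism and loss used by ZipRerank may differ.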