Very Efficient Listwise Multimodal Reranking for Long Documents

Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent rerankers based on vision-language models (VLMs) achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with soft-ranking supervision distilled from a VLM teacher. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at https://github.com/dukesun99/ZipRerank.
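
The abstract does not give the training objective, so the following is only a minimal sketch of what soft-ranking distillation over a candidate list could look like. It assumes the student emits one scalar score per candidate in a single pass and that the teacher supplies a score per candidate as well. The KL-divergence form, the `temperature` parameter, and all function names are illustrative assumptions, not the ZipRerank implementation.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_distill_loss(student_scores, teacher_scores, temperature=1.0):
    """KL(teacher || student) over one candidate list.

    Assumption: both inputs hold one scalar score per candidate page.
    This is a generic listwise distillation loss, not the paper's.
    """
    p = softmax(teacher_scores, temperature)  # soft ranking target
    q = softmax(student_scores, temperature)  # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def rank_candidates(scores):
    """Return candidate indices sorted by descending score (no decoding loop)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Under these assumptions, ranking at inference time is just a sort over the scores from one forward pass, for example `rank_candidates([0.1, 2.0, -1.0])`, which is where the latency saving over token-by-token decoding of a ranked list would come from.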
