Multimodal large language models (MLLMs) have recently shown strong potential as point-wise rerankers by directly modeling query--document relevance through next-token prediction. However, point-wise reranking repeats a substantial amount of computation across query--document pairs, and the causal structure of transformers allows only prefix segments to be reused via pre-caching. Existing query-first and document-first formats are misaligned with both VQA-style prompting and computation-aware reuse, so we propose a \textit{vision-first} formulation that improves both cache reuse efficiency and reranking performance. The remaining cost is still considerable and stems from three main sources: (1) \textit{model depth}, which we address by reducing active parameters via early exit; (2) \textit{cross-segment attention}, which we restrict to a narrow interaction band across a few layers; and (3) \textit{visual tokens}, whose number we reduce via embedder-guided pruning. Together, these designs form miniReranker, which reduces reranking runtime for a single query to <1% of that of the dense implementation under high-reuse settings, while preserving >96% of the dense model's performance.
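As a rough illustration of the visual-token pruning step only, the sketch below keeps the visual tokens that score highest against a guide vector and preserves their original order. The abstract does not specify the scoring rule, so cosine similarity to a single guide embedding is an assumption made here for illustration, and the names `cosine`, `prune_visual_tokens`, `guide`, and `keep` are hypothetical rather than taken from miniReranker.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_visual_tokens(tokens, guide, keep):
    """Keep the `keep` tokens most similar to `guide`.

    Assumption: an embedder supplies `guide`; the real scoring rule may differ.
    Returns the kept indices (in original sequence order) and the kept tokens.
    """
    scores = [cosine(t, guide) for t in tokens]
    # Rank token indices from highest to lowest score.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore sequence order so positional structure is not shuffled.
    kept = sorted(ranked[:keep])
    return kept, [tokens[i] for i in kept]


# Toy 2-dimensional "visual tokens" and a guide vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
guide = [1.0, 0.0]
kept_idx, kept_tokens = prune_visual_tokens(tokens, guide, keep=2)
```

In this toy case the first and third tokens are retained. A real system would operate on batched tensors and would need to keep position information consistent with any pre-cached prefix, which this sketch does not model.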