Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive index vector size overheads. Training-free pruning solutions (e.g., EOS-attention based methods) can reduce index vector size by approximately 60% without model adaptation, but often underperform random selection in high-compression scenarios (> 80%). Prior research (e.g., Light-ColPali) attributes this to visual token importance being inherently query-dependent, thereby questioning the feasibility of training-free pruning. In this work, we propose Structural Anchor Pruning (SAP), a training-free pruning method that identifies key visual patches from middle layers to achieve high-performance compression. We also introduce the Oracle Score Retention (OSR) protocol to evaluate how layer-wise information affects compression efficiency. Evaluations on the ViDoRe benchmark demonstrate that SAP reduces index vectors by over 90% while maintaining robust retrieval fidelity, providing a highly scalable solution for Visual RAG. Furthermore, our OSR-based analysis reveals that semantic structural anchor patches persist in the middle layers, in contrast to the final layer targeted by traditional pruning solutions, where structural signals dissipate.
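To make the index-compression setting concrete, the sketch below shows a generic training-free pruning step for a multi-vector page representation: given per-patch embeddings taken from some intermediate layer, rank patches by a saliency score and keep only the top fraction. This is a minimal illustration, not the paper's actual SAP algorithm; the scoring function used here (cosine similarity to the mean patch embedding, standing in for an anchor-based score) and the function name `prune_patch_vectors` are assumptions for demonstration.

```python
import numpy as np

def prune_patch_vectors(patch_embs: np.ndarray, keep_ratio: float = 0.1):
    """Illustrative training-free pruning (NOT the paper's SAP method).

    Keeps the top-k patch vectors ranked by a stand-in saliency score:
    cosine similarity of each patch to the mean patch embedding.

    patch_embs: (num_patches, dim) middle-layer patch embeddings.
    keep_ratio: fraction of vectors retained (0.1 -> ~90% reduction).
    """
    # Normalize rows so dot products become cosine similarities.
    normed = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    anchor = normed.mean(axis=0)
    anchor = anchor / np.linalg.norm(anchor)
    scores = normed @ anchor                      # one saliency score per patch
    k = max(1, int(len(patch_embs) * keep_ratio)) # number of vectors to keep
    keep = np.argsort(scores)[-k:]                # indices of top-scoring patches
    return patch_embs[keep], keep

# Example: compress a 1030-patch page representation to ~10% of its vectors,
# which is the >90% index reduction regime discussed in the abstract.
embs = np.random.randn(1030, 128).astype(np.float32)
pruned, kept_idx = prune_patch_vectors(embs, keep_ratio=0.1)
```

In a real pipeline, only the pruned vectors would be stored in the index, and late-interaction scoring (as in ColPali) would run against this reduced set at query time.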