Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive multi-vector index storage overhead. Existing training-free pruning methods either rely on heuristic layer choices or degrade sharply under aggressive compression, leading prior work to argue that effective high-compression pruning requires query-dependent training. We challenge this view with Structural Anchor Pruning (SAP), a self-calibrating, training-free, and query-agnostic index-time pruning framework with three components: (i) Score Retention (SR), a white-box per-layer compression diagnostic; (ii) SR-guided window selection, a procedure that automatically locates the structural pruning region for any backbone with no per-model hyperparameters; and (iii) a visual in-degree centrality scorer that identifies anchor patches within the selected window. On the ViDoRe v1/v2 benchmarks across three architectures spanning 18, 28, and 36 backbone layers, SAP retains over 90\% of NDCG@5 while pruning more than 90\% of visual tokens, without any per-model parameter tuning. Our layer-resolved SR analysis reveals an Alignment-Aggregation Divergence: the document's visual structure is preserved as a stable ``Structural Plateau'' within the backbone, but the final layers reshape this representation into a sparse, query-aligned form that is no longer suitable for pruning. This is the mechanistic reason SAP succeeds where final-layer methods fail.
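To make component (iii) concrete, the following is a minimal sketch of in-degree centrality scoring over visual patches. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how attention is aggregated, so the sketch assumes that attention maps from the selected layer window are available as an array of shape (layers, heads, tokens, tokens), that a patch's centrality is the attention it receives from other visual tokens averaged over layers and heads, and that a fixed keep ratio decides how many anchors survive. The names `in_degree_scores`, `select_anchors`, `visual_idx`, and `keep_ratio` are hypothetical.

```python
import numpy as np


def in_degree_scores(attn, visual_idx):
    """Score each visual patch by the attention it receives from visual tokens.

    attn: array of shape (layers, heads, tokens, tokens); attn[l, h, i, j] is the
          weight with which token i attends to token j (assumed layout).
    visual_idx: 1-D integer array with the positions of the visual tokens.
    Returns an array of shape (len(visual_idx),).
    """
    attn = np.asarray(attn, dtype=np.float64)
    # Keep only visual-to-visual attention: rows and columns of visual tokens.
    sub = attn[:, :, visual_idx][:, :, :, visual_idx]
    # In-degree: sum over attending tokens (rows) for every attended patch (column).
    in_degree = sub.sum(axis=2)
    # Assumed aggregation: plain average over the layer window and the heads.
    return in_degree.mean(axis=(0, 1))


def select_anchors(scores, keep_ratio):
    """Return sorted positions of the highest-scoring patches (at least one)."""
    k = max(1, int(round(keep_ratio * len(scores))))
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers, heads, tokens, dim = 4, 2, 12, 8
    # Toy row-normalised attention maps standing in for a real backbone's output.
    logits = rng.normal(size=(layers, heads, tokens, tokens))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    visual_idx = np.arange(2, tokens)  # assume the first two tokens are not patches
    patch_embeddings = rng.normal(size=(len(visual_idx), dim))

    scores = in_degree_scores(attn, visual_idx)
    keep = select_anchors(scores, keep_ratio=0.1)  # prune about 90% of patches
    pruned_index = patch_embeddings[keep]
    print(pruned_index.shape)
```

Because the scores depend only on the document's own attention maps, this selection can run once at index time with no query in the loop, which is the query-agnostic property the abstract emphasizes. Choosing which layers feed `attn` is the job of the SR-guided window selection and is not modeled here.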