Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate inference has emerged as a hot topic. However, most existing methods rely on heuristics built on inter-visual-token similarity or cross-modal visual-text similarity, which limits both compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives: it recasts visual token compression as preserving output invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered under the guidance of token-level gradient saliency produced by our layer-local proxy loss, a coarse constraint linking the current layer to the final output. The most valuable vision tokens are then selected following the non-maximum suppression (NMS) principle. PIO-FVLM is training-free and compatible with FlashAttention, making it friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches such as VisionZip as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens while maintaining 97.2% of the original performance, with a 2.67$\times$ prefill speedup, 2.11$\times$ inference speedup, 6.22$\times$ lower FLOPs, and 6.05$\times$ reduced KV Cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.
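To make the selection step concrete, the saliency-guided, NMS-style token filtering described above can be sketched as follows. This is a minimal illustration only: the function name, the use of cosine similarity as the suppression criterion, and the threshold value are all hypothetical stand-ins, not the actual PIO-FVLM implementation.

```python
import numpy as np

def select_tokens_nms(tokens, saliency, keep, sim_thresh=0.95):
    """Greedy NMS-style selection (illustrative, not the paper's exact method).

    Visit tokens in descending saliency order; keep a token only if its
    cosine similarity to every already-kept token is below `sim_thresh`.
    Stop once `keep` tokens are retained.
    """
    # Row-normalize so dot products equal cosine similarities.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    order = np.argsort(-saliency)  # highest saliency first
    kept = []
    for i in order:
        if kept and np.max(normed[kept] @ normed[i]) >= sim_thresh:
            continue  # suppressed: too similar to a kept token
        kept.append(int(i))
        if len(kept) == keep:
            break
    # If suppression left too few tokens, fall back to saliency order.
    if len(kept) < keep:
        extras = [int(i) for i in order if int(i) not in kept]
        kept.extend(extras[: keep - len(kept)])
    return sorted(kept)
```

In this sketch, a highly salient but near-duplicate token is dropped in favor of a less salient yet more distinctive one, which is the intuition behind applying the NMS principle to token compression.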