Large vision-language models (LVLMs) excel at visual understanding but face efficiency challenges due to the quadratic complexity of processing long multi-modal contexts. While token compression can reduce computational costs, existing approaches are designed for single-view LVLMs and fail to consider the unique multi-view characteristics of high-resolution LVLMs (HR-LVLMs) with dynamic cropping. Existing methods treat all tokens uniformly, but our analysis reveals that global thumbnails can naturally guide the compression of local crops by providing holistic context for informativeness evaluation. In this paper, we first analyze the dynamic cropping strategy, revealing both the complementary nature between thumbnails and crops and the distinctive characteristics across different crops. Based on these observations, we propose ``Global Compression Commander'' (\textit{i.e.}, \textbf{GlobalCom$^2$}), a novel plug-and-play token compression framework for HR-LVLMs. GlobalCom$^2$ leverages the thumbnail as the ``commander'' to guide the compression of local crops, adaptively preserving informative details while eliminating redundancy. Extensive experiments show that GlobalCom$^2$ maintains over \textbf{90\%} of performance while compressing \textbf{90\%} of visual tokens, reducing FLOPs and peak memory to \textbf{9.1\%} and \textbf{60\%}, respectively.
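The thumbnail-as-commander idea can be sketched as follows: score each local-crop token by its similarity to a global context vector derived from the thumbnail, then keep only the top fraction. This is a minimal illustrative sketch assuming mean-pooled thumbnail features and cosine similarity as the informativeness score; the paper's actual scoring and allocation rules may differ.

```python
import numpy as np

def thumbnail_guided_compress(thumb_tokens, crop_tokens, keep_ratio=0.1):
    """Keep the crop tokens most informative w.r.t. the thumbnail's
    global context.  Hypothetical sketch of thumbnail-guided token
    compression, not the paper's exact method.

    thumb_tokens: (m, d) thumbnail token features
    crop_tokens:  (n, d) local-crop token features
    """
    # Global context vector from the thumbnail (mean-pooled).
    global_ctx = thumb_tokens.mean(axis=0)                      # (d,)
    # Cosine similarity of every crop token to the global context.
    num = crop_tokens @ global_ctx                              # (n,)
    den = (np.linalg.norm(crop_tokens, axis=1)
           * np.linalg.norm(global_ctx) + 1e-8)
    scores = num / den
    # Retain the top keep_ratio fraction, preserving spatial order.
    k = max(1, int(keep_ratio * len(crop_tokens)))
    keep = np.sort(np.argsort(scores)[-k:])
    return crop_tokens[keep], keep
```

With `keep_ratio=0.1`, this reproduces the paper's headline setting of compressing 90% of the visual tokens while retaining those most aligned with the holistic scene context.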