Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introducing positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens from a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budget by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET integrates seamlessly with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at https://github.com/MaQianKun0/ApET.
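The core idea can be sketched with a minimal NumPy example. This is an illustrative reading of the abstract only, not the authors' implementation: the basis-selection rule is unspecified here (a uniform stride is assumed), and we assume that tokens well reconstructed by the basis (low approximation error) are treated as redundant and dropped.

```python
import numpy as np

def approximation_error_compress(tokens, num_basis, keep):
    """Hypothetical sketch of approximation-error guided token dropping.

    tokens: (N, d) array of visual token features.
    num_basis: number of basis tokens used for linear reconstruction.
    keep: number of tokens to retain after compression.
    """
    n, d = tokens.shape
    # Assumed basis selection: uniform stride over the sequence
    # (the paper's actual selection criterion is not given in the abstract).
    basis_idx = np.linspace(0, n - 1, num_basis).astype(int)
    basis = tokens[basis_idx]                        # (k, d)

    # Least-squares linear approximation: find coefficients C (k, N)
    # such that basis.T @ C approximates tokens.T.
    coeffs, *_ = np.linalg.lstsq(basis.T, tokens.T, rcond=None)
    recon = (basis.T @ coeffs).T                     # (N, d) reconstruction

    # Per-token approximation error; low error means the token is
    # well represented by the basis and thus carries little extra info.
    err = np.linalg.norm(tokens - recon, axis=1)

    # Keep the tokens with the largest residual error, in original order.
    keep_idx = np.sort(np.argsort(err)[-keep:])
    return tokens[keep_idx], keep_idx
```

Because this scoring uses only a linear solve over token features and no attention maps, the retained tokens can be fed directly into attention kernels such as FlashAttention without modification.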