The computational demands of Vision Transformers (ViTs) and Vision-Language Models (VLMs) remain a significant challenge due to the quadratic complexity of self-attention. While token pruning offers a promising solution, existing methods often introduce training overhead or fail to adapt dynamically across layers. We present SAINT, a training-free token pruning framework that leverages token similarity and a graph-based formulation to dynamically optimize pruning rates and redundancy thresholds. Through systematic analysis, we identify a universal three-stage token evolution process (aligner-explorer-aggregator) in transformers, enabling aggressive pruning in early stages without sacrificing critical information. For ViTs, SAINT doubles the throughput of ViT-H/14 at 224px with only 0.6% accuracy loss on ImageNet-1K, surpassing the closest competitor by 0.8%. For VLMs, we apply SAINT in three modes: ViT-only, LLM-only, and hybrid. SAINT reduces LLaVA-13B's tokens by 75%, achieving latency comparable to LLaVA-7B with less than 1% performance loss across benchmarks. Our work establishes a unified, practical framework for efficient inference in ViTs and VLMs.