Accelerating Transformers with Spectrum-Preserving Token Merging

Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVa), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior works have proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top-k most similar tokens. However, these methods have significant drawbacks, such as sensitivity to token-splitting strategies and damage to informative tokens in later layers. This paper presents a novel paradigm called PiToMe, which prioritizes the preservation of informative tokens using an additional metric termed the energy score. This score identifies large clusters of similar tokens as high-energy, indicating potential candidates for merging, while smaller (unique and isolated) clusters are considered low-energy and preserved. Experimental findings demonstrate that PiToMe saves 40-60% of the base models' FLOPs while exhibiting superior off-the-shelf performance on image classification (0.5% average performance drop for ViT-MAE-H compared to 2.6% for baselines), image-text retrieval (0.3% average performance drop for CLIP on Flickr30k compared to 4.5% for other methods), and analogously on visual question answering with LLaVa-7B. Furthermore, PiToMe is theoretically shown to preserve intrinsic spectral properties of the original token space under mild conditions.
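To make the idea of energy-guided merging concrete, the following is a minimal sketch, not the paper's algorithm. The abstract does not give the energy score formula, so the sketch assumes a stand-in: a token's energy is its mean cosine similarity to all other tokens, so members of large similar clusters score high. The function names (`cosine`, `energy_scores`, `merge_high_energy`), the alternating split of candidates into two sets, and the averaging merge rule are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def energy_scores(tokens):
    """Assumed energy: mean cosine similarity of each token to every other token."""
    n = len(tokens)
    return [
        sum(cosine(tokens[i], tokens[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def merge_high_energy(tokens, k):
    """Reduce the token count by k, merging only among the 2k highest-energy tokens.

    Low-energy (isolated) tokens are returned first, unchanged and in their
    original order; merged tokens follow. Requires 1 <= k and 2 * k <= len(tokens).
    """
    n, dim = len(tokens), len(tokens[0])
    energy = energy_scores(tokens)
    order = sorted(range(n), key=lambda i: energy[i], reverse=True)
    candidates, protected = order[:2 * k], order[2 * k:]

    # Bipartite split of the merge candidates (alternating assignment is an assumption).
    set_a, set_b = candidates[0::2], candidates[1::2]

    # Each token in A is absorbed by its most similar token in B.
    groups = {b: [b] for b in set_b}
    for a in set_a:
        best_b = max(set_b, key=lambda j: cosine(tokens[a], tokens[j]))
        groups[best_b].append(a)

    # Merge each group by plain averaging (assumed merge rule).
    merged = [
        [sum(tokens[i][d] for i in members) / len(members) for d in range(dim)]
        for members in groups.values()
    ]
    kept = [tokens[i] for i in sorted(protected)]
    return kept + merged
```

Under these assumptions, a call such as `merge_high_energy(tokens, k)` returns `len(tokens) - k` tokens: every candidate in one set is folded into a partner in the other set, while the lowest-energy tokens pass through untouched. Restricting the bipartite matching to high-energy candidates is what distinguishes this sketch from plain BSM, which would consider all tokens for merging.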
