Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token compression method called Prune and Merge. Our approach integrates token pruning and merging operations within transformer models to achieve layer-wise token compression. By introducing trainable merge and reconstruct matrices and utilizing shortcut connections, we efficiently merge tokens while preserving important information and enabling the restoration of pruned tokens. Additionally, we introduce a novel gradient-weighted attention scoring mechanism that computes token importance scores during the training phase, eliminating the need for separate computations during inference and enhancing compression efficiency. We also leverage gradient information to capture the global impact of tokens and to automatically identify optimal compression structures. Extensive experiments on the ImageNet-1k and ADE20K datasets validate the effectiveness of our approach, which achieves significant speed-ups with minimal accuracy degradation compared to state-of-the-art methods. For instance, on DeiT-Small, we achieve a 1.64$\times$ speed-up with only a 0.2\% drop in accuracy on ImageNet-1k. Moreover, by compressing Segmenter models for semantic segmentation and comparing against existing methods, we demonstrate the superior efficiency and effectiveness of our approach. Code and models are available at https://github.com/NUST-Machine-Intelligence-Laboratory/prune_and_merge.
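To make the merge-and-reconstruct idea concrete, the following is a minimal numerical sketch, not the paper's implementation: a merge matrix combines N input tokens into K compressed tokens, and a reconstruct matrix maps them back to the original length so pruned tokens can be approximately restored through the shortcut connection. All shapes, initializations, and variable names here are illustrative assumptions; in the actual method these matrices are trainable and learned end-to-end.

```python
import numpy as np

np.random.seed(0)
N, d, K = 8, 16, 4  # N input tokens of dim d, K kept tokens (hypothetical sizes)

tokens = np.random.randn(N, d)  # one layer's token sequence

# Merge matrix: each of the K output tokens is a softmax-normalized weighted
# combination of the N input tokens. Randomly initialized here for
# illustration; in the method these weights would be trained.
merge_logits = np.random.randn(K, N)
M = np.exp(merge_logits) / np.exp(merge_logits).sum(axis=1, keepdims=True)

merged = M @ tokens  # (K, d): compressed token sequence fed to later layers

# Reconstruct matrix: maps the K merged tokens back to length N, so the
# information of pruned tokens can be approximately restored.
R = np.random.randn(N, K) * 0.1
restored = R @ merged  # (N, d): restored full-length sequence

print(merged.shape, restored.shape)
```

Because both operations are plain dense matrix multiplications, this kind of compression stays hardware-friendly: no gather/scatter of irregular token subsets is required at inference time.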