Despite the strong performance of Transformers, their quadratic computational complexity makes them challenging to apply to vision tasks. Automatic pruning is an effective way to reduce computational complexity without relying on hand-crafted heuristics. However, applying it directly to multi-head attention is not straightforward because of channel misalignment. In this paper, we propose an automatic channel pruning method that takes the multi-head attention mechanism into account. First, we incorporate channel similarity-based weights into the pruning indicator to preserve the more informative channels in each head. Then, we adjust the pruning indicator to enforce the removal of channels in equal proportions across all heads, which prevents channel misalignment. We also add a reweight module to compensate for the information lost through channel removal, and an effective initialization step for the pruning indicator based on the difference in attention between the original structure and each channel. The proposed method applies not only to the original attention but also to linear attention, which is more efficient because its complexity is linear in the number of tokens. On ImageNet-1K, applying our pruning method to the FLattenTransformer, which includes both attention mechanisms, yields higher accuracy at several MACs budgets than previous state-of-the-art efficient models and pruning methods. Code will be available soon.
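The equal-proportion constraint can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `per_head_keep_mask` and `prune_heads`, the `keep_ratio` parameter, and the plain score matrix standing in for the learned pruning indicator are all assumptions made for illustration. The sketch only shows why keeping the same number of channels in every head leaves the per-head tensors aligned.

```python
import numpy as np

def per_head_keep_mask(scores, keep_ratio):
    """Keep the same number of top-scoring channels in every head.

    scores: (num_heads, head_dim) array of per-channel importance
            (a stand-in for a learned pruning indicator; assumption).
    Returns a boolean mask of the same shape.
    """
    num_heads, head_dim = scores.shape
    keep = max(1, int(round(head_dim * keep_ratio)))
    # Rank channels inside each head independently, highest score first.
    top = np.argsort(-scores, axis=1)[:, :keep]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask

def prune_heads(x, mask):
    """Drop masked-out channels from x of shape (tokens, num_heads, head_dim).

    Because every head keeps the same count, the result is still a
    regular (tokens, num_heads, keep) array with no misalignment.
    """
    idx = np.stack([np.flatnonzero(m) for m in mask])  # (num_heads, keep)
    return np.take_along_axis(x, idx[None, :, :], axis=2)

if __name__ == "__main__":
    scores = np.array([[0.9, 0.1, 0.8, 0.2],
                       [0.01, 0.02, 0.03, 0.04]])
    mask = per_head_keep_mask(scores, keep_ratio=0.5)
    x = np.arange(2 * 2 * 4, dtype=float).reshape(2, 2, 4)
    print(mask)
    print(prune_heads(x, mask).shape)
```

Note that a single global top-k over all scores in this toy example would keep only channels from the first head, which is the kind of uneven per-head width that the equal-proportion rule avoids.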