Vision Transformers (ViTs) have achieved remarkable success across a wide range of vision tasks, yet their deployment is often hindered by prohibitive computational costs. Structured weight pruning and token compression have emerged as promising remedies, but the former suffers from prolonged retraining, while the latter's global propagation of compressed tokens creates optimization challenges. We propose ToaSt, a decoupled framework that applies specialized compression strategies to distinct ViT components. For Multi-Head Self-Attention modules, we apply coupled head-wise structured pruning, leveraging the characteristics of the attention operation to enhance robustness. For Feed-Forward Networks, which account for over 60\% of total FLOPs, we introduce Token Channel Selection (TCS), which improves compression ratios while avoiding global-propagation issues. Our analysis reveals that TCS effectively filters redundant noise during selection. Extensive evaluations on nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves a superior accuracy-efficiency trade-off, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52\% accuracy (+1.64\%) with a 39.4\% FLOPs reduction. ToaSt also transfers effectively to downstream tasks, achieving 52.2 mAP versus 51.9 mAP on COCO object detection. Code and models will be released upon acceptance.