Vision Transformers (ViTs) have achieved remarkable success across a wide range of vision tasks, yet their deployment is often hindered by prohibitive computational cost. Structured weight pruning and token compression have emerged as promising remedies, but the former requires prolonged retraining and the latter introduces inter-layer dependencies that complicate optimization. We propose ToaSt, a decoupled framework that applies a specialized strategy to each ViT component. For Multi-Head Self-Attention modules, we apply coupled head-wise structured pruning, leveraging the characteristics of the attention operation to enhance robustness. For Feed-Forward Networks, which account for over 60% of FLOPs, we introduce Token Channel Selection (TCS), a training-free method that filters redundant noise channels at inference time. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, show that ToaSt achieves better accuracy-efficiency trade-offs than existing baselines. On ViT-MAE-Huge, ToaSt reaches 88.52% accuracy (+1.64 percentage points) with a 39.4% reduction in FLOPs. ToaSt also transfers effectively to diverse downstream tasks (COCO detection, ADE20K segmentation, CIFAR-100 classification), reaching 52.2 mAP on COCO versus 51.9 for the baseline. Code: github.com/SHANNonLab-HUFS/ToaSt
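To make the FLOPs split between attention and feed-forward layers concrete, the following is a minimal back-of-the-envelope sketch, not the authors' implementation. It counts multiply-accumulates (MACs) for one standard ViT encoder block, ignoring LayerNorm, softmax, biases, and activations. The function name `block_macs`, the DeiT-Base-like shape, and the keep ratios (9 of 12 heads, 1536 of 3072 FFN channels) are illustrative assumptions and do not come from the paper.

```python
def block_macs(num_tokens, embed_dim, num_heads, head_dim, ffn_hidden):
    """Approximate multiply-accumulates (MACs) for one ViT encoder block.

    Ignores LayerNorm, softmax, biases, and activation functions.
    Returns a tuple (mhsa_macs, ffn_macs).
    """
    inner = num_heads * head_dim                   # total width of Q, K, or V
    qkv = 3 * num_tokens * embed_dim * inner       # Q, K, V projections
    scores = num_tokens * num_tokens * inner       # Q @ K^T summed over heads
    mix = num_tokens * num_tokens * inner          # attention weights @ V
    out = num_tokens * inner * embed_dim           # output projection
    mhsa = qkv + scores + mix + out
    ffn = 2 * num_tokens * embed_dim * ffn_hidden  # two linear layers
    return mhsa, ffn


# Assumed DeiT-Base-like shape: 196 patches + 1 class token, 12 heads of width 64.
mhsa, ffn = block_macs(197, 768, 12, 64, 3072)
ffn_share = ffn / (mhsa + ffn)

# Hypothetical compression: keep 9 of 12 heads and 1536 of 3072 FFN channels.
mhsa_kept, ffn_kept = block_macs(197, 768, 9, 64, 1536)
reduction = 1 - (mhsa_kept + ffn_kept) / (mhsa + ffn)

print(f"FFN share: {ffn_share:.1%}, block MAC reduction: {reduction:.1%}")
```

Under this simplified counting, the feed-forward layers dominate the per-block cost, which is consistent with the "over 60% of FLOPs" figure. Because removing a head shrinks the Q, K, V, and output projections together, and removing an FFN channel shrinks both linear layers, the two components can be compressed independently of each other.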