Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase, without manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates that prune entire attention heads and FFN blocks, and micro-gates that selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large weight matrices and the compute-bound cost of the arithmetic itself. A loss function that combines structural feasibility penalties with an analytical FLOPs cost drives HiAP to converge naturally to stable sub-networks. Extensive experiments on ImageNet demonstrate that HiAP automatically discovers highly efficient architectures and achieves a competitive accuracy-efficiency Pareto frontier for models such as DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
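As a minimal illustration of the stochastic gating mechanism the abstract describes, the sketch below shows a Gumbel-Sigmoid gate: a relaxed Bernoulli sample that stays differentiable during training and sharpens toward a hard 0/1 keep-or-prune decision as the temperature `tau` decreases. The function name and parameters are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random


def gumbel_sigmoid(logit, tau=1.0, rng=random):
    """Relaxed Bernoulli gate via the Gumbel-Sigmoid trick (illustrative sketch).

    Adding logistic noise (the difference of two Gumbel samples) to the
    gate logit and squashing with a tempered sigmoid yields a soft value
    in (0, 1); as tau -> 0 the sample concentrates near {0, 1}, so the
    learned sub-network hardens into a discrete pruning mask.
    """
    u = rng.random()
    # Logistic noise: log(U) - log(1 - U), U ~ Uniform(0, 1).
    noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))


# Example: a strongly positive logit keeps the unit (gate near 1),
# a strongly negative logit prunes it (gate near 0).
keep = gumbel_sigmoid(20.0, tau=0.5, rng=random.Random(1))
prune = gumbel_sigmoid(-20.0, tau=0.5, rng=random.Random(1))
```

In a hierarchical scheme such as the one the abstract outlines, one such gate would multiply each attention head or FFN block (macro level) and each intra-head dimension or FFN neuron (micro level), with the gate logits trained jointly against the task loss plus the FLOPs penalty.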