We introduce the $\textbf{O}$ne-shot $\textbf{P}$runing $\textbf{T}$echnique for $\textbf{I}$nterchangeable $\textbf{N}$etworks ($\textbf{OPTIN}$) framework as a tool to increase the efficiency of pre-trained transformer architectures $\textit{without requiring re-training}$. Recent works have explored improving transformer efficiency, but they often incur computationally expensive re-training procedures or depend on architecture-specific characteristics, which impedes practical wide-scale adoption. To address these shortcomings, the OPTIN framework leverages intermediate feature distillation, capturing the long-range dependencies of model parameters (coined $\textit{trajectory}$), to produce state-of-the-art results on natural language, image classification, transfer learning, and semantic segmentation tasks $\textit{without re-training}$. Given a FLOP constraint, the OPTIN framework compresses the network while maintaining competitive accuracy and improving throughput. In particular, we show a $\leq 2$% accuracy degradation from NLP baselines and a $0.5$% improvement over state-of-the-art methods on image classification at competitive FLOP reductions. We further demonstrate generalization across tasks and architectures, with comparative performance using Mask2Former for semantic segmentation and CNN-style networks. OPTIN is one of the first efficient one-shot frameworks for compressing transformer architectures that generalizes well across domains, in particular natural language and image-related tasks, $\textit{without re-training}$.
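To make the phrase "given a FLOP constraint" concrete, the following is a minimal sketch of budget-constrained selection of prunable units. It is not the OPTIN algorithm: the abstract does not specify how importance is scored or how the constraint is solved, so the `Unit` class, the `select_units` function, the greedy importance-per-FLOP rule, and all numbers are hypothetical and serve only as an illustration. In OPTIN, the importance scores would presumably come from the trajectory-based feature distillation described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A prunable unit such as an attention head or MLP neuron (hypothetical)."""
    name: str
    importance: float  # assumed precomputed; higher means more worth keeping
    flops: float       # cost of keeping this unit, in arbitrary FLOP units


def select_units(units, flop_budget):
    """Greedily keep units by importance per FLOP until the budget is spent.

    This is a generic heuristic chosen for illustration, not the
    selection rule used by OPTIN.
    """
    ranked = sorted(units, key=lambda u: u.importance / u.flops, reverse=True)
    kept, used = [], 0.0
    for unit in ranked:
        # Keep the unit only if it still fits in the remaining budget.
        if used + unit.flops <= flop_budget:
            kept.append(unit.name)
            used += unit.flops
    return kept, used


# Toy scores and costs, invented for this example.
units = [
    Unit("head_0", importance=0.90, flops=4.0),
    Unit("head_1", importance=0.10, flops=4.0),
    Unit("mlp_0", importance=0.60, flops=2.0),
    Unit("mlp_1", importance=0.05, flops=2.0),
]
kept, used = select_units(units, flop_budget=6.0)
print(kept, used)
```

Because the selection needs only precomputed scores and per-unit costs, it runs once over the pre-trained model and involves no gradient updates, which matches the one-shot, no-re-training setting the abstract describes.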