Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, \textit{e.g.}, Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes \textbf{U}nified and \textbf{P}r\textbf{o}gressive \textbf{P}runing (\textbf{\emph{UPop}}) as a universal vision-language Transformer compression framework, which incorporates 1) a unified search for multimodal subnets in a continuous optimization space from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressive searching and retraining of the subnet, which maintains convergence between the search and the retraining to attain higher compression ratios. Experiments on various tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed UPop framework. The code is available at https://github.com/sdc17/UPop.
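To make the two ingredients concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each compressible unit (e.g., an attention head or a feed-forward neuron) already carries a scalar importance score, which stands in here for the masks that would be learned in the continuous search space. The function names, the group names, and the cosine ramp are all hypothetical choices made for illustration only.

```python
import math


def unified_prune_counts(scores, total_ratio):
    """Rank units from all groups jointly and prune the lowest fraction.

    scores: dict mapping a (hypothetical) group name to a list of
            importance scores, one per prunable unit.
    total_ratio: overall fraction of units to prune, in [0, 1].
    Returns a dict with the number of pruned units per group, so the
    per-group pruning ratio falls out of the joint ranking instead of
    being set by hand.
    """
    ranked = sorted((s, name) for name, vals in scores.items() for s in vals)
    n_prune = int(total_ratio * len(ranked))
    counts = {name: 0 for name in scores}
    for _, name in ranked[:n_prune]:
        counts[name] += 1
    return counts


def progressive_ratio(step, total_steps, target_ratio):
    """Assumed cosine ramp of the pruned fraction from 0 to target_ratio.

    Raising the ratio gradually lets the remaining weights adapt while
    the subnet is still being searched, instead of pruning in one shot.
    """
    t = min(step, total_steps) / total_steps
    return target_ratio * 0.5 * (1.0 - math.cos(math.pi * t))


# Toy scores for two hypothetical groups from different modalities.
scores = {
    "vision_attn": [0.9, 0.1, 0.3, 0.2],
    "text_ffn": [0.7, 0.6, 0.5, 0.4],
}
for step in (0, 5, 10):
    ratio = progressive_ratio(step, total_steps=10, target_ratio=0.5)
    print(step, round(ratio, 3), unified_prune_counts(scores, ratio))
```

In this toy setting the joint ranking removes more units from the group with lower scores, which is the sense in which pruning ratios are assigned automatically across modalities and structures; the ramp only illustrates the idea of increasing the compression ratio gradually during search.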