Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, \textit{e.g.}, Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, remains under-explored. This paper proposes \textbf{U}nified and \textbf{P}r\textbf{o}gressive \textbf{P}runing (\textbf{\emph{UPop}}), a universal vision-language Transformer compression framework that incorporates 1) a unified search of the original model for multimodal subnets in a continuous optimization space, which enables automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressive searching and retraining of the subnet, which maintains convergence between the search and retraining phases to attain higher compression ratios. Experiments on various tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed UPop framework. The code is available at https://github.com/sdc17/UPop.