Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Meanwhile, increasingly heavy models, e.g., Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes \textbf{U}nified and \textbf{P}r\textbf{o}gressive \textbf{P}runing (UPop), a universal vision-language Transformer compression framework that 1) searches for multimodal subnets of the original model in a unified manner within a continuous optimization space, which enables automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressively searches and retrains the subnet, which maintains convergence between the search and retraining phases and thereby attains higher compression ratios. Experiments on multiple generative and discriminative vision-language tasks, including Visual Reasoning, Image Captioning, Visual Question Answering, Image-Text Retrieval, Text-Image Retrieval, and Image Classification, demonstrate the effectiveness and versatility of the proposed UPop framework.
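The two ideas named in the abstract can be illustrated with a toy sketch. It is not the UPop algorithm itself, only a minimal illustration under stated assumptions: importance scores are given as plain NumPy arrays per modality, units are ranked jointly across modalities so that per-modality pruning ratios fall out of a single global threshold, and pruned units are faded out with a linear schedule rather than removed at once. The function names `global_keep_mask` and `progressive_mask`, the score values, and the linear schedule are all hypothetical choices made for this example.

```python
import numpy as np


def global_keep_mask(scores, prune_ratio):
    """Rank units from all modalities together and prune the lowest-scoring ones.

    scores: dict mapping a (hypothetical) modality name to a 1-D array of
            importance scores, one per prunable unit.
    prune_ratio: overall fraction of units to prune across all modalities.
    Returns a dict of boolean arrays (True = keep).
    Note: units tied with the threshold score are pruned as well.
    """
    flat = np.concatenate([s.ravel() for s in scores.values()])
    n_prune = int(round(prune_ratio * flat.size))
    if n_prune == 0:
        return {k: np.ones_like(s, dtype=bool) for k, s in scores.items()}
    # Single threshold shared by every modality: this is what lets the
    # per-modality pruning ratios differ without being set by hand.
    threshold = np.sort(flat)[n_prune - 1]
    return {k: s > threshold for k, s in scores.items()}


def progressive_mask(keep, step, total_steps):
    """Soft mask: kept units stay at 1, pruned units decay linearly to 0."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return np.where(keep, 1.0, 1.0 - t)


# Illustrative scores (made up for this example).
scores = {
    "vision": np.array([0.9, 0.8, 0.7, 0.2]),
    "language": np.array([0.6, 0.1, 0.05, 0.3]),
}
keep = global_keep_mask(scores, prune_ratio=0.5)
for name, k in keep.items():
    print(name, "pruned fraction:", 1.0 - k.mean())
print(progressive_mask(keep["vision"], step=5, total_steps=10))
```

In this toy setting the overall pruning ratio is 50%, yet the vision scores lose one unit out of four while the language scores lose three out of four, which is the sense in which a joint search assigns ratios automatically. The gradual decay stands in for the progressive step: the network is never confronted with an abrupt jump from the dense model to the pruned one.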