Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing into a new era. Following this trend, multimodal learning models keep growing in size, creating an urgent need to reduce the massive computational cost of finetuning them for downstream tasks. In this paper, we propose PMF, an efficient and flexible multimodal fusion method tailored for fusing unimodally pre-trained transformers. Specifically, we first present a modular multimodal fusion framework that is highly flexible and facilitates mutual interactions among different modalities. In addition, we disentangle vanilla prompts into three types, so that each type learns a different optimization objective for multimodal learning. Notably, we propose adding prompt vectors only to the deep layers of the unimodal transformers, which significantly reduces training memory usage. Experimental results show that our method achieves performance comparable to several other multimodal finetuning methods with less than 3% trainable parameters and up to 66% savings in training memory usage.
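The effect of adding prompt vectors only to the deep layers can be illustrated with a small back-of-the-envelope sketch. This is not the authors' implementation. It assumes a standard transformer encoder layer with roughly 12d^2 + 13d parameters (attention, a 4x MLP, and two layer norms), a frozen backbone, and prompt vectors of the same width as the hidden state. The function names, the number of prompts per layer, and the index of the first prompted layer are all hypothetical values chosen for illustration. It also assumes that gradients only need to flow through layers that receive trainable prompts, which is one plausible reading of why the memory saving arises.

```python
def frozen_backbone_params(num_layers: int, hidden_dim: int) -> int:
    """Approximate parameter count of a frozen transformer encoder stack.

    Assumes a standard layer: QKV and output projections (4*d^2 + 4*d),
    a 4x-wide MLP (8*d^2 + 5*d), and two layer norms (4*d).
    Embeddings and task heads are ignored in this sketch.
    """
    per_layer = 12 * hidden_dim ** 2 + 13 * hidden_dim
    return num_layers * per_layer


def prompt_budget(num_layers: int, hidden_dim: int,
                  prompts_per_layer: int, first_prompted_layer: int):
    """Return (trainable_prompt_params, layers_needing_backprop).

    Prompts are added to layers first_prompted_layer .. num_layers - 1.
    Layers below first_prompted_layer are frozen and receive no trainable
    input, so under this sketch's assumption they need no backward pass.
    """
    if not 0 <= first_prompted_layer <= num_layers:
        raise ValueError("first_prompted_layer must be in [0, num_layers]")
    prompted_layers = num_layers - first_prompted_layer
    trainable = prompted_layers * prompts_per_layer * hidden_dim
    return trainable, prompted_layers


# Hypothetical configuration: 12 layers, width 768, 4 prompts per layer.
backbone = frozen_backbone_params(num_layers=12, hidden_dim=768)
for start in (0, 8):  # prompts in every layer vs. only the last four layers
    trainable, backprop_layers = prompt_budget(12, 768, 4, start)
    print(f"first prompted layer = {start}: "
          f"{trainable} trainable prompt parameters "
          f"({trainable / backbone:.4%} of the backbone), "
          f"backward pass through {backprop_layers} of 12 layers")
```

Under these assumptions, moving the first prompted layer deeper shrinks both the number of trainable prompt parameters and the number of layers whose activations must be kept for the backward pass. The actual figures reported in the abstract depend on the full method, including its fusion modules, which this sketch does not model.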