Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing into a new era. Following this trend, multimodal learning models keep growing in size, creating an urgent need to reduce the massive computational cost of finetuning them for downstream tasks. In this paper, we propose PMF, an efficient and flexible multimodal fusion method tailored for fusing unimodally pre-trained transformers. Specifically, we first present a modular multimodal fusion framework that is highly flexible and facilitates mutual interactions among different modalities. In addition, we disentangle vanilla prompts into three types, each learning a different optimization objective for multimodal learning. Notably, we add prompt vectors only to the deep layers of the unimodal transformers, which significantly reduces training memory usage. Experimental results show that our method achieves performance comparable to several other multimodal finetuning methods with fewer than 3% trainable parameters and up to 66% savings in training memory usage.
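To make the deep-layer prompting idea concrete, the following is a minimal bookkeeping sketch, not the PMF implementation. It assumes a frozen backbone, in which case gradients only need to flow back to the earliest layer that holds trainable parameters, so layers below that point need not retain activations for the backward pass. The names `PromptConfig`, `prompted_layers`, `prompt_param_count`, and `layers_needing_backward`, as well as every numeric value, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PromptConfig:
    # All values are hypothetical placeholders, not settings from the paper.
    num_layers: int = 12        # transformer depth of one unimodal encoder
    hidden_dim: int = 768       # token embedding width
    prompts_per_layer: int = 4  # prompt vectors inserted at each prompted layer
    start_layer: int = 8        # first (0-indexed) layer that receives prompts


def prompted_layers(cfg: PromptConfig) -> list[int]:
    """Indices of the layers that receive trainable prompt vectors."""
    return list(range(cfg.start_layer, cfg.num_layers))


def prompt_param_count(cfg: PromptConfig) -> int:
    """Number of trainable prompt parameters for one encoder."""
    return len(prompted_layers(cfg)) * cfg.prompts_per_layer * cfg.hidden_dim


def layers_needing_backward(cfg: PromptConfig) -> int:
    """Layers through which gradients must flow when the backbone is frozen.

    Gradients stop at the earliest layer holding trainable parameters, so
    layers below start_layer can run without storing activations.
    """
    return cfg.num_layers - cfg.start_layer


# Compare prompting only the deep layers with prompting every layer.
deep = PromptConfig(start_layer=8)
full = PromptConfig(start_layer=0)
print(prompt_param_count(deep), layers_needing_backward(deep))
print(prompt_param_count(full), layers_needing_backward(full))
```

Under these assumed settings, moving the first prompted layer deeper shrinks both the number of trainable prompt parameters and the number of layers that participate in backpropagation. The actual savings reported above depend on the real architecture and training setup, which this sketch does not model.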