Multimodal learning typically assumes that all modalities are fully available during both training and inference. In real-world scenarios, however, complete multimodal data are often difficult to acquire, leading to the problem of missing modalities: data for certain modalities are absent, which hinders not only the applicability of multimodal pretrained models but also their fine-tuning and their robustness on downstream tasks. To address these challenges, we propose a novel framework that combines parameter-efficient fine-tuning of unimodal pretrained models with a self-supervised joint-embedding learning method, enabling the model to predict the embedding of a missing modality in the representation space at inference time. Through prompt tuning, our method effectively predicts the missing embedding by leveraging information from the available modalities. We evaluate our approach on several multimodal benchmark datasets and demonstrate its effectiveness and robustness under various missing-modality scenarios.
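As a rough illustration of the idea described above (not the authors' implementation), the sketch below shows one way learnable prompt tokens attached to an available modality's token sequence could be trained to predict the missing modality's embedding in a shared representation space. All names here (`PromptedPredictor`, `embedding_matching_loss`, the dimensions, and the stand-in encoder outputs) are hypothetical assumptions for the sake of the example; the pretrained unimodal encoders are assumed frozen, so only the prompts and the small predictor module are updated.

```python
# Minimal sketch (hypothetical, not the paper's code): predicting a missing
# modality's embedding from an available one via learnable prompt tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedPredictor(nn.Module):
    """Prepends learnable prompt tokens to the available modality's token
    sequence and reads out a predicted embedding for the missing modality."""
    def __init__(self, embed_dim=512, num_prompts=8, num_layers=2):
        super().__init__()
        # Learnable prompt tokens; together with the small encoder below,
        # these are the only trainable parameters (the unimodal encoders
        # producing the input tokens are assumed frozen).
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, tokens):                      # tokens: (B, T, D) from a frozen encoder
        B = tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([prompts, tokens], dim=1)     # (B, P+T, D)
        x = self.encoder(x)
        return x[:, 0]                              # first prompt slot serves as the prediction

def embedding_matching_loss(pred, target):
    """Self-supervised objective: align the predicted embedding with the
    target modality's embedding in the joint space (cosine distance)."""
    return 1.0 - F.cosine_similarity(pred, target.detach(), dim=-1).mean()

# Usage with stand-in encoder outputs: text is available, the image is
# treated as the missing modality (its embedding is used only for training).
text_tokens = torch.randn(4, 16, 512)   # frozen text-encoder tokens (B, T, D)
image_emb   = torch.randn(4, 512)       # frozen image-encoder embedding (B, D)
predictor = PromptedPredictor()
loss = embedding_matching_loss(predictor(text_tokens), image_emb)
loss.backward()                          # gradients flow only into prompts/predictor
```

At inference, when the image modality is absent, the predicted embedding would stand in for the missing one so that the downstream multimodal head still receives inputs for every modality.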