Multimodal learning seeks to use data from multiple sources to improve performance on downstream tasks. Ideally, redundancy across correlated modalities should make multimodal systems robust to missing or corrupted observations in some of them. However, we observe that the performance of several existing multimodal networks deteriorates significantly if one or more modalities are absent at test time. To enable robustness to missing modalities, we propose simple and parameter-efficient adaptation procedures for pretrained multimodal networks. In particular, we exploit low-rank adaptation and modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge the performance drop caused by missing modalities and, in some cases, outperform independent, dedicated networks trained for the available modality combinations. The proposed adaptation requires an extremely small number of parameters (e.g., fewer than 0.7% of the total parameters in most experiments). We conduct a series of experiments to highlight the robustness of our proposed method using diverse datasets for RGB-thermal and RGB-depth semantic segmentation, multimodal material segmentation, and multimodal sentiment analysis tasks. Our proposed method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.
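To make the two adaptation mechanisms named above more concrete, the following is a minimal sketch of low-rank adaptation combined with feature modulation on a single frozen linear layer. It is illustrative only and is not the authors' implementation: the layer sizes, the rank, the per-channel scale-and-shift form of the modulation, and all names (`W`, `A`, `B`, `gamma`, `beta`, `adapted_layer`) are assumptions introduced here, since the abstract does not specify where or how the adaptation is applied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only.
d_in, d_out, rank = 512, 512, 4

# Frozen pretrained weight: a stand-in for one layer of a multimodal network.
W = rng.normal(size=(d_out, d_in))

# Low-rank update W + B @ A. B starts at zero, so the adapted layer
# initially reproduces the pretrained layer exactly.
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))

# Feature modulation (assumed form): per-channel scale and shift,
# initialised to the identity transform.
gamma = np.ones(d_out)
beta = np.zeros(d_out)


def adapted_layer(x):
    """Apply the frozen layer with a low-rank update, then scale and shift."""
    h = x @ (W + B @ A).T
    return gamma * h + beta


# A batch of features, e.g. from the modalities that are still available.
x = rng.normal(size=(4, d_in))
baseline = x @ W.T
out = adapted_layer(x)

# Only A, B, gamma, and beta would be trained; W stays frozen.
n_frozen = W.size
n_adapt = A.size + B.size + gamma.size + beta.size
print(f"adapter parameters: {n_adapt} vs. frozen parameters: {n_frozen}")
```

In a setup like this, only the small adapter tensors would be updated while training with some modalities dropped, which is why the number of learnable parameters stays small relative to the pretrained network. The ratio in this toy layer is not meant to match the figures reported in the abstract.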