Multimodal learning seeks to use data from multiple sources to improve performance on downstream tasks. Ideally, redundancy across correlated modalities would make multimodal systems robust to missing or corrupted observations in some of them. However, we observe that the performance of several existing multimodal networks deteriorates significantly when one or more modalities are absent at test time. To enable robustness to missing modalities, we propose simple and parameter-efficient adaptation procedures for pretrained multimodal networks. In particular, we exploit low-rank adaptation and modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially recover the performance lost to missing modalities and, in some cases, outperform independent, dedicated networks trained for the available modality combinations. The proposed adaptation requires an extremely small number of parameters (e.g., fewer than 0.7% of the total parameters in most experiments). We conduct a series of experiments to highlight the robustness of our proposed method on diverse datasets for RGB-thermal and RGB-depth semantic segmentation, multimodal material segmentation, and multimodal sentiment analysis. Our proposed method demonstrates versatility across tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.
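The two ingredients named in the abstract, low-rank adaptation and modulation of intermediate features, can be illustrated on a single frozen linear layer. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the abstract does not say where adapters are placed, how they are initialized, or what form the modulation takes, so the layer sizes, the rank, the zero/identity initialization, the per-channel scale-and-shift form, and all names (`W`, `A`, `B`, `gamma`, `beta`, `adapted_forward`) are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only.
d_in, d_out, rank = 1024, 1024, 4

# Frozen pretrained parameters (stand-ins for one layer of a pretrained network).
W = rng.standard_normal((d_out, d_in))
b = rng.standard_normal(d_out)

# Trainable low-rank factors. B starts at zero (an assumed, common choice),
# so the low-rank update B @ A contributes nothing before training.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

# Trainable per-channel feature modulation, initialized to the identity
# (scale 1, shift 0). The scale-and-shift form is an assumption.
gamma = np.ones(d_out)
beta = np.zeros(d_out)


def pretrained_forward(x):
    """Output of the frozen layer alone."""
    return x @ W.T + b


def adapted_forward(x):
    """Frozen layer plus low-rank update, followed by scale-and-shift modulation."""
    h = x @ W.T + b + (x @ A.T) @ B.T  # equivalent to x @ (W + B @ A).T + b
    return gamma * h + beta


def count_params(*arrays):
    """Total number of scalar entries across the given arrays."""
    return sum(a.size for a in arrays)


x = rng.standard_normal((8, d_in))
same_at_init = np.allclose(pretrained_forward(x), adapted_forward(x))

frozen_params = count_params(W, b)
adapter_params = count_params(A, B, gamma, beta)
adapter_fraction = adapter_params / frozen_params
```

With these initializations the adapted layer reproduces the pretrained layer exactly before any training, and only `A`, `B`, `gamma`, and `beta` would be updated when adapting to a given missing-modality scenario. The value of `adapter_fraction` here reflects only the hypothetical sizes above and is not a reproduction of the 0.7% figure reported in the abstract.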