What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?

With the growing success of multi-modal learning, the robustness of multi-modal models, especially when modalities are missing, is receiving increased attention. However, previous studies in this area often lack theoretical insight, or their methods are tied to specific network architectures or modalities. We model the scenario of a multi-modal model encountering missing modalities from an information-theoretic perspective and show that the performance ceiling in such scenarios can be approached by efficiently using the information in the non-missing modalities. In practice, two aspects are key: (1) the encoder should extract sufficiently good features from the non-missing modality; (2) the extracted features should be robust enough not to be affected by noise during cross-modal fusion. To this end, we introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA). UME-MMA initializes the multi-modal model with uni-modal pre-trained weights to enhance feature extraction and uses missing-modality data augmentation to better adapt to situations with missing modalities. In addition, because UME-MMA is built on a late-fusion learning framework, it allows plug-and-play use of various encoders, which makes it suitable for a wide range of modalities and enables seamless integration of large-scale pre-trained encoders to further improve performance. We demonstrate UME-MMA's effectiveness on audio-visual datasets (e.g., AV-MNIST, Kinetics-Sound, AVE) and vision-language datasets (e.g., MM-IMDB, UPMC Food101).
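To make the two ingredients named in the abstract more concrete, the following is a minimal sketch of missing-modality data augmentation combined with a late-fusion head. It is an illustration under stated assumptions, not the authors' implementation: the zero-filling of a dropped modality, the drop probability `p_missing`, the concatenation-based fusion, and all function names are hypothetical choices made for this example.

```python
import numpy as np


def missing_modality_augment(audio, visual, p_missing=0.3, rng=None):
    """Randomly blank at most one modality per sample.

    Assumption: a missing modality is simulated by zero-filling its input.
    The abstract does not specify the augmentation scheme.
    """
    rng = np.random.default_rng() if rng is None else rng
    audio, visual = audio.copy(), visual.copy()  # leave caller's arrays intact
    batch = audio.shape[0]
    # Per-sample choice: 0 = keep both, 1 = drop audio, 2 = drop visual.
    p_each = p_missing / 2.0
    choice = rng.choice(3, size=batch, p=[1.0 - p_missing, p_each, p_each])
    audio[choice == 1] = 0.0
    visual[choice == 2] = 0.0
    return audio, visual, choice


def late_fusion_logits(audio_feat, visual_feat, weight, bias):
    """Late fusion (assumed form): concatenate features, apply one linear head."""
    fused = np.concatenate([audio_feat, visual_feat], axis=1)
    return fused @ weight + bias


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 4))   # stand-in for audio encoder features
    visual = rng.normal(size=(8, 6))  # stand-in for visual encoder features
    a_aug, v_aug, choice = missing_modality_augment(audio, visual, rng=rng)
    weight = rng.normal(size=(10, 3))  # 4 + 6 fused dims, 3 classes
    logits = late_fusion_logits(a_aug, v_aug, weight, np.zeros(3))
    print(choice, logits.shape)
```

In this sketch the encoders are replaced by random feature arrays; in a late-fusion setup of the kind the abstract describes, each array would instead come from a separately pre-trained uni-modal encoder, so swapping encoders only changes the feature dimensions passed to the fusion head.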