Multimodal machine learning, which mimics the human brain's ability to integrate various modalities, has seen rapid growth. Most previous multimodal models are trained on perfectly paired multimodal input to reach optimal performance. In real-world deployments, however, modality availability is highly variable and unpredictable, causing pre-trained models to suffer significant performance drops and lose robustness under dynamically missing modalities. In this paper, we present a novel Cyclic INformative learning framework (CyIN) to bridge the gap between complete and incomplete multimodal learning. Specifically, we first build an informative latent space by applying token- and label-level Information Bottleneck (IB) objectives cyclically across modalities. By capturing task-related features with variational approximation, the informative bottleneck latents are purified for more efficient cross-modal interaction and multimodal fusion. Moreover, to supplement the information lost under incomplete multimodal input, we propose cross-modal cyclic translation, which reconstructs the missing modalities from the remaining ones through forward and reverse propagation. With the extracted and reconstructed informative latents, CyIN jointly optimizes complete and incomplete multimodal learning in one unified model. Extensive experiments on four multimodal datasets demonstrate the superior performance of our method in both complete and diverse incomplete scenarios.
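To make the IB idea above concrete, the following is a minimal sketch of a variational Information Bottleneck layer of the kind the abstract refers to. It is an illustration only, not the paper's actual model: the function name `ib_bottleneck`, the linear encoder weights, and the toy dimensions are all assumptions, and the standard-normal prior with a KL compression penalty is the common variational-IB formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ib_bottleneck(h, w_mu, w_logvar):
    """Illustrative variational IB layer (assumed form, not the paper's model).

    Maps a modality feature h to a stochastic bottleneck latent z and returns
    z together with the KL term that penalizes task-irrelevant information.
    """
    mu = h @ w_mu                         # mean of q(z|h)
    logvar = h @ w_logvar                 # log-variance of q(z|h)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps   # reparameterization trick
    # KL( q(z|h) || N(0, I) ): the compression penalty bounding I(X; Z)
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return z, kl

# Toy example: 4 tokens with 8-dim features compressed to a 3-dim latent.
h = rng.standard_normal((4, 8))
w_mu = rng.standard_normal((8, 3)) * 0.1
w_logvar = rng.standard_normal((8, 3)) * 0.1
z, kl = ib_bottleneck(h, w_mu, w_logvar)
print(z.shape, kl >= 0.0)
```

In a full training objective, this KL term would be weighted against a task loss computed from z, so the latent keeps label-relevant information while discarding the rest.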