Multimodal learning assumes that all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space while preserving rich information. This allows the information to be accumulated across the available modalities with a simple summation operation. To reduce overfitting to less discriminative modality combinations during training, we further improve model learning with pseudo-supervision that indicates the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it on multimodal video classification, robot state regression, and multimedia retrieval. Project website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/.
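To make the fusion idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each modality has its own linear projection into a shared space of fixed size and that fusion is a plain element-wise sum over whichever modalities are present. The random matrices stand in for a learned projection module, and all names (`COMMON_DIM`, `make_projection`, `project`, `fuse`, the "video" and "audio" modalities and their feature sizes) are hypothetical.

```python
import random

COMMON_DIM = 4  # hypothetical size of the shared feature space


def make_projection(in_dim, out_dim, seed):
    """Random linear map standing in for a learned projection module."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, features):
    """Map one modality's feature vector into the common space."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


def fuse(projections, inputs):
    """Sum the projected features of whichever modalities are present."""
    if not inputs:
        raise ValueError("at least one modality is required")
    fused = [0.0] * COMMON_DIM
    for name, features in inputs.items():
        projected = project(projections[name], features)
        fused = [a + b for a, b in zip(fused, projected)]
    return fused


# Hypothetical modalities with different native feature sizes.
projections = {
    "video": make_projection(6, COMMON_DIM, seed=0),
    "audio": make_projection(3, COMMON_DIM, seed=1),
}

video_feat = [0.5] * 6
audio_feat = [1.0] * 3

# Each modality alone, then a combination that need not have been seen together.
video_only = fuse(projections, {"video": video_feat})
audio_only = fuse(projections, {"audio": audio_feat})
both = fuse(projections, {"video": video_feat, "audio": audio_feat})
```

Because every modality lands in a space of the same size, the fused vector has the same shape for any subset of inputs, so a downstream head could in principle consume a combination it never saw during training. The pseudo-supervision component described in the abstract is not covered by this sketch.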