Multimodal learning typically assumes that all modality combinations of interest are available during training, so that cross-modal correspondences can be learned. In this paper, we challenge this modality-complete assumption and instead strive for generalization to modality combinations that are unseen until inference. We pose the problem of unseen modality interaction and introduce a first solution. Our solution uses a feature projection module to project the multidimensional features of different modalities into a common space while preserving rich information. This allows information to be accumulated across the available modalities with a simple summation. To reduce overfitting to unreliable modality combinations during training, we further improve model learning with pseudo-supervision that indicates the reliability of each modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it on multimodal video classification, robot state regression, and multimedia retrieval.
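The project-then-sum idea can be illustrated with a minimal sketch. It is not the paper's implementation: the linear projections, the random (untrained) weights, the dimensions, and the names `COMMON_DIM`, `FEATURE_DIMS`, `PROJECTIONS`, `project`, and `fuse` are all assumptions made for illustration, and the pseudo-supervision component is not shown.

```python
import random

random.seed(0)

COMMON_DIM = 4  # hypothetical size of the shared space

# Hypothetical per-modality feature sizes.
FEATURE_DIMS = {"video": 6, "audio": 5}


def random_matrix(rows, cols):
    """Stand-in for learned projection weights (random here)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# One projection per modality, each mapping into the same common space.
PROJECTIONS = {
    name: random_matrix(dim, COMMON_DIM) for name, dim in FEATURE_DIMS.items()
}


def project(vec, matrix):
    """Linear projection: vec (length d) times matrix (d x COMMON_DIM)."""
    return [
        sum(v * row[j] for v, row in zip(vec, matrix))
        for j in range(len(matrix[0]))
    ]


def fuse(features):
    """Project every available modality and sum the results."""
    if not features:
        raise ValueError("at least one modality is required")
    fused = [0.0] * COMMON_DIM
    for name, vec in features.items():
        projected = project(vec, PROJECTIONS[name])
        fused = [a + b for a, b in zip(fused, projected)]
    return fused


video = [random.gauss(0.0, 1.0) for _ in range(FEATURE_DIMS["video"])]
audio = [random.gauss(0.0, 1.0) for _ in range(FEATURE_DIMS["audio"])]

# Any subset of modalities yields a vector of the same size, so a single
# downstream head can consume combinations that were never seen together.
video_only = fuse({"video": video})
audio_only = fuse({"audio": audio})
both = fuse({"video": video, "audio": audio})
```

Because fusion is a plain sum, the output size does not depend on which modalities are present, which is what makes unseen combinations usable at inference in this sketch.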