We abstract the features (i.e., learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions while still learning uni-modal features adequately. However, recent supervised multi-modal late-fusion training approaches still learn the uni-modal features of each modality insufficiently. We prove that this insufficient learning hurts the model's generalization ability. To address this, for a given supervised multi-modal task we propose choosing a targeted late-fusion learning method, either Uni-Modal Ensemble (UME) or the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under this simple guiding strategy, we achieve results comparable to those of other, more complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.
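The abstract does not specify how Uni-Modal Ensemble combines the predictions of its uni-modal models. As a minimal sketch only, assuming each modality has a separately trained classifier and that the ensemble is a weighted average of their softmax probabilities, the late-fusion step might look like the following. The function names, the `weights` parameter, and the example logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uni_modal_ensemble(logits_per_modality, weights=None):
    """Fuse predictions of independently trained uni-modal models.

    Assumption (not from the paper): fusion is a weighted average of
    per-modality softmax probabilities, with uniform weights by default.
    """
    n = len(logits_per_modality)
    if n == 0:
        raise ValueError("need at least one modality")
    if weights is None:
        weights = [1.0 / n] * n
    probs = [softmax(logits) for logits in logits_per_modality]
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(num_classes)
    ]


# Hypothetical logits from an audio model and a video model over 3 classes.
audio_logits = [2.0, 0.5, 0.1]
video_logits = [0.3, 1.8, 0.2]

fused = uni_modal_ensemble([audio_logits, video_logits])
prediction = max(range(len(fused)), key=fused.__getitem__)
print(fused, prediction)
```

Because each uni-modal model is trained on its own in this sketch, no cross-modal interaction occurs during training, so it can only capture uni-modal features; paired features would need a jointly trained model.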