We abstract the features (i.e., learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions while still learning uni-modal features adequately. However, recent supervised multi-modal late-fusion training approaches still learn the uni-modal features of each modality insufficiently. We prove that this phenomenon hurts the model's generalization ability. To address this, we propose choosing a targeted late-fusion learning method for a given supervised multi-modal task, either Uni-Modal Ensemble (UME) or the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we achieve results comparable to those of more complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.
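As a rough illustration of the two options named above, the following minimal sketch shows one plausible reading of each. It rests on assumptions the abstract does not state: that UME combines independently trained uni-modal models by averaging their class probabilities, and that UMT adds a feature-distillation term (mean squared error here) that pulls each encoder in the late-fusion model toward a frozen uni-modal teacher. The function names, the two-modality setup, and the `weight` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ume_predict(logits_a, logits_b):
    """Assumed UME rule: average the class probabilities of two
    independently trained uni-modal models (no cross-modal interaction)."""
    probs_a = softmax(logits_a)
    probs_b = softmax(logits_b)
    return [(pa + pb) / 2 for pa, pb in zip(probs_a, probs_b)]


def umt_loss(fused_logits, label, student_feats, teacher_feats, weight=1.0):
    """Assumed UMT objective: task cross-entropy on the fused prediction
    plus an MSE term between each modality's encoder features (student)
    and the features of a frozen uni-modal model (teacher)."""
    cross_entropy = -math.log(softmax(fused_logits)[label])
    distill = 0.0
    for student, teacher in zip(student_feats, teacher_feats):
        distill += sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
    return cross_entropy + weight * distill
```

Under these assumptions, UME never trains the modalities jointly, so it cannot pick up paired features, whereas UMT keeps the fused training signal and uses the teachers only to preserve uni-modal features. That trade-off is the one the selection strategy weighs.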