This paper explores the task of leveraging auxiliary modalities that are available only during training to enhance multimodal representation learning through cross-modal Knowledge Distillation (KD). The widely adopted objective based on mutual information maximization admits a shortcut solution with a weak teacher: the maximum mutual information can be achieved simply by making the teacher model as weak as the student model. To prevent this weak solution, we introduce an additional objective term, the mutual information between the teacher and the auxiliary modality model. In addition, to narrow the information gap between the student and the teacher, we propose to minimize the conditional entropy of the teacher given the student. Novel training schemes based on contrastive learning and adversarial learning are designed to optimize the mutual information and the conditional entropy, respectively. Experimental results on three popular multimodal benchmark datasets show that the proposed method outperforms a range of state-of-the-art approaches for video recognition, video retrieval and emotion classification.
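The three terms described above can be collected into one illustrative objective. This is only a sketch under stated assumptions: the symbols $S$, $T$ and $A$ (the representations of the student, the teacher and the auxiliary modality model), the trade-off weights $\alpha$ and $\beta$, and the additive combination are all hypothetical notation introduced here, since the abstract does not specify how the terms are weighted or combined.

```latex
% Illustrative only: S, T, A, \alpha, \beta and the additive form are assumptions.
\begin{align}
\max_{\theta_S,\,\theta_T}\;\; & I(S;T) \;+\; \alpha\, I(T;A) \;-\; \beta\, H(T \mid S), \\
\text{where}\quad & I(S;T) \;=\; H(T) - H(T \mid S).
\end{align}
```

Under this reading, the term $I(T;A)$ rewards a teacher that stays informative about the auxiliary modality, while the term $H(T \mid S)$ penalizes teacher information that the student cannot account for. According to the abstract, the mutual information terms are optimized with contrastive learning and the conditional entropy with adversarial learning.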