Cross-modal distillation is widely used to transfer knowledge across modalities, enriching the representation of the target modality. Recent studies closely tie the temporal synchronization between vision and sound to the semantic consistency that cross-modal distillation relies on. However, in unconstrained videos, synchronization alone can hardly guarantee such semantic consistency, because of irrelevant modality noise and differentiated semantic correlation. To address this, we first propose a \textit{Modality Noise Filter} (MNF) module that uses cross-modal context to erase irrelevant noise in the teacher modality. After this purification, we design a \textit{Contrastive Semantic Calibration} (CSC) module that adaptively distills useful knowledge for the target modality by referring to the differentiated sample-wise semantic correlation in a contrastive fashion. Extensive experiments show that our method improves performance over other distillation methods on both visual action recognition and video retrieval. We also extend the method to audio tagging to demonstrate its generalization ability. The source code is available at \href{https://github.com/GeWu-Lab/cross-modal-distillation}{https://github.com/GeWu-Lab/cross-modal-distillation}.