We present XKD, a novel self-supervised framework for learning meaningful representations from unlabelled video clips. XKD is trained with two pseudo tasks. First, masked data reconstruction is performed to learn individual representations from the audio and visual streams. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through teacher-student setups to learn complementary information. To identify the most effective information to transfer, and to tackle the domain gap between the audio and visual modalities that could hinder knowledge transfer, we introduce a domain alignment and feature refinement strategy for effective cross-modal knowledge distillation. Lastly, to develop a general-purpose network capable of handling both audio and visual streams, we introduce modality-agnostic variants of our proposed framework, which use the same backbone for both modalities. Our proposed cross-modal knowledge distillation improves the linear evaluation top-1 accuracy of video action classification by 8.6% on UCF101, 8.2% on HMDB51, 13.9% on Kinetics-Sound, and 15.7% on Kinetics400. Additionally, our modality-agnostic variant shows promising results in developing a general-purpose network capable of learning both data streams for solving different downstream tasks.
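To make the two pseudo tasks concrete, the following is a minimal sketch of how a masked reconstruction loss and a cross-modal distillation loss could be combined into one objective. It is an illustration under stated assumptions, not the XKD implementation: the abstract does not specify the loss functions, so the mean squared error on masked positions, the temperature-scaled cross-entropy between teacher and student distributions, the temperature values, the bidirectional pairing (audio teacher to video student and the reverse), and every function and parameter name here are hypothetical. The domain alignment and feature refinement strategy is not modelled.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]


def masked_reconstruction_loss(target, prediction, mask):
    """Mean squared error over masked positions only (mask[i] truthy = masked).

    Assumption: the abstract only says 'masked data reconstruction';
    MSE restricted to masked positions is one common choice.
    """
    errors = [(t - p) ** 2 for t, p, m in zip(target, prediction, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


def distillation_loss(teacher_logits, student_logits,
                      teacher_temp=0.5, student_temp=1.0):
    """Cross-entropy between teacher and student output distributions.

    Assumption: temperatures and the cross-entropy form are illustrative.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def combined_objective(audio, video, kd_weight=1.0):
    """Sum of per-modality reconstruction losses and cross-modal distillation.

    `audio` and `video` are dicts with keys 'target', 'prediction', 'mask',
    'teacher_logits' and 'student_logits' (a hypothetical layout).
    Assumption: each student learns from the teacher of the OTHER modality.
    """
    reconstruction = (
        masked_reconstruction_loss(audio["target"], audio["prediction"], audio["mask"])
        + masked_reconstruction_loss(video["target"], video["prediction"], video["mask"])
    )
    cross_modal = (
        distillation_loss(audio["teacher_logits"], video["student_logits"])
        + distillation_loss(video["teacher_logits"], audio["student_logits"])
    )
    return reconstruction + kd_weight * cross_modal


if __name__ == "__main__":
    audio = {"target": [1.0, 2.0, 3.0], "prediction": [1.0, 0.0, 0.0],
             "mask": [0, 1, 1],
             "teacher_logits": [2.0, 0.0], "student_logits": [1.5, 0.2]}
    video = {"target": [0.5, 0.5], "prediction": [0.5, 0.0],
             "mask": [1, 1],
             "teacher_logits": [1.0, 0.5], "student_logits": [0.3, 0.1]}
    print("combined objective:", combined_objective(audio, video))
```

In this sketch, `kd_weight` trades off reconstruction against distillation. Setting it to zero reduces the objective to masked reconstruction alone, which gives a simple baseline for isolating the contribution of the distillation term.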