We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and inter-modal contrastive learning with masking, and (3) self-training by reconstructing joint audio-video contextualized features learned from the first two objectives. Pre-training with MAViL not only enables the model to perform well in audio-visual classification and retrieval tasks but also improves representations of each modality in isolation, without using information from the other modality for fine-tuning or inference. Empirically, MAViL sets a new state-of-the-art on AudioSet (53.1 mAP) and VGGSound (67.1% accuracy). For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on these benchmarks.
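The first two objectives can be sketched with toy loss functions. This is a minimal illustration, not the authors' implementation: the abstract does not specify loss forms, so the mean-squared-error reconstruction loss, the InfoNCE-style contrastive loss, the temperature value, and all function names here (`random_mask`, `masked_reconstruction_loss`, `info_nce`) are assumptions chosen for illustration.

```python
import numpy as np


def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean array where True marks a masked (hidden) token."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask


def masked_reconstruction_loss(pred, target, mask):
    """Assumed loss form: mean squared error over masked tokens only."""
    per_token = ((pred - target) ** 2).mean(axis=-1)
    return float(per_token[mask].mean())


def info_nce(queries, keys, temperature=0.1):
    """Assumed loss form: InfoNCE, where row i of `queries` pairs with row i of `keys`.

    For the inter-modal case, `queries` would be audio embeddings and `keys`
    the video embeddings of the same clips. For the intra-modal case, both
    would come from two differently masked views of the same modality.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))
```

Under the same assumptions, the third objective (self-training) could reuse `masked_reconstruction_loss` with the target set to contextualized features produced by a model trained on the first two objectives, rather than to the raw input.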