Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state of the art on VGGSound and AudioSet. Furthermore, a single audiovisual pretrained model can be used for multiple unimodal downstream tasks. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.