We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams, and uses these attention features to encode the audio and visual inputs based on their correspondence. We evaluated the representations learned by the model on two tasks: classifying audio-visual correlation and recommending sound effects for visual scenes. Our results show that the representations generated by the attention model improve correlation accuracy by 18% and recommendation accuracy by 10% over the baseline on VGG-Sound, a public video dataset. Additionally, audio-visual representations learned by training the attention model with cross-modal contrastive learning further improve recommendation performance, based on our evaluation on VGG-Sound and a more challenging dataset of gameplay video recordings.
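To make the two ingredients named above concrete, here is a minimal sketch of (a) attention-weighted pooling over multi-resolution convolutional features and (b) a cross-modal contrastive objective. It assumes PyTorch; the module and function names (MultiResolutionAttention, cross_modal_contrastive_loss), the feature dimensions, and the specific use of a symmetric InfoNCE loss are illustrative assumptions, not the paper's exact architecture or loss.

```python
# Sketch only: illustrative shapes and names, not the authors' exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionAttention(nn.Module):
    """Learns the relative importance of conv features extracted at
    different resolutions, then pools them into one embedding."""
    def __init__(self, feat_dims, embed_dim=128):
        super().__init__()
        # Project each resolution's (globally pooled) features to a common size.
        self.proj = nn.ModuleList(nn.Linear(d, embed_dim) for d in feat_dims)
        self.score = nn.Linear(embed_dim, 1)  # scalar attention score per resolution

    def forward(self, feats):
        # feats: list of (batch, dim_i) tensors, one per resolution
        z = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, R, E)
        w = torch.softmax(self.score(torch.tanh(z)).squeeze(-1), dim=1)   # (B, R)
        return (w.unsqueeze(-1) * z).sum(dim=1)                           # (B, E)

def cross_modal_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE: the corresponding audio/visual pair in each row
    is the positive; all other pairs in the batch are negatives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: three feature resolutions per stream, batch of 4 clips.
dims = (64, 128, 256)
audio_feats = [torch.randn(4, d) for d in dims]
visual_feats = [torch.randn(4, d) for d in dims]
audio_attn = MultiResolutionAttention(dims)
visual_attn = MultiResolutionAttention(dims)
loss = cross_modal_contrastive_loss(audio_attn(audio_feats), visual_attn(visual_feats))
```

Under this reading, corresponding audio and visual clips are pulled together in the shared embedding space while mismatched pairs are pushed apart, which is what makes the embeddings usable for both correlation classification and nearest-neighbor sound-effect recommendation.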