Contrastive learning is a powerful way of learning multimodal representations across various domains, such as image-caption retrieval and audio-visual representation learning. In this work, we investigate whether these findings generalize to the domain of music videos. Specifically, we create a dual encoder for the audio and video modalities and train it using a bidirectional contrastive loss. For the experiments, we use an industry dataset containing 550,000 music videos as well as the public Million Song Dataset, and evaluate the quality of the learned representations on the downstream tasks of music tagging and genre classification. Our results indicate that pre-trained networks without contrastive fine-tuning outperform our contrastive learning approach on both tasks. To better understand why contrastive learning was not successful for music videos, we perform a qualitative analysis of the learned representations, which reveals why contrastive learning might have difficulty uniting embeddings from the two modalities. Based on these findings, we outline possible directions for future work. To facilitate the reproducibility of our results, we share our code and the pre-trained model.
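For readers unfamiliar with the term, a bidirectional contrastive loss over paired audio and video embeddings can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes a symmetric InfoNCE-style objective with cosine similarity, and the function names and the `temperature` value are hypothetical choices made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def bidirectional_contrastive_loss(audio_emb, video_emb, temperature=0.1):
    """Average of the audio-to-video and video-to-audio contrastive losses.

    audio_emb[i] and video_emb[i] are assumed to come from the same music
    video; all other pairings in the batch act as negatives.
    The temperature value is an assumption, not taken from the paper.
    """
    audio = [l2_normalize(a) for a in audio_emb]
    video = [l2_normalize(v) for v in video_emb]
    # Cosine similarity between every audio and every video embedding.
    logits = [
        [sum(x * y for x, y in zip(a, v)) / temperature for v in video]
        for a in audio
    ]
    logits_t = [list(col) for col in zip(*logits)]  # video-to-audio direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    audio_batch = [[1.0, 0.0], [0.0, 1.0]]
    matched_video = [[1.0, 0.0], [0.0, 1.0]]
    mismatched_video = [[0.0, 1.0], [1.0, 0.0]]
    print(bidirectional_contrastive_loss(audio_batch, matched_video))
    print(bidirectional_contrastive_loss(audio_batch, mismatched_video))
```

In this toy batch, the loss is close to zero when each audio embedding lines up with its own video embedding and much larger when the pairs are swapped, which is the behaviour the training objective rewards.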