Contrastive learning is an emerging branch of self-supervised learning that leverages large amounts of unlabeled data by learning a latent space in which pairs of different views of the same sample are associated. In this paper, we propose musical source association as a pair generation strategy for contrastive music representation learning. To this end, we modify COLA, a widely used contrastive learning framework for audio, so that it learns to associate a song excerpt with a stochastically selected, automatically extracted vocal or instrumental source. We further introduce a novel modification to the contrastive loss that incorporates information about the presence or absence of specific sources. Our experimental evaluation on three downstream tasks (music auto-tagging, instrument classification, and music genre classification), using the publicly available Magna-Tag-A-Tune (MTAT) as the source dataset, yields results competitive with existing methods in the literature, as well as faster network convergence. The results also show that this pre-training method can be steered towards specific features according to the selected musical source, while remaining dependent on the quality of the separated sources.
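To make the pairing idea concrete, the following is a minimal sketch of how a mixture excerpt could be matched against a stochastically chosen separated source under a standard InfoNCE-style contrastive objective. It is an illustration under stated assumptions, not the implementation used in this work: the function names (`pick_source`, `info_nce`), the plain dot-product similarity, the temperature parameter, and the toy two-dimensional embeddings are all hypothetical, and the proposed loss modification for source presence or absence is not reproduced here.

```python
import math
import random


def pick_source(sources, rng):
    """Stochastically pick one separated source for a song excerpt.

    `sources` maps a source name (e.g. "vocals", "accompaniment") to its
    embedding. The names are illustrative assumptions only.
    """
    name = rng.choice(sorted(sources))
    return name, sources[name]


def dot(u, v):
    """Dot-product similarity (an assumed stand-in for the model's similarity)."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchors, positives, temperature=1.0):
    """Mean cross-entropy over a batch where positives[i] matches anchors[i].

    Every other positives[j], j != i, acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy embeddings: two mixture excerpts and their separated sources.
    mixtures = [[1.0, 0.0], [0.0, 1.0]]
    separated = [
        {"vocals": [1.0, 0.0], "accompaniment": [0.9, 0.1]},
        {"vocals": [0.0, 1.0], "accompaniment": [0.1, 0.9]},
    ]
    chosen = [pick_source(s, rng)[1] for s in separated]
    print("loss:", info_nce(mixtures, chosen))
```

In this sketch the loss is low when each mixture embedding lies closest to the embedding of its own separated source and grows when the pairing is mismatched, which is the behaviour a source association objective relies on.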