Contrastive learning is an emerging branch of self-supervised learning that leverages large amounts of unlabeled data by learning a latent space in which pairs of different views of the same sample are associated. In this paper, we propose musical source association as a pair generation strategy for contrastive music representation learning. To this end, we modify COLA, a widely used contrastive learning framework for audio, so that it learns to associate a song excerpt with a stochastically selected, automatically extracted vocal or instrumental source. We further introduce a novel modification to the contrastive loss that incorporates information about the presence or absence of specific sources. Our experimental evaluation on three downstream tasks (music auto-tagging, instrument classification and music genre classification), using the publicly available Magna-Tag-A-Tune (MTAT) as the source dataset, yields results competitive with existing methods in the literature, as well as faster network convergence. The results also show that this pre-training method can be steered towards specific features according to the selected musical source, while remaining dependent on the quality of the separated sources.
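The pairing idea described in the abstract can be illustrated with a minimal sketch. It assumes that embeddings of the song excerpts (anchors) and of the selected separated sources (positives) have already been computed by some encoder, and it uses a bilinear similarity with a softmax cross-entropy over the batch, in the style of COLA-like objectives. The function names `pick_positive` and `bilinear_contrastive_loss`, the source keys, and the matrix `W` are hypothetical and chosen for illustration only. The proposed loss modification for source presence or absence is not reproduced here, since the abstract does not specify its form.

```python
import numpy as np


def pick_positive(sources, rng):
    """Stochastically pick one separated source as the positive view.

    `sources` maps a (hypothetical) source name, e.g. "vocals", to its audio
    or embedding. Returns the chosen name and the associated value.
    """
    names = sorted(sources)
    name = names[int(rng.integers(len(names)))]
    return name, sources[name]


def bilinear_contrastive_loss(anchors, positives, W):
    """Batch-wise cross-entropy over bilinear similarities.

    anchors:   (B, D) embeddings of the song excerpts
    positives: (B, D) embeddings of the selected sources
    W:         (D, D) bilinear weight (learned in practice, fixed here)
    The i-th positive is the target class for the i-th anchor; all other
    positives in the batch act as negatives.
    """
    logits = anchors @ W @ positives.T                    # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder "separated sources" for one excerpt (assumed, not real audio).
    name, _ = pick_positive({"vocals": np.zeros(8), "accompaniment": np.ones(8)}, rng)
    # Random stand-ins for encoder outputs.
    anchors = rng.normal(size=(4, 8))
    positives = rng.normal(size=(4, 8))
    print(name, bilinear_contrastive_loss(anchors, positives, np.eye(8)))
```

In this sketch the negatives come for free from the other items in the batch, so no explicit negative mining is needed; in an actual training setup `W` and the encoder would be optimized jointly.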