Linking sheet music images to audio recordings remains a key problem in the development of efficient cross-modal music retrieval systems. A fundamental approach to this task is to learn, via deep neural networks, a cross-modal embedding space that connects short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content limits the ability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether this limitation can be mitigated with self-supervised contrastive learning: as a pre-training step, we expose a network to a large amount of real music data by contrasting randomly augmented views of snippets of both modalities, namely audio and sheet music images. Through a number of experiments on synthetic and real piano data, we show that pre-trained models retrieve snippets with better precision in all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct further experiments on several retrieval configurations. In this task, we observe that the retrieval quality improves from 30% up to 100% when real music data is present. We conclude by arguing for the potential of self-supervised contrastive learning to alleviate annotated data scarcity in multi-modal music retrieval models.
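As an illustration of what contrasting randomly augmented views can look like in practice, the following is a minimal sketch of an NT-Xent-style contrastive loss. The abstract does not name the loss, so the choice of NT-Xent, the function name `nt_xent_loss`, and the temperature value are assumptions made for illustration only. The encoder and the augmentations are omitted: the inputs stand for embeddings of two augmented views of the same batch of snippets.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """NT-Xent-style loss over paired views (illustrative, not the paper's code).

    views_a[i] and views_b[i] are embeddings of two augmented views of the
    same snippet; every other embedding in the batch acts as a negative.
    """
    n = len(views_a)
    z = [_normalize(v) for v in list(views_a) + list(views_b)]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same snippet
        sims = [sum(p * q for p, q in zip(z[i], z[k])) / temperature
                for k in range(2 * n)]
        # Log-sum-exp over all samples except the anchor itself.
        others = [sims[k] for k in range(2 * n) if k != i]
        m = max(others)
        log_denom = m + math.log(sum(math.exp(s - m) for s in others))
        total += log_denom - sims[pos]
    return total / (2 * n)
```

The loss is low when the two views of each snippet lie close together in the embedding space and far from all other snippets in the batch, which is the property a pre-training step of this kind would aim for before fine-tuning on annotated audio and sheet music pairs.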