A range of applications in multi-modal music information retrieval is centred on the problem of connecting large collections of sheet music (images) to corresponding audio recordings, that is, identifying pairs of audio and score excerpts that refer to the same musical content. One typical and recent approach to this task employs cross-modal deep learning architectures to learn joint embedding spaces that link the two distinct modalities: audio and sheet music images. While there has been steady improvement on this front over the past years, a number of open problems still prevent large-scale employment of this methodology. In this article we examine current developments in audio-sheet music retrieval via deep learning methods. We first identify a set of main challenges on the road towards robust and large-scale cross-modal music retrieval in real scenarios. We then highlight the steps we have taken so far to address some of these challenges, documenting step-by-step improvement along several dimensions. We conclude by analysing the remaining challenges and presenting ideas for solving them, in order to pave the way to a unified and robust methodology for cross-modal music retrieval.