Automatic Music Transcription (AMT) is the task of recognizing the notes played in audio recordings of music. State-of-the-Art (SotA) benchmarks have been dominated by deep learning systems, which, due to the scarcity of high-quality data, are usually trained and evaluated exclusively or predominantly on classical piano music. Unfortunately, this hinders our ability to understand how they generalize to other music. Prior work has revealed several aspects of memorization and overfitting in these systems. We identify two primary sources of distribution shift: the music and the sound. Complementing recent results on the sound axis (i.e. acoustics, timbre), we investigate the musical one (i.e. note combinations, dynamics, genre). We evaluate the performance of several SotA AMT systems on two new experimental test sets, carefully constructed to emulate different levels of musical distribution shift. Our results reveal a stark performance gap, shedding further light on the Corpus Bias problem and the extent to which it continues to trouble these systems.