Automatic drum transcription is a critical tool in Music Information Retrieval for extracting and analyzing the rhythm of a music track, but it is limited by the size of the datasets available for training. A popular way to increase the amount of data is to generate it synthetically from music scores rendered with virtual instruments. This method can produce a virtually unlimited number of tracks, but empirical evidence shows that models trained on previously created synthetic datasets do not transfer well to real tracks. In this work, besides increasing the amount of data, we identify and evaluate three further strategies that practitioners can use to improve the realism of the generated data and thus narrow the synthetic-to-real transfer gap. To assess their efficacy, we used them to build a new synthetic dataset and then measured, for different datasets, how the performance of a model scales as the number of training tracks increases and, specifically, at what value it stagnates. In doing so, we show that these strategies help make our dataset the one with the most realistic data distribution and the lowest synthetic-to-real transfer gap among the synthetic datasets we evaluated. We conclude by highlighting the limits of training with infinite data in drum transcription and showing how they can be overcome.
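As an illustration of what "the value at which performance stagnates" can mean in practice, the sketch below estimates a plateau from (number of training tracks, score) pairs. It is not the fitting procedure used in this work, which the abstract does not specify. It assumes a saturating power law, score(n) = plateau - a * n^(-b), and the function name `fit_saturating_curve`, the exponent grid, and the data points are all hypothetical.

```python
def fit_saturating_curve(n_tracks, scores, exponents=None):
    """Fit score(n) = plateau - a * n**(-b) to observed points.

    Assumed model, not taken from the paper. For a fixed exponent b the
    model is linear in (plateau, a), so those two are solved in closed
    form by ordinary least squares; b itself is chosen by grid search.
    """
    if exponents is None:
        exponents = [k / 100 for k in range(5, 201)]  # b in [0.05, 2.00]
    m = len(n_tracks)
    mean_y = sum(scores) / m
    best = None
    for b in exponents:
        # Regressor that multiplies a: x = -(n ** -b)
        x = [-(n ** -b) for n in n_tracks]
        mean_x = sum(x) / m
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0:
            continue  # degenerate case: all track counts identical
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, scores))
        a = sxy / sxx
        plateau = mean_y - a * mean_x
        sse = sum((plateau + a * xi - yi) ** 2 for xi, yi in zip(x, scores))
        if best is None or sse < best[0]:
            best = (sse, plateau, a, b)
    _, plateau, a, b = best
    return plateau, a, b


# Synthetic demonstration points generated from known parameters.
# These are NOT results from the paper.
demo_n = [10, 30, 100, 300, 1000, 3000]
demo_scores = [0.8 - 0.5 * n ** -0.5 for n in demo_n]
demo_plateau, demo_a, demo_b = fit_saturating_curve(demo_n, demo_scores)
print(f"estimated plateau: {demo_plateau:.3f}")
```

Under this assumed model, comparing the estimated plateau across training datasets gives one number per dataset for the ceiling that more training tracks alone would not lift.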