Algorithms for automatic piano transcription have improved dramatically in recent years due to new datasets and modeling techniques. Recent developments have focused primarily on adapting new neural network architectures, such as the Transformer and Perceiver, to yield more accurate systems. In this work, we study transcription systems from the perspective of their training data. By measuring their performance on out-of-distribution annotated piano data, we show that these models can severely overfit to acoustic properties of the training data. We create a new set of audio for the MAESTRO dataset, captured automatically in a professional studio recording environment via Yamaha Disklavier playback. Using various data augmentation techniques when training with the original and re-performed versions of the MAESTRO dataset, we achieve a state-of-the-art note-onset F1-score of 88.4 on the MAPS dataset, without seeing any of its training data. We subsequently analyze these data augmentation techniques in a series of ablation studies to better understand their influence on the resulting models.
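For readers unfamiliar with the headline metric, the following is a minimal sketch of how a note-onset F1-score can be computed. It is an illustration only, not the evaluation code used in this work: the function name `onset_f1`, the `(onset_seconds, midi_pitch)` note representation, and the 50 ms onset tolerance are all assumptions made for the example. An estimated note counts as correct when it has the same pitch as a reference note and its onset falls within the tolerance, with each reference note matched at most once.

```python
from collections import defaultdict


def onset_f1(reference, estimated, tolerance=0.05):
    """Note-onset F1 for lists of (onset_seconds, midi_pitch) tuples.

    Assumption: the 50 ms default tolerance is a common convention,
    not a value stated in the abstract.
    """
    ref_by_pitch = defaultdict(list)
    est_by_pitch = defaultdict(list)
    for onset, pitch in reference:
        ref_by_pitch[pitch].append(onset)
    for onset, pitch in estimated:
        est_by_pitch[pitch].append(onset)

    matched = 0
    for pitch, ref_onsets in ref_by_pitch.items():
        ref_onsets = sorted(ref_onsets)
        est_onsets = sorted(est_by_pitch.get(pitch, []))
        i = j = 0
        # Two-pointer sweep over time-sorted onsets of the same pitch.
        while i < len(ref_onsets) and j < len(est_onsets):
            diff = est_onsets[j] - ref_onsets[i]
            if abs(diff) <= tolerance:
                matched += 1
                i += 1
                j += 1
            elif diff < 0:
                j += 1  # estimate is too early to match any later reference
            else:
                i += 1  # reference is too early to match any later estimate

    precision = matched / len(estimated) if estimated else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: two of three reference notes are found,
# one estimate is 200 ms late, and one estimate is spurious.
reference = [(0.0, 60), (0.5, 64), (1.0, 67)]
estimated = [(0.01, 60), (0.52, 64), (1.2, 67), (1.5, 72)]
print(round(onset_f1(reference, estimated), 3))  # 0.571 (precision 2/4, recall 2/3)
```

The function returns a value in [0, 1]; the score quoted above is on a 0 to 100 scale. Established evaluation toolkits typically use an optimal bipartite matching rather than this per-pitch sweep, so results on real data should come from such a toolkit rather than from this sketch.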