Automatic Music Transcription (AMT) is a key technology in music information processing. Although machine learning has recently improved performance, current methods typically achieve high accuracy only in domains with abundant annotated data; transcription in low-resource or zero-resource domains remains an open challenge. To address this, we propose a transcription model that requires no MIDI-audio paired data: it is pre-trained on scalable synthetic audio and then adapted through adversarial domain confusion using unannotated real audio. In our experiments, we evaluate methods under a realistic application scenario in which the training data contain no MIDI annotations for audio in the target domain. The proposed method achieves performance competitive with established baselines despite using no real paired MIDI-audio datasets. Ablation studies further provide insight into the scalability of the approach and the remaining challenges for AMT research.