Automatic Music Transcription (AMT) is a crucial technology in music information processing. Although machine learning approaches have recently improved performance, existing methods tend to achieve high accuracy only in domains with abundant annotated data, primarily because annotations are difficult to create. A practical transcription model therefore requires an architecture that does not depend on annotated data. In this paper, we propose an annotation-free transcription model that uses scalable synthetic audio for pre-training and adversarial domain confusion with unannotated real audio. Through evaluation experiments, we confirm that the proposed method achieves higher accuracy under annotation-free conditions than when learning with a mixture of annotated real audio data. Additionally, through ablation studies, we gain insights into the scalability of this approach and the challenges that lie ahead in AMT research.