Competitive music transcription models require large amounts of paired audio-score data, which is scarce due to collection costs, alignment difficulty, and copyright restrictions. Meanwhile, vast quantities of unpaired audio recordings and symbolic scores are freely available but go unused. We adopt a cycle-consistent translation framework in which a small amount of paired data serves as an anchor, allowing the model to learn from the much larger unpaired pool. We find that (i) unpaired data yields surprisingly large gains, especially under limited supervision; (ii) unpaired audio contributes more than unpaired scores; and (iii) incorporating unlabeled audio from a new instrument during training improves transcription for that instrument without any paired supervision. Together, these results suggest that scaling unpaired data offers a practical path toward high-quality transcription for instruments where labeled data remains scarce.
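To make the idea of a paired anchor plus cycle consistency concrete, the following is a minimal sketch and not the authors' implementation. It assumes a mean-squared-error objective, a hypothetical weight `lam`, and toy linear maps `f` (audio to score) and `g` (score to audio) standing in for learned translators. All function and variable names are illustrative.

```python
import numpy as np

def supervised_loss(f, audio, score):
    """Anchor term: error between the transcription of paired audio and its score."""
    return float(np.mean((f(audio) - score) ** 2))

def cycle_loss(first, second, x):
    """Error after translating x into the other domain and back again."""
    return float(np.mean((second(first(x)) - x) ** 2))

def total_loss(f, g, paired_audio, paired_score,
               unpaired_audio, unpaired_score, lam=1.0):
    """Supervised anchor on paired data plus cycle terms on unpaired data.

    `lam` is an assumed weighting between the two kinds of terms.
    """
    anchor = supervised_loss(f, paired_audio, paired_score)
    audio_cycle = cycle_loss(f, g, unpaired_audio)   # audio -> score -> audio
    score_cycle = cycle_loss(g, f, unpaired_score)   # score -> audio -> score
    return anchor + lam * (audio_cycle + score_cycle)

# Toy stand-ins for the two translators (assumption: simple scalings).
f = lambda audio: 2.0 * audio        # hypothetical audio -> score map
g = lambda score: 0.5 * score        # its exact inverse, score -> audio
g_bad = lambda score: score          # a translator that is not the inverse of f

rng = np.random.default_rng(0)
paired_audio = rng.normal(size=(3, 4))     # small paired set
paired_score = f(paired_audio)
unpaired_audio = rng.normal(size=(50, 4))  # larger unpaired pools
unpaired_score = rng.normal(size=(50, 4))

consistent = total_loss(f, g, paired_audio, paired_score,
                        unpaired_audio, unpaired_score)
inconsistent = total_loss(f, g_bad, paired_audio, paired_score,
                          unpaired_audio, unpaired_score)
```

In this toy setting the loss vanishes when `g` inverts `f` and is positive otherwise, so the unpaired pools provide a training signal even though none of their items has a label. The paired term is what ties the score side of the cycle to real scores rather than to an arbitrary invertible code.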