Piano audio-to-score transcription (A2S) is an important yet underexplored task with extensive applications in music composition, practice, and analysis. However, existing end-to-end piano A2S systems have struggled to retrieve bar-level information such as key and time signatures, and have been trained and evaluated only on synthetic data. To address these limitations, we propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that mirrors the hierarchical structure of musical scores, enabling the transcription of score information at both the bar and note levels through multi-task learning. To bridge the gap between synthetic data and recordings of human performance, we propose a two-stage training scheme: the model is first pre-trained on synthetic audio generated by an expressive performance rendering (EPR) system and then fine-tuned on recordings of human performance. To preserve the voicing structure for score reconstruction, we propose a pre-processing method for **Kern scores in scenarios with an unconstrained number of voices. Experimental results demonstrate the effectiveness of our proposed approaches, both in transcription performance on synthetic audio compared with the current state of the art and in the first experiments on human performance recordings.
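To make the hierarchical-decoder idea concrete, the following is a minimal sketch (not the authors' implementation) of a Seq2Seq model in which a bar-level decoder emits per-bar attributes such as key and time signature, and a note-level decoder, conditioned on each bar state, emits note tokens, so the two levels can be trained jointly with multi-task losses. All module names, vocabulary sizes, and dimensions are hypothetical placeholders.

```python
# Hypothetical sketch of a hierarchical Seq2Seq decoder for piano A2S.
# The bar-level GRU cell produces one state per bar (key/time heads);
# the note-level GRU unrolls within each bar, conditioned on that state.
import torch
import torch.nn as nn


class HierarchicalA2S(nn.Module):
    def __init__(self, n_mels=128, hidden=256,
                 bar_vocab=64, note_vocab=512, max_notes_per_bar=32):
        super().__init__()
        # Encoder: maps a mel spectrogram (batch, frames, n_mels) to frame states.
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.enc_proj = nn.Linear(2 * hidden, hidden)
        # Bar-level decoder: one step per bar, with multi-task heads.
        self.bar_rnn = nn.GRUCell(hidden, hidden)
        self.key_head = nn.Linear(hidden, bar_vocab)
        self.time_head = nn.Linear(hidden, bar_vocab)
        # Note-level decoder: unrolled within each bar, seeded by the bar state.
        self.note_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.note_head = nn.Linear(hidden, note_vocab)
        self.max_notes_per_bar = max_notes_per_bar

    def forward(self, mel, n_bars):
        enc_out, _ = self.encoder(mel)                   # (B, T, 2*hidden)
        context = self.enc_proj(enc_out).mean(dim=1)     # crude global audio context
        bar_state = torch.zeros_like(context)
        key_logits, time_logits, note_logits = [], [], []
        for _ in range(n_bars):
            bar_state = self.bar_rnn(context, bar_state)
            key_logits.append(self.key_head(bar_state))
            time_logits.append(self.time_head(bar_state))
            # Feed the bar state as the input at every note step of this bar.
            note_in = bar_state.unsqueeze(1).expand(-1, self.max_notes_per_bar, -1)
            note_out, _ = self.note_rnn(note_in, bar_state.unsqueeze(0))
            note_logits.append(self.note_head(note_out))
        return (torch.stack(key_logits, dim=1),          # (B, n_bars, bar_vocab)
                torch.stack(time_logits, dim=1),         # (B, n_bars, bar_vocab)
                torch.stack(note_logits, dim=1))         # (B, n_bars, notes, note_vocab)


if __name__ == "__main__":
    model = HierarchicalA2S()
    mel = torch.randn(2, 400, 128)                       # two 400-frame mel spectrograms
    key, time_sig, notes = model(mel, n_bars=8)
    print(key.shape, time_sig.shape, notes.shape)
```

Under the two-stage scheme described above, a model of this shape would first be trained on EPR-rendered synthetic audio and then fine-tuned on human performance recordings, with cross-entropy losses summed across the bar-level and note-level heads.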