Automatic music transcription (AMT), which aims to convert musical signals into musical notation, is one of the core tasks in music information retrieval. Recent work has applied high-resolution labels, i.e., the continuous onset and offset times of piano notes, as training targets, achieving substantial improvements in transcription performance. However, several issues remain: for example, the harmonics of notes are sometimes recognized as false-positive notes, and AMT models tend to grow larger in order to improve transcription performance. To address these issues, we propose an improved high-resolution piano transcription model that better captures the specific acoustic characteristics of music signals. First, we employ the Constant-Q Transform (CQT) as the input representation to better adapt to musical signals. Moreover, we design two architectures: the first is based on a convolutional recurrent neural network (CRNN) with dilated convolution, and the second is an encoder-decoder architecture that combines a CRNN with a non-autoregressive Transformer decoder. We conduct systematic experiments on our models. Compared to the high-resolution AMT system used as a baseline, our models achieve 1) consistent improvements in note-level metrics and 2) a significantly smaller model size, which sheds light on future work.
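The abstract's key preprocessing choice is the Constant-Q Transform, whose log-spaced frequency bins align with musical pitch (one or more bins per semitone), unlike the linearly spaced bins of the STFT. As an illustration only, below is a naive pure-NumPy CQT sketch with hypothetical parameter choices (fmin at piano A0 = 27.5 Hz, 88 bins at 12 bins per octave, hop of 512 samples); the paper's actual front-end configuration is not specified here, and a real system would use an efficient FFT-based implementation.

```python
import numpy as np

def naive_cqt(y, sr=16000, hop=512, fmin=27.5,
              bins_per_octave=12, n_bins=88):
    """Naive constant-Q transform: one complex kernel per bin.

    fmin = 27.5 Hz (piano A0); 88 bins at 12 bins/octave cover the
    full piano range. All parameter values here are illustrative.
    """
    # Constant Q factor: the window length shrinks as frequency grows,
    # so every bin has the same frequency resolution in semitones.
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    n_frames = 1 + len(y) // hop
    C = np.zeros((n_bins, n_frames))
    for k in range(n_bins):
        fk = fmin * 2.0 ** (k / bins_per_octave)     # bin center frequency
        N = int(round(Q * sr / fk))                  # bin-specific window length
        n = np.arange(N)
        kernel = np.hanning(N) * np.exp(-2j * np.pi * fk * n / sr) / N
        for t in range(n_frames):
            seg = y[t * hop : t * hop + N]
            if len(seg) < N:                         # zero-pad the last frames
                seg = np.pad(seg, (0, N - len(seg)))
            C[k, t] = np.abs(seg @ kernel)
    return C
```

With this bin layout, a pitch at frequency f lands at bin index `bins_per_octave * log2(f / fmin)`, so harmonic patterns have the same shape regardless of pitch, which is one reason CQT-like inputs suit convolutional transcription models.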