Note-level automatic music transcription is one of the most representative music information retrieval (MIR) tasks and has been studied for various instruments as a means of understanding music. However, because high-quality labeled data is scarce, transcription remains challenging for many instruments. Singing is particularly difficult: its expressiveness in pitch, timbre, and dynamics makes it hard to identify notes accurately. In this paper, we propose a method that detects note onsets in the singing voice more accurately by leveraging the linguistic characteristics of singing, which other instruments lack. The proposed model takes a mel-scaled spectrogram and a phonetic posteriorgram (PPG), a frame-wise likelihood of phonemes, as inputs to the onset detection network; the PPG is generated by a network pre-trained on singing and speech data. To verify how linguistic features affect onset detection, we compare evaluation results on datasets in different languages and break onsets down by type for detailed analysis. Our approach substantially improves singing transcription performance and thereby underscores the importance of linguistic features in singing analysis.
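The input construction described in the abstract (a mel-scaled spectrogram combined with a frame-wise PPG) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the two features are combined, so frame-wise concatenation is assumed here, and the function names, the feature sizes (80 mel bins, 40 phoneme classes), and the random stand-in data are all hypothetical. The pre-trained phoneme network and the onset detection network themselves are not reproduced.

```python
import math
import random


def softmax(logits):
    """Turn one frame of phoneme logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_onset_input(mel_frames, phoneme_logits):
    """Concatenate mel features and PPG frame by frame.

    mel_frames:      list of T frames, each a list of mel-bin values
    phoneme_logits:  list of T frames, each a list of per-phoneme logits
                     (stand-in for the output of a pre-trained phoneme network)
    Returns (features, ppg), where features[t] = mel_frames[t] + ppg[t].
    Concatenation is an assumption; the abstract does not specify the fusion.
    """
    if len(mel_frames) != len(phoneme_logits):
        raise ValueError("mel and PPG must have the same number of frames")
    ppg = [softmax(frame) for frame in phoneme_logits]
    features = [mel + post for mel, post in zip(mel_frames, ppg)]
    return features, ppg


# Hypothetical sizes and random data, used only to exercise the sketch.
random.seed(0)
N_FRAMES, N_MELS, N_PHONEMES = 5, 80, 40
mel_frames = [[random.gauss(0.0, 1.0) for _ in range(N_MELS)] for _ in range(N_FRAMES)]
phoneme_logits = [[random.gauss(0.0, 1.0) for _ in range(N_PHONEMES)] for _ in range(N_FRAMES)]

features, ppg = build_onset_input(mel_frames, phoneme_logits)
print(len(features), len(features[0]))
```

Each row of `ppg` is a probability distribution over phoneme classes for one frame, which is what "frame-wise likelihood of phonemes" refers to; the combined `features` would then be fed to an onset detector.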