Automatic lyrics transcription and lyrics alignment have seen significant performance improvements in the past few years. However, most previous work focuses only on English, for which large-scale datasets are available. In this paper, we address lyrics transcription and alignment of polyphonic Mandarin pop music in a low-resource setting. To deal with data scarcity, we adapt a pretrained Whisper model and fine-tune it on a monophonic Mandarin singing dataset. With data augmentation and a source separation model, the proposed method achieves a character error rate of less than 18% on a Mandarin polyphonic dataset for lyrics transcription, and a mean absolute error of 0.071 seconds for lyrics alignment. Our results demonstrate the potential of adapting a pretrained speech model for lyrics transcription and alignment in low-resource scenarios.
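The two metrics reported in the abstract can be made concrete with a minimal sketch. It assumes that the character error rate is the character-level Levenshtein distance divided by the reference length, and that the alignment error is the average absolute deviation between reference and predicted onset times. The evaluation protocol actually used (text normalization, and whether onsets are per character, word, or line) is not stated in the abstract, so the function names and example values below are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """CER = edit distance over characters / number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def mean_absolute_error(ref_onsets, pred_onsets):
    """Average absolute onset deviation, in the same unit as the inputs (seconds)."""
    if not ref_onsets or len(ref_onsets) != len(pred_onsets):
        raise ValueError("onset lists must be non-empty and of equal length")
    return sum(abs(r - p) for r, p in zip(ref_onsets, pred_onsets)) / len(ref_onsets)


# Hypothetical example: one substituted character out of three.
cer = character_error_rate("我爱你", "我爱他")
# Hypothetical onsets in seconds for two lyric units.
mae = mean_absolute_error([0.0, 1.0], [0.1, 0.9])
```

Because Mandarin text is not whitespace-delimited, an error rate over characters avoids the need for a word segmenter, which is one plausible reason for reporting CER rather than word error rate here.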