We introduce LyricWhiz, a robust, multilingual, zero-shot automatic lyrics transcription method that achieves state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach combines Whisper, a weakly supervised robust speech recognition model, with GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator that performs contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate (WER) compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset, based on MTG-Jamendo and released under a CC-BY-NC-SA license, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that the proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.
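For readers unfamiliar with the metric reported above, the following is a minimal sketch of how Word Error Rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference lyric and a transcription, divided by the number of reference words. The function name `word_error_rate` and the text normalization (lowercasing and whitespace splitting) are assumptions made for illustration; the evaluation behind the reported results may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Normalization here (lowercase + whitespace split) is an illustrative
    assumption, not necessarily what the reported experiments use.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("hello darkness my old friend",
                          "hello darkness my own friend"))
```

Because insertions count as errors, WER can exceed 1.0 when the transcription is much longer than the reference, which matters for sung audio where a recognizer may hallucinate text during instrumental passages.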