We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method that achieves state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach combines Whisper, a weakly supervised robust speech recognition model, with GPT-4, currently the most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as a strong annotator for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces the Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset, released under a CC BY-NC-SA license and built on MTG-Jamendo, and we offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.
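The abstract reports improvements in Word Error Rate (WER), the standard metric for transcription quality. As background, WER is the word-level edit distance (substitutions, insertions, deletions) between a hypothesis and a reference, normalized by the reference length. A minimal self-contained sketch of that standard definition (not code from LyricWhiz itself; real evaluations typically also normalize case and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,              # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, a single substituted word in a four-word reference yields a WER of 0.25.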