Multilingual automatic lyrics transcription (ALT) is a challenging task owing to the limited availability of labelled data and the difficulties that singing introduces relative to multilingual automatic speech recognition. Although several multilingual singing datasets have been released recently, English still dominates these collections, and multilingual ALT remains underexplored because of limited data scale and annotation quality. In this paper, we aim to build a multilingual ALT system from the available datasets. Inspired by architectures that have proven effective for English ALT, we adapt these techniques to the multilingual setting by expanding the target vocabulary. We then evaluate the multilingual model against its monolingual counterparts. Additionally, we explore various conditioning methods for incorporating language information into the model. We analyse performance by language and relate it to language classification performance. Our findings reveal that the multilingual model consistently outperforms the monolingual models trained on the individual language subsets. Furthermore, we demonstrate that incorporating language information significantly enhances performance.
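Two of the ideas summarised above, expanding the target vocabulary to cover all languages and conditioning on language identity, can be illustrated with a minimal sketch. This is not the paper's actual code: the character-level vocabulary, the special tokens, and the convention of prepending a language token to each target sequence are all assumptions, chosen as one common way such conditioning is implemented.

```python
# Hypothetical sketch: multilingual target vocabulary plus language-token
# conditioning. None of these names come from the paper itself.

def build_vocab(corpora):
    """Union of characters across all language subsets, plus special tokens.

    corpora: dict mapping a language code to a list of lyric lines.
    """
    chars = sorted({ch for lines in corpora.values()
                    for line in lines for ch in line})
    lang_tokens = [f"<{lang}>" for lang in sorted(corpora)]
    # <pad> and <blank> are assumed special tokens (e.g. for CTC training).
    vocab = ["<pad>", "<blank>"] + lang_tokens + chars
    return {tok: i for i, tok in enumerate(vocab)}

def encode(line, lang, vocab):
    """Language conditioning: prepend the language token to the character ids."""
    return [vocab[f"<{lang}>"]] + [vocab[ch] for ch in line]

corpora = {
    "en": ["hello world"],
    "es": ["hola mundo"],
    "de": ["hallo welt"],
}
vocab = build_vocab(corpora)
ids = encode("hola mundo", "es", vocab)
```

Under this sketch, a single model trained on the merged vocabulary sees every language's characters, and the leading language token gives it an explicit conditioning signal; other conditioning variants (e.g. adding a language embedding to the encoder input) follow the same idea.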