We compare phone labels and articulatory features as input for cross-lingual transfer learning in text-to-speech (TTS) for low-resource languages (LRLs). Experiments with FastSpeech 2 and the LRL West Frisian show that articulatory features outperform phone labels in both intelligibility and naturalness. For LRLs without pronunciation dictionaries, we propose two novel approaches: a) using a massively multilingual model for grapheme-to-phone (G2P) conversion in both training and synthesis, and b) using a universal phone recognizer to create a makeshift dictionary. Results show that the G2P approach performs largely on par with using a ground-truth dictionary, and that the phone recognition approach, while generally performing worse, remains a viable option for LRLs less suited to the G2P approach. Within each approach, using articulatory features as input outperforms using phone labels.
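The difference between the two input representations can be illustrated with a minimal sketch. It assumes a toy phone inventory and a hand-written binary feature set; the names `FEATURES`, `ARTICULATORY`, `SOURCE_PHONES`, and the helper functions are hypothetical and do not reproduce the feature set or model input used in the experiments. The sketch shows that a phone absent from the source-language inventory has no usable label encoding, while its articulatory feature vector still overlaps with phones seen during pretraining.

```python
# Toy illustration only: the features and inventory below are assumptions,
# not the representation used in the paper.
FEATURES = ["voiced", "nasal", "bilabial", "alveolar", "velar", "plosive", "fricative"]

# Hypothetical articulatory descriptions for a handful of IPA phones.
ARTICULATORY = {
    "p": {"bilabial", "plosive"},
    "b": {"voiced", "bilabial", "plosive"},
    "t": {"alveolar", "plosive"},
    "d": {"voiced", "alveolar", "plosive"},
    "s": {"alveolar", "fricative"},
    "z": {"voiced", "alveolar", "fricative"},
    "m": {"voiced", "nasal", "bilabial"},
    "n": {"voiced", "nasal", "alveolar"},
    "k": {"velar", "plosive"},
    "x": {"velar", "fricative"},
    "ɣ": {"voiced", "velar", "fricative"},
}

# Phones assumed to occur in the (high-resource) source language.
SOURCE_PHONES = ["p", "b", "t", "d", "s", "z", "m", "n", "k", "x"]


def one_hot(phone, inventory):
    """Phone-label input: an index into a fixed inventory (all zeros if unseen)."""
    vec = [0] * len(inventory)
    if phone in inventory:
        vec[inventory.index(phone)] = 1
    return vec


def feature_vector(phone):
    """Articulatory input: one binary value per feature in FEATURES."""
    feats = ARTICULATORY[phone]
    return [1 if f in feats else 0 for f in FEATURES]


def nearest_source_phone(phone):
    """Source phone whose feature vector has the smallest Hamming distance."""
    target = feature_vector(phone)

    def hamming(candidate):
        return sum(a != b for a, b in zip(feature_vector(candidate), target))

    return min(SOURCE_PHONES, key=hamming)


if __name__ == "__main__":
    unseen = "ɣ"  # not in SOURCE_PHONES
    print(one_hot(unseen, SOURCE_PHONES))   # no label slot exists for this phone
    print(feature_vector(unseen))           # still a meaningful feature vector
    print(nearest_source_phone(unseen))     # closest phone seen in the source data
```

In this toy setup the unseen phone differs from its nearest source phone in a single feature, which is the kind of sharing a feature-based input makes available to a pretrained model and a label-based input does not.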