We compare two approaches to transfer learning for text-to-speech (TTS) in low-resource languages: a PHOIBLE-based phone mapping method and phonological feature input. To test the language independence of the methods and broaden the applicability of the findings, we use diverse source languages (English, Finnish, Hindi, Japanese, and Russian) and target languages (Bulgarian, Georgian, Kazakh, Swahili, Urdu, and Uzbek). We evaluate with Character Error Rates (CER) from automatic speech recognition and with predicted Mean Opinion Scores (MOS). Results show that both phone mapping and feature input improve output quality, with feature input performing better, although the effects also depend on the specific language combination. We also compare the recently proposed Angular Similarity of Phone Frequencies (ASPF) with a family-tree-based distance measure as criteria for selecting source languages in transfer learning. ASPF proves effective when label-based phone input is used, whereas the language distance does not have the expected effects.
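As an illustration of the Character Error Rate used for evaluation, the following minimal sketch computes it in the standard way, as the character-level edit distance between a reference transcript and a recognizer's output, divided by the reference length. The function names are hypothetical, and any text normalization applied before scoring in the actual experiments is not specified here, so none is assumed.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over characters (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

In a TTS evaluation of this kind, `reference` would be the input text and `hypothesis` the ASR transcript of the synthesized audio; lower values suggest more intelligible output.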
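The idea behind selecting a source language by phone-frequency similarity can be sketched as follows. This is only an illustrative sketch under stated assumptions: it treats each language as a vector of phone counts and uses the common angular similarity for non-negative vectors, 1 - 2θ/π, where θ is the angle between the two vectors. The exact formulation of ASPF in the work that proposed it may differ, and the function names and the toy phone sequences in the usage note are invented for illustration.

```python
import math
from collections import Counter


def angular_similarity(freq_a: Counter, freq_b: Counter) -> float:
    """Angular similarity of two non-negative phone-count vectors, in [0, 1].

    Assumption: similarity = 1 - 2 * theta / pi, with theta the angle between
    the vectors; this may not match the published ASPF definition exactly.
    """
    phones = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(p, 0) * freq_b.get(p, 0) for p in phones)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("phone frequency vectors must not be empty")
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def rank_source_languages(target: Counter, sources: dict) -> list:
    """Sort candidate source languages from most to least similar to the target."""
    scored = [(name, angular_similarity(target, freq)) for name, freq in sources.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A usage example with made-up phone sequences (not real corpus statistics):

```python
target = Counter(["a", "a", "k", "s", "t"])
sources = {
    "source_1": Counter(["a", "a", "k", "s", "i"]),
    "source_2": Counter(["o", "u", "m", "n", "t"]),
}
ranking = rank_source_languages(target, sources)
best_source = ranking[0][0]  # the candidate whose phone distribution is closest
```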