The difficulty of acquiring abundant, high-quality data, especially in multi-lingual contexts, has sparked interest in addressing low-resource scenarios. Moreover, the current literature relies on fixed representations derived from language IDs, which results in inadequate learning of language representations and a failure to generate speech in unseen languages. To address these challenges, we propose a novel method that directly extracts linguistic features from audio input while effectively filtering out miscellaneous acoustic information, including speaker-specific attributes such as timbre. Subjective and objective evaluations affirm the effectiveness of our approach for multi-lingual text-to-speech and highlight its superiority in low-resource transfer learning for previously unseen languages.