We present ParrotTTS, a modularized text-to-speech synthesis model that leverages disentangled self-supervised speech representations. A multi-speaker variant can be trained effectively using transcripts from only a single speaker. ParrotTTS adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving speaker-specific characteristics, e.g., synthesizing fluent Hindi speech in a French speaker's voice and accent. We present extensive results in monolingual and multi-lingual scenarios. ParrotTTS outperforms state-of-the-art multi-lingual TTS models while using only a fraction of the paired data that the latter require.