Text-to-speech (TTS) systems that scale up the amount of training data have achieved significant improvements in zero-shot speech synthesis. However, these systems have notable limitations: they require large amounts of training data, which increases costs, and they often overlook prosody similarity. To address these issues, we propose MultiVerse, a zero-shot multi-task TTS system that can perform TTS or speech style transfer in zero-shot and cross-lingual conditions. MultiVerse requires far less training data than traditional data-driven approaches. To ensure zero-shot performance even with limited data, we leverage disentanglement based on source-filter theory, using the prompt to model filter-related and source-related representations. Additionally, to further enhance prosody similarity, we adopt a prosody modeling approach that combines prompt-based autoregressive and non-autoregressive methods. Evaluations demonstrate the remarkable zero-shot multi-task TTS performance of MultiVerse: it not only achieves zero-shot TTS performance comparable to data-driven TTS systems with far less data, but also significantly outperforms other zero-shot TTS systems trained on the same small amount of data. In particular, our novel prosody modeling technique contributes substantially to MultiVerse's ability to generate speech with high prosody similarity to the given prompts. Our samples are available at https://nc-ai.github.io/speech/publications/multiverse/index.html