Zero-shot text-to-speech (TTS) aims to synthesize voices from unseen speech prompts. Previous large-scale multispeaker TTS models have achieved this goal with an enrolled recording of 10 seconds or less. However, most of them are designed to use only short speech prompts, and the limited information in such prompts significantly hinders fine-grained identity imitation. In this paper, we introduce Mega-TTS 2, a generic zero-shot multispeaker TTS model capable of synthesizing speech for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a multi-reference timbre encoder to extract timbre information from multiple reference speech samples and 2) train a prosody language model (P-LLM) with arbitrary-length speech prompts. With these designs, our model is suitable for prompts of different lengths, which raises the upper bound of speech quality for zero-shot text-to-speech. Besides arbitrary-length prompts, we introduce arbitrary-source prompts, which leverage the probabilities derived from multiple P-LLM outputs to produce expressive and controlled prosody. Furthermore, we propose a phoneme-level auto-regressive duration model to introduce in-context learning capabilities to duration modeling. Experiments demonstrate that our method can not only synthesize identity-preserving speech from a short prompt of an unseen speaker but also achieve improved performance with longer speech prompts. Audio samples can be found at https://mega-tts.github.io/mega2_demo/.
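One way to picture the use of probabilities from multiple P-LLM outputs is as a weighted mixture of next-code distributions, each conditioned on a different prompt, from which the next prosody code is sampled. The sketch below is only an illustration under that assumption: the function names, the weight `gamma`, and the toy three-code distributions are hypothetical and are not taken from the Mega-TTS 2 implementation.

```python
import random


def mix_distributions(dists, weights):
    """Return the weighted average of categorical distributions.

    `dists` holds probability lists over the same set of prosody codes;
    `weights` holds one non-negative weight per distribution. The weights
    are normalized, so the result is itself a probability distribution.
    """
    if len(dists) != len(weights):
        raise ValueError("need exactly one weight per distribution")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    size = len(dists[0])
    if any(len(d) != size for d in dists):
        raise ValueError("all distributions must cover the same codes")

    mixed = [0.0] * size
    for dist, weight in zip(dists, weights):
        for i, p in enumerate(dist):
            mixed[i] += (weight / total) * p
    return mixed


def sample_code(probs, rng):
    """Draw one code index according to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical next-code probabilities produced with two different prompts.
p_a = [0.7, 0.2, 0.1]  # assumed output conditioned on prompt A
p_b = [0.1, 0.3, 0.6]  # assumed output conditioned on prompt B

gamma = 0.25  # assumed interpolation weight toward prompt B
mixed = mix_distributions([p_a, p_b], [1 - gamma, gamma])
code = sample_code(mixed, random.Random(0))
```

With `gamma = 0` the mixture reduces to the distribution from prompt A alone, and larger values shift the sampled prosody toward prompt B. In an auto-regressive setting this mixing would be repeated at every decoding step.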