Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveforms into discrete acoustic tokens and modeling these tokens with a language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt from an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone a speaker's personal speaking style. In this paper, we propose a novel zero-shot TTS model with multi-scale acoustic prompts, based on the neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme level from a style prompt consisting of multiple sentences. Following that, a VALL-E-based acoustic decoder is utilized to model the timbre from a timbre prompt at the frame level and to generate speech. The experimental results show that our proposed method outperforms the baselines in terms of naturalness and speaker similarity, and that it achieves better performance when scaled to a longer style prompt.