Zero-shot text-to-speech (TTS) synthesis aims to clone the voice of any unseen speaker without adaptation parameters. By quantizing the speech waveform into discrete acoustic tokens and modeling these tokens with a language model, recent language model-based TTS systems show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt from an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone a speaker's personal speaking style. In this paper, we propose a novel zero-shot TTS model with multi-scale acoustic prompts, built on the neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme level from a style prompt consisting of multiple sentences. A VALL-E-based acoustic decoder then models the timbre from a timbre prompt at the frame level and generates speech. Experimental results show that the proposed method outperforms the baselines in terms of naturalness and speaker similarity, and that its performance improves further when the style prompt is lengthened.
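The contrast between the two prompt scales can be illustrated with a rough sequence-length calculation. The sketch below is only a back-of-the-envelope aid under stated assumptions: the frame rate (75 frames per second), the number of codebooks (8), and the speaking rate (12 phonemes per second) are illustrative values that are not given in the text, and the function names are hypothetical.

```python
import math

# Illustrative assumptions, not values taken from the text:
FRAMES_PER_SECOND = 75      # assumed codec frame rate
NUM_CODEBOOKS = 8           # assumed number of residual codebooks per frame
PHONEMES_PER_SECOND = 12.0  # assumed rough speaking rate


def frame_level_length(seconds: float) -> int:
    """Number of codec frames for a prompt of the given duration."""
    return math.ceil(seconds * FRAMES_PER_SECOND)


def frame_level_tokens(seconds: float) -> int:
    """Total discrete acoustic tokens (frames x codebooks) for the prompt."""
    return frame_level_length(seconds) * NUM_CODEBOOKS


def phoneme_level_length(seconds: float) -> int:
    """Approximate number of phoneme-level units for the prompt."""
    return math.ceil(seconds * PHONEMES_PER_SECOND)


if __name__ == "__main__":
    for label, seconds in [("3 s timbre prompt", 3.0), ("30 s style prompt", 30.0)]:
        print(
            f"{label}: {frame_level_length(seconds)} frames, "
            f"{frame_level_tokens(seconds)} acoustic tokens, "
            f"about {phoneme_level_length(seconds)} phoneme-level units"
        )
```

Under these assumed rates, a multi-sentence style prompt is several times shorter at the phoneme level than at the frame level, which is one plausible reading of why a longer style prompt can be handled separately from the short frame-level timbre prompt.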