Discrete speech tokens, which are divided into semantic tokens and acoustic tokens, have been shown to outperform traditional mel-spectrogram acoustic features in naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models such as VALL-E and SPEAR-TTS achieve zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models can generate speech only from left to right, which makes them unsuitable for speech editing, where both the preceding and the following context are given. Furthermore, they rely on acoustic tokens, whose audio quality is limited by the performance of the audio codec model. In this study, we propose UniCATS, a unified context-aware TTS framework capable of both speech continuation and editing. UniCATS comprises two components: an acoustic model, CTX-txt2vec, and a vocoder, CTX-vec2wav. CTX-txt2vec uses contextual VQ-diffusion to predict semantic tokens from the input text, which lets it incorporate the semantic context and concatenate seamlessly with the surrounding context. CTX-vec2wav then uses contextual vocoding to convert these semantic tokens into waveforms while taking the acoustic context into account. Our experiments show that CTX-vec2wav outperforms HifiGAN and AudioLM in speech resynthesis from semantic tokens. Moreover, UniCATS achieves state-of-the-art performance in both speech continuation and editing.
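To illustrate why one context-aware model can cover both tasks, the following minimal sketch treats continuation and editing as the same infilling problem over a token sequence: editing has context on both sides of the region to generate, while continuation is the special case with no following context. The names `MASK` and `build_infill_input` and the integer token ids are hypothetical and chosen only for illustration; they are not taken from UniCATS, and the sketch does not implement VQ-diffusion or vocoding.

```python
from typing import List, Sequence, Tuple

# Hypothetical placeholder id marking positions a model would have to generate.
MASK = -1


def build_infill_input(
    preceding: Sequence[int],
    following: Sequence[int],
    num_new: int,
) -> Tuple[List[int], List[bool]]:
    """Lay out known context tokens around a region to be generated.

    Returns the token sequence (with MASK in the region to fill) and a
    parallel list of flags that is True where the token is given context.
    """
    tokens = list(preceding) + [MASK] * num_new + list(following)
    is_context = (
        [True] * len(preceding) + [False] * num_new + [True] * len(following)
    )
    return tokens, is_context


# Continuation: only preceding context (the speech prompt) is available.
cont_tokens, cont_flags = build_infill_input([11, 12, 13], [], num_new=2)

# Editing: both preceding and following context surround the new region.
edit_tokens, edit_flags = build_infill_input([11, 12], [15, 16], num_new=2)

print(cont_tokens, cont_flags)
print(edit_tokens, edit_flags)
```

A strictly left-to-right model can only fill the masked region in the first layout, because it has no way to condition on tokens that come after the region; a model that conditions on all flagged context positions can, in principle, handle both layouts.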