Discrete speech tokens, which divide into semantic tokens and acoustic tokens, have been shown to outperform traditional mel-spectrogram acoustic features in naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models can generate speech only in a left-to-right direction, which makes them unsuitable for speech editing, where both the preceding and the following context are provided. Furthermore, these models rely on acoustic tokens, whose audio quality is limited by the performance of the audio codec models that produce them. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components: an acoustic model, CTX-txt2vec, and a vocoder, CTX-vec2wav. CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, which lets it incorporate the semantic context and concatenate seamlessly with the surrounding context. CTX-vec2wav then uses contextual vocoding to convert these semantic tokens into waveforms while taking the acoustic context into account. Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing.