UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

Discrete speech tokens, divided into semantic tokens and acoustic tokens, have been shown to outperform the traditional mel-spectrogram acoustic features in naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models can generate speech only in a left-to-right direction, which makes them unsuitable for speech editing, where both the preceding and the following context are provided. Furthermore, these models rely on acoustic tokens, whose audio quality is limited by the performance of the audio codec models that produce them. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components: an acoustic model, CTX-txt2vec, and a vocoder, CTX-vec2wav. CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, which lets it incorporate the semantic context and concatenate seamlessly with the surrounding context. CTX-vec2wav then uses contextual vocoding to convert these semantic tokens into waveforms while taking the acoustic context into account. Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing.
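The difference between continuation and editing described above can be pictured as a choice of which token positions are given as context and which must be generated. The following is a minimal sketch of that idea only, not the UniCATS implementation: the function names `generation_mask`, `continuation_mask`, and `editing_mask` are hypothetical, and the sequence is treated as a plain list of positions rather than real semantic tokens.

```python
from typing import List


def generation_mask(total_len: int, gen_start: int, gen_end: int) -> List[bool]:
    """Mask over token positions: True = to be generated, False = given context.

    Hypothetical helper for illustration; not part of any UniCATS code.
    """
    if not (0 <= gen_start < gen_end <= total_len):
        raise ValueError("need 0 <= gen_start < gen_end <= total_len")
    return [gen_start <= i < gen_end for i in range(total_len)]


def continuation_mask(prompt_len: int, new_len: int) -> List[bool]:
    # Continuation: context exists only on the left (the speech prompt),
    # so the generated region runs to the end of the sequence.
    total = prompt_len + new_len
    return generation_mask(total, prompt_len, total)


def editing_mask(left_len: int, edit_len: int, right_len: int) -> List[bool]:
    # Editing: context exists on both sides of the region to be generated.
    total = left_len + edit_len + right_len
    return generation_mask(total, left_len, left_len + edit_len)


if __name__ == "__main__":
    print(continuation_mask(3, 2))
    print(editing_mask(2, 2, 1))
```

In this picture, a strictly left-to-right model can serve only the first layout, because it never conditions on positions to the right of the region being generated; a context-aware model has to handle both.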

