Text-to-speech (TTS) methods that can be trained with minimal supervision have recently attracted growing interest. These methods combine two types of discrete speech representations and decouple TTS into two sequence-to-sequence tasks. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion of existing semantic encoding methods. To address these problems, we propose three progressive methods. First, we propose Diff-LM-Speech, an autoregressive structure consisting of a language model and diffusion models, which uses a diffusion model to map semantic embeddings to mel-spectrograms and thereby achieves higher audio quality. We also introduce a prompt encoder based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Second, we propose Tetra-Diff-Speech, a non-autoregressive structure consisting of four diffusion model-based modules, including a duration diffusion model designed to achieve diverse prosodic expressions. Finally, we propose Tri-Diff-Speech, a non-autoregressive structure consisting of three diffusion model-based modules, which shows that existing semantic encoding models are not necessary and achieves the best results. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.