Text-to-speech (TTS) methods that can be trained with minimal supervision have recently attracted growing interest. These methods combine two types of discrete speech representations and decouple TTS into two sequence-to-sequence tasks. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion of existing semantic encoding methods. To address these problems, we propose three progressive methods. First, we propose Diff-LM-Speech, an autoregressive structure consisting of a language model and diffusion models, which uses a diffusion model to map the semantic embedding to the mel-spectrogram and thereby achieves higher audio quality. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Second, we propose Tetra-Diff-Speech, a non-autoregressive structure consisting of four diffusion model-based modules, which includes a duration diffusion model designed to achieve diverse prosodic expressions. Finally, we propose Tri-Diff-Speech, a non-autoregressive structure consisting of three diffusion model-based modules, which shows that existing semantic encoding models are not necessary and achieves the best results. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.