Recently, there has been growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and decoupling TTS into two sequence-to-sequence tasks. To address the high dimensionality and waveform distortion associated with discrete representations, we propose Diff-LM-Speech, which uses diffusion models to map semantic embeddings to mel-spectrograms and introduces a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation. Autoregressive language models often suffer from missing and repeated words, while non-autoregressive frameworks produce averaged expression because of their duration prediction models. To address these issues, we propose Tetra-Diff-Speech, which introduces a duration diffusion model to achieve diverse prosodic expression. Although the information content of semantic coding is expected to lie between that of text and acoustic coding, the semantic coding extracted by existing models contains substantial redundant information and suffers from dimensionality explosion. To verify that semantic coding is not necessary, we propose Tri-Diff-Speech. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.