Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic and acoustic) and using two sequence-to-sequence tasks, which enables training with minimal supervision. However, existing methods suffer from information redundancy and dimension explosion in the semantic representation, and from high-frequency waveform distortion in the discrete acoustic representation. Autoregressive frameworks exhibit typical instability and uncontrollability issues, while non-autoregressive frameworks suffer from prosodic averaging caused by duration prediction models. To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method in which all modules are built on diffusion models. The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression. Contrastive Token-Acoustic Pretraining (CTAP) is used as an intermediate semantic representation to solve the problems of information redundancy and dimension explosion in existing semantic coding methods. The mel-spectrogram is used as the acoustic representation. Both semantic and acoustic representations are predicted by continuous-variable regression tasks to solve the problem of high-frequency fine-grained waveform distortion. Experimental results show that our proposed method outperforms the baseline method. We provide audio samples on our website.