With growing demand for autonomous control and personalized speech generation, style control and transfer in Text-to-Speech (TTS) are becoming increasingly important. In this paper, we propose a new TTS system that performs style transfer with interpretability and high fidelity. First, we design a TTS system that combines a variational autoencoder (VAE) with a diffusion refiner to produce refined mel-spectrograms. Specifically, we design both a two-stage and a one-stage system to improve audio quality and style transfer performance. Second, we design a diffusion bridge of the quantized VAE to efficiently learn complex discrete style representations and further improve style transfer performance. To strengthen style transfer, we introduce ControlVAE, which improves reconstruction quality while retaining good interpretability. Experiments on the LibriTTS dataset demonstrate that our method is more effective than the baseline models.