Neural networks can now generate high-quality single-sentence speech with substantial expressiveness. However, paragraph-level speech synthesis remains challenging because it must keep acoustic features coherent while delivering varying speech styles. Moreover, training these models directly on overly long speech degrades the quality of the synthesized speech. To address these problems, we propose a high-quality and expressive paragraph speech synthesis system based on a multi-step variational autoencoder. Specifically, we employ multi-step latent variables to capture speech information at different grammatical levels and then use these features in parallel to generate the speech waveform. We also propose a three-step training method to improve the model's decoupling ability. Our model was trained on a single-speaker French audiobook corpus released at Blizzard Challenge 2023. Experimental results show that our system significantly outperforms the baseline models.
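To make the idea of using latent variables from several levels in parallel more concrete, the following is a minimal sketch of a generic variational-autoencoder building block, not the system described in this abstract. It assumes diagonal Gaussian posteriors with a standard normal prior, and it assumes two illustrative levels (word and sentence); the abstract names neither the levels nor the prior. All function names, dimensions, and frame counts are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def expand_to_frames(unit_latents, frames_per_unit):
    """Repeat each unit's latent vector for every frame that unit spans."""
    frames = []
    for z, n in zip(unit_latents, frames_per_unit):
        frames.extend([list(z) for _ in range(n)])
    return frames


def combine_levels(levels):
    """Concatenate frame-aligned latents from all levels, frame by frame."""
    n_frames = len(levels[0])
    assert all(len(level) == n_frames for level in levels)
    return [sum((level[t] for level in levels), []) for t in range(n_frames)]


# Toy usage with assumed sizes: two words (4 and 2 frames) in one sentence.
rng = random.Random(0)
word_z = [sample_latent([0.0, 0.0], [0.0, 0.0], rng) for _ in range(2)]
sent_z = [sample_latent([0.0], [0.0], rng)]

word_frames = expand_to_frames(word_z, [4, 2])   # 6 frames, 2 dims each
sent_frames = expand_to_frames(sent_z, [6])      # 6 frames, 1 dim each
decoder_input = combine_levels([word_frames, sent_frames])
```

In this sketch each level contributes its own latent vector to every frame, so a decoder would see all levels side by side rather than one after another. How the actual system aligns and conditions on its latent variables may differ.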