Neural networks can now generate high-quality, highly expressive single-sentence speech. Paragraph-level speech synthesis, however, remains challenging because it requires coherent acoustic features while delivering varying speech styles. Moreover, training these models directly on over-length speech degrades the quality of the synthesized speech. To address these problems, we propose a high-quality and expressive paragraph speech synthesis system with a multi-step variational autoencoder. Specifically, we employ multi-step latent variables to capture speech information at different grammatical levels, then use these features in parallel to generate the speech waveform. We also propose a three-step training method to improve the decoupling ability. Our model was trained on a single-speaker French audiobook corpus released at Blizzard Challenge 2023. Experimental results show that our system significantly outperforms the baseline models.