Neural networks can now generate high-quality, highly expressive speech for single sentences. Paragraph-level speech synthesis, however, remains challenging because acoustic features must stay coherent across the paragraph while each sentence still conveys its own style. Moreover, training such models directly on very long utterances degrades synthesis quality. This paper proposes a high-quality, expressive paragraph speech synthesis system based on a multi-step variational autoencoder. Our approach uses multi-step latent variables to capture speech information and predicts them from text information separately at different grammatical levels. We also propose a three-step training method to improve the decoupling process. The proposed TTS model was trained on a single-speaker French audiobook corpus released for the Blizzard Challenge 2023. Experimental results show that our system significantly outperforms the baseline models.