Neural networks can now generate high-quality single-sentence speech. However, audiobook speech synthesis remains challenging due to the intra-paragraph correlation of semantic and acoustic features as well as variable speaking styles. In this paper, we propose a highly expressive paragraph speech synthesis system with a multi-step variational autoencoder, called EP-MSTTS. EP-MSTTS is the first VITS-based paragraph speech synthesis model; it models the variable style of paragraph speech at five levels: frame, phoneme, word, sentence, and paragraph. We also propose a series of improvements to enhance the performance of this hierarchical model. In addition, we train EP-MSTTS directly on speech sliced by paragraph rather than by sentence. Experimental results on the single-speaker French audiobook corpus released at the Blizzard Challenge 2023 show that EP-MSTTS outperforms baseline models.