The expressive quality of synthesized speech for audiobooks is limited by generalized model architectures and by unbalanced style distributions in the training data. To address these issues, we propose a self-supervised style-enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis. First, a text style encoder is pre-trained on a large amount of unlabeled text-only data. Second, a VQ-VAE-based spectrogram style extractor is pre-trained in a self-supervised manner on a large amount of audio data covering complex style variations. Then, a novel architecture with two encoder-decoder paths is designed to model pronunciation and high-level style expressiveness respectively, guided by the style extractor. Both objective and subjective evaluations demonstrate that the proposed method effectively improves the naturalness and expressiveness of synthesized audiobook speech, especially in role and out-of-domain scenarios.
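As a rough illustration of the vector-quantization step that a VQ-VAE-based style extractor relies on, the following minimal NumPy sketch maps continuous style vectors to their nearest codebook entries. It is not the authors' implementation: the function name `vector_quantize`, the toy codebook, and the dimensions are all assumptions chosen for illustration, and a real model would learn the codebook jointly with an encoder and decoder.

```python
import numpy as np


def vector_quantize(z, codebook):
    """Map each row of z (N, D) to its nearest entry in codebook (K, D).

    Hypothetical helper for illustration only; not taken from the paper.
    Returns the quantized vectors (N, D) and the chosen code indices (N,).
    """
    # Squared Euclidean distance between every input vector and every code.
    diff = z[:, None, :] - codebook[None, :, :]   # shape (N, K, D)
    dist = (diff ** 2).sum(axis=-1)               # shape (N, K)
    indices = dist.argmin(axis=1)                 # nearest code per input
    quantized = codebook[indices]                 # look up the code vectors
    return quantized, indices


# Toy example: two 2-dimensional "style" vectors and a two-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2]])
quantized, indices = vector_quantize(z, codebook)
print(indices)    # index of the nearest code for each input vector
print(quantized)  # the corresponding codebook rows
```

In a full VQ-VAE, the discrete indices act as a compact style representation, and training adds codebook and commitment losses with a straight-through gradient estimator; those parts are omitted here.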