Neural speech synthesis systems hold promise for multimedia production, but they often struggle to produce expressive speech and to support seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework, which aims to enhance prosody and keep generated speech natural. The framework combines the representational power of pre-trained language models with the re-expression capabilities of variational autoencoders (VAEs). Its core component is the cross-utterance conditional VAE (CVAE), which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more closely emulating how humans produce prosody. We further propose two practical algorithms tailored to distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. CUC-VAE TTS is a direct application of the framework, designed to generate audio whose prosody is derived from the surrounding text. CUC-VAE SE, in contrast, samples from real mel spectrograms conditioned on contextual information, producing audio that closely matches the original recording and thereby enabling flexible text-based speech editing operations such as deletion, insertion, and replacement. Experimental results on the LibriTTS dataset demonstrate that the proposed models significantly improve speech synthesis and editing, producing more natural and expressive speech.
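The abstract does not state the training objective, so the following is only a minimal sketch of the generic conditional-VAE idea it alludes to: a prior over a prosody latent is predicted from context, a posterior is predicted from the observed audio, the two are tied together by a KL term, and a latent is drawn by reparameterised sampling. Every name here (`context_prior`, `kl_diag_gaussians`, `sample_latent`) and the toy two-dimensional latent are assumptions made for illustration, not the framework's actual components.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians given means and log-variances."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sample_latent(mu, logvar, rng):
    """Reparameterised sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def context_prior(context_vec):
    """Hypothetical stand-in for a learned network that maps cross-utterance
    features (here just a list of floats) to the prior's mean and log-variance."""
    mean = sum(context_vec) / len(context_vec)
    return [mean, -mean], [0.0, 0.0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed toy context embedding summarising the surrounding sentences.
    mu_p, logvar_p = context_prior([0.2, 0.4, 0.6])
    # Assumed posterior parameters, as if predicted from the reference audio.
    mu_q, logvar_q = [0.5, -0.3], [-1.0, -1.0]
    z = sample_latent(mu_q, logvar_q, rng)
    print("latent sample:", z)
    print("KL(q || p):", kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p))
```

In a setup of this kind, the posterior would be used during training and the context-conditioned prior at inference time, which is one plausible reading of how surrounding sentences could shape the prosody of the synthesised utterance.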