Cross-Utterance Conditioned VAE for Speech Generation

Speech synthesis systems powered by neural networks hold promise for multimedia production, but they frequently struggle to produce expressive speech and to support seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. CUC-VAE SE, by contrast, leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible text-based speech editing such as deletion, insertion, and replacement. Experimental results on the LibriTTS dataset demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.
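The abstract describes a conditional VAE whose latent prosody variable depends on context from surrounding utterances. The following is a minimal sketch of that general idea only, not the authors' architecture: it shows a context-conditioned diagonal Gaussian prior, a reparameterized sample, and the KL term between posterior and prior. All function names, the two-dimensional latent, and the toy prior rule are assumptions made for illustration; in a real system the prior and posterior parameters would come from trained networks.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        total += 0.5 * (lp - lq
                        + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                        - 1.0)
    return total


def toy_prior_from_context(context):
    """Hypothetical stand-in for a learned prior network.

    Maps a list of context features to the mean and log-variance of a
    2-dimensional prior. The rule below is arbitrary, chosen only so the
    prior visibly depends on the context.
    """
    mean = sum(context) / len(context)
    return [mean, -mean], [0.0, 0.0]


rng = random.Random(0)

# Placeholder for features extracted from surrounding utterances.
context = [0.2, 0.4, 0.6]
mu_p, log_var_p = toy_prior_from_context(context)

# Placeholder posterior parameters (would come from an encoder that also
# sees the target utterance's acoustics during training).
mu_q, log_var_q = [0.5, -0.1], [-1.0, -1.0]

# Training: sample from the posterior and penalize its divergence from
# the context-conditioned prior.
z_train = reparameterize(mu_q, log_var_q, rng)
kl = kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p)

# Inference: no target audio is available, so sample from the prior.
z_infer = reparameterize(mu_p, log_var_p, rng)
```

Under these assumptions, the KL term pulls the posterior toward a prior that already encodes context, so sampling from the prior at inference time can still yield context-dependent latent values.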


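Related content

Speech synthesis

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in information processing. As computing has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric synthesis, and then to hybrid synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Speech synthesis is now widely used in information announcement systems in banks and hospitals, in car navigation systems, and in automated call centers, where it has brought substantial economic benefits. In addition, with the spread of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, its applications are extending into entertainment, spoken-language teaching, and rehabilitation therapy. Speech synthesis is coming to affect many aspects of everyday life.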
