In this paper, we propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs. UTTS is a multi-speaker speech synthesizer that supports zero-shot voice cloning; it is developed from the perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre feature (identity), and content for TTS inference. We leverage recent advances in self-supervised speech representation learning as well as speech synthesis front-end techniques for system development. Specifically, we employ our recently formulated Conditional Disentangled Sequential Variational Auto-encoder (C-DSVAE) as the backbone UTTS AM, which offers well-structured content representations given unsupervised alignment (UA) as a condition during training. For UTTS inference, we utilize a lexicon to map the input text to a phoneme sequence, which is expanded to a frame-level forced alignment (FA) with a speaker-dependent duration model. We then develop an alignment mapping module that converts FA to UA. Finally, the C-DSVAE, serving as the self-supervised TTS AM, takes the predicted UA and a target speaker embedding to generate the mel spectrogram, which is ultimately converted to a waveform with a neural vocoder. We show how our method enables speech synthesis without using a paired TTS corpus in the AM development stage. Experiments demonstrate that UTTS synthesizes speech with high naturalness and intelligibility, as measured by both human and objective evaluations. Audio samples are available at our demo page: https://neurtts.github.io/utts_demo/.