In this paper, we propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs. UTTS is a multi-speaker speech synthesizer that supports zero-shot voice cloning; it is developed from the perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre feature (identity), and content for TTS inference. We leverage recent advances in self-supervised speech representation learning as well as speech synthesis front-end techniques for system development. Specifically, we employ our recently formulated Conditional Disentangled Sequential Variational Auto-encoder (C-DSVAE) as the backbone UTTS AM, which offers well-structured content representations given unsupervised alignment (UA) as a condition during training. For UTTS inference, we utilize a lexicon to map the input text to a phoneme sequence, which is expanded to a frame-level forced alignment (FA) with a speaker-dependent duration model. We then develop an alignment mapping module that converts FA to UA. Finally, the C-DSVAE, serving as the self-supervised TTS AM, takes the predicted UA and a target speaker embedding to generate the mel spectrogram, which is ultimately converted to a waveform with a neural vocoder. We show how our method enables speech synthesis without using a paired TTS corpus in the AM development stage. Experiments demonstrate that UTTS synthesizes speech with high naturalness and intelligibility, as measured by both human and objective evaluations. Audio samples are available at our demo page: https://neurtts.github.io/utts_demo/.