FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model

Neural text-to-speech (TTS) generally uses either a cascaded architecture, with a separately optimized acoustic model and vocoder, or an end-to-end architecture, in which continuous mel-spectrograms or self-extracted speech frames serve as the intermediate representation bridging the acoustic model and vocoder during joint training. Both suffer from two limitations: 1) continuous acoustic frames are hard to predict from phonemes alone; acoustic information such as duration or pitch is also needed to solve the one-to-many problem, which makes the approach hard to scale to large and noisy datasets; 2) diverse speech output is not straightforward with continuous speech features, and complex VAE- or flow-based models are often needed. In this paper, we propose FoundationTTS, a new speech synthesis system that extracts discrete speech tokens with a neural audio codec and uses an acoustic model based on large language modelling to jointly optimize linguistic and acoustic tokens. Specifically, 1) we propose a hierarchical codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN), in which a fine-grained codec first extracts continuous frame-level speech representations and a coarse-grained codec then reconstructs the continuous speech frames with fewer quantizers; 2) we jointly optimize speech tokens, linguistic tokens, and a speaker token with a large language model and autoregressively predict the discrete speech tokens. Experiments show that FoundationTTS achieves a MOS gain of +0.14 over the baseline system. In ASR customization tasks, our method achieves 7.09% and 10.35% WERR, respectively, over two strong customized ASR baselines.
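As a reading aid for the WERR figures, the following minimal sketch assumes WERR denotes relative word error rate reduction, as is common usage; the abstract does not spell out the formula. The function name `werr` and the WER values are hypothetical, chosen only for illustration, and are not results from the paper.

```python
def werr(baseline_wer: float, new_wer: float) -> float:
    """Relative word error rate reduction, in percent.

    Assumed definition: 100 * (WER_baseline - WER_new) / WER_baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), not taken from the paper.
example = werr(baseline_wer=20.0, new_wer=18.0)
print(f"WERR = {example:.2f}%")
```

Under this definition, a WERR of 7.09% would mean the customized ASR system's WER dropped by 7.09% of its baseline value, not by 7.09 absolute points.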

