We propose a novel autoregressive modeling approach for speech synthesis, combining a variational autoencoder (VAE) with a multi-modal latent space and an autoregressive model that uses Gaussian Mixture Models (GMM) as the conditional probability distribution. Unlike previous methods that rely on residual vector quantization, our model leverages continuous speech representations from the VAE's latent space, greatly simplifying the training and inference pipelines. We also introduce a stochastic monotonic alignment mechanism that enforces strictly monotonic alignment. Our approach significantly outperforms the state-of-the-art autoregressive model VALL-E in both subjective and objective evaluations, achieving these results with only 10.3\% of VALL-E's parameters. This demonstrates the potential of continuous speech language models as a more efficient alternative to existing quantization-based speech language models. Sample audio can be found at https://tinyurl.com/gmm-lm-tts.
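To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of a GMM output head for an autoregressive model over continuous latent frames: the decoder state at each step parameterizes a mixture of diagonal-covariance Gaussians over the next VAE latent vector, and training minimizes the mixture negative log-likelihood. The module name, hidden size, latent dimension, and number of mixture components are illustrative assumptions.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GMMHead(nn.Module):
    """Predicts a diagonal-covariance Gaussian mixture over the next continuous latent frame."""

    def __init__(self, hidden_dim: int, latent_dim: int, num_mix: int = 8):
        super().__init__()
        self.latent_dim = latent_dim
        self.num_mix = num_mix
        # Per mixture component: one weight logit, a mean vector, and a log-std vector.
        self.proj = nn.Linear(hidden_dim, num_mix * (1 + 2 * latent_dim))

    def forward(self, h: torch.Tensor):
        # h: (batch, time, hidden_dim) autoregressive decoder states.
        out = self.proj(h)
        logits, means, log_stds = torch.split(
            out,
            [self.num_mix, self.num_mix * self.latent_dim, self.num_mix * self.latent_dim],
            dim=-1,
        )
        means = means.view(*h.shape[:-1], self.num_mix, self.latent_dim)
        log_stds = log_stds.view(*h.shape[:-1], self.num_mix, self.latent_dim)
        return logits, means, log_stds

    def nll(self, h: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # target: (batch, time, latent_dim) ground-truth continuous VAE latent frames.
        logits, means, log_stds = self.forward(h)
        target = target.unsqueeze(-2)  # broadcast over mixture components
        # Per-component diagonal Gaussian log-likelihood, summed over latent dimensions.
        comp_ll = -0.5 * (
            ((target - means) / log_stds.exp()) ** 2
            + 2 * log_stds
            + math.log(2 * math.pi)
        ).sum(-1)
        # Mix components via log-sum-exp with the mixture weights, then average the NLL.
        log_weights = F.log_softmax(logits, dim=-1)
        return -torch.logsumexp(log_weights + comp_ll, dim=-1).mean()
```

At inference, one would sample a component from the predicted mixture weights and then draw the next latent frame from the corresponding Gaussian, feeding it back autoregressively before decoding speech with the VAE decoder.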