We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture for neural text-to-speech (NTTS). It uses a split vector quantizer and is an enhancement to the well-known Variational Autoencoder (VAE) and Vector Quantized Variational Autoencoder (VQ-VAE) architectures. Compared to these previous architectures, our proposed model retains the benefits of an utterance-level bottleneck while keeping significant representation power and a discretized latent space small enough for efficient prediction from text. We train the model on recordings in the expressive task-oriented dialogue domain and show that SVQ-VAE achieves a statistically significant improvement in naturalness over the VAE and VQ-VAE models. Furthermore, we demonstrate that the SVQ-VAE latent acoustic space is predictable from text, reducing the gap between standard constant-vector synthesis and vocoded recordings by 32%.
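To make the idea of a split vector quantizer concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the utterance-level latent is cut into equal-sized chunks and that each chunk is snapped to its nearest entry in its own small codebook. The function names (`nearest_index`, `split_quantize`), the use of one codebook per split, the squared Euclidean distance, and the toy values are all illustrative assumptions; the abstract does not specify these details.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_index(chunk: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to chunk (squared L2)."""
    def sq_dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(chunk, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def split_quantize(
    latent: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Split latent into len(codebooks) equal chunks and quantize each independently.

    Assumption: one codebook per split; the paper may organize codebooks differently.
    """
    n_splits = len(codebooks)
    if len(latent) % n_splits != 0:
        raise ValueError("latent dimension must be divisible by the number of splits")
    chunk_dim = len(latent) // n_splits
    indices: List[int] = []
    quantized: List[float] = []
    for s, codebook in enumerate(codebooks):
        chunk = latent[s * chunk_dim:(s + 1) * chunk_dim]
        idx = nearest_index(chunk, codebook)
        indices.append(idx)              # discrete code, the target a text model would predict
        quantized.extend(codebook[idx])  # quantized chunk passed on to the decoder
    return indices, quantized


# Hypothetical toy setup: a 4-dimensional latent, 2 splits, 2 codes per split.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[-1.0, 0.0], [0.0, 2.0]],
]
indices, quantized = split_quantize([0.9, 0.8, -0.2, 1.7], codebooks)
print(indices, quantized)
```

Under these assumptions, S splits with K codes each can represent K^S distinct quantized latents while a text-side predictor only has to choose among K classes per split. This is one way to read the abstract's claim of keeping representation power with a small discretized space.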