Non-autoregressive (non-AR) models for speech synthesis, such as FastSpeech 2 and diffusion models, have recently attracted considerable interest. Unlike AR models, they have no autoregressive dependencies among outputs, which makes inference efficient. This paper adds another member to the family of non-AR models: energy-based models (EBMs). It describes how noise contrastive estimation, which relies on comparing positive and negative samples, can be used to train EBMs, and proposes several strategies for generating effective negative samples, including the use of high-performing AR models. It also describes how to sample from EBMs using Langevin Markov chain Monte Carlo (MCMC). The use of Langevin MCMC makes it possible to draw connections between EBMs and the currently popular diffusion models. Experiments on the LJSpeech dataset show that the proposed approach offers improvements over Tacotron 2.
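The two ingredients named in the abstract, noise contrastive estimation for training and Langevin MCMC for sampling, can be illustrated with a minimal sketch. This is not the paper's implementation: the quadratic energy E(x) = 0.5 * ||x - mu||^2, the function names `energy_grad`, `langevin_sample`, and `nce_loss`, and all hyperparameters are assumptions chosen so the example stays small and checkable. A real speech EBM would use a neural network energy conditioned on text, and the paper's exact NCE objective may differ from the standard binary form used here.

```python
import numpy as np


def energy_grad(x, mu):
    # Gradient of the toy energy E(x) = 0.5 * ||x - mu||^2.
    # Assumption for illustration only; a real EBM would backpropagate
    # through a neural network to obtain this gradient.
    return x - mu


def langevin_sample(mu, n_chains=2000, n_steps=500, step_size=0.01, seed=0):
    # Unadjusted Langevin dynamics:
    #   x <- x - step_size * grad E(x) + sqrt(2 * step_size) * noise
    # For the toy energy above, the chains approximately target N(mu, I).
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_chains, mu.shape[0]))
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step_size * energy_grad(x, mu) + np.sqrt(2.0 * step_size) * noise
    return x


def nce_loss(energy_pos, energy_neg, log_q_pos, log_q_neg):
    # Standard binary NCE with equal numbers of positive (data) and
    # negative (noise) samples. The classifier logit is
    #   -E(x) - log q(x),
    # where q is the noise distribution that produced the negatives.
    logit_pos = -energy_pos - log_q_pos
    logit_neg = -energy_neg - log_q_neg
    # np.logaddexp(0, -z) equals -log(sigmoid(z)), computed stably.
    loss_pos = np.mean(np.logaddexp(0.0, -logit_pos))
    loss_neg = np.mean(np.logaddexp(0.0, logit_neg))
    return loss_pos + loss_neg


if __name__ == "__main__":
    mu = np.array([3.0, -1.0])
    samples = langevin_sample(mu)
    print("sample mean:", samples.mean(axis=0))
    print("sample std: ", samples.std(axis=0))
```

The NCE loss falls when positives receive low energy and negatives receive high energy, which is the comparison between positive and negative samples that the abstract refers to. The Langevin update follows the negative energy gradient while injecting Gaussian noise at each step, and this iterative noisy refinement is the structural similarity to diffusion-model sampling.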