Recently, normalizing flows have been gaining traction in text-to-speech (TTS) and voice conversion (VC) due to their state-of-the-art (SOTA) performance. Normalizing flows are unsupervised generative models. In this paper, we introduce supervision to the training process of normalizing flows without the need for parallel data. We call this training paradigm AutoEncoder Normalizing Flow (AE-Flow). It adds a reconstruction loss that forces the model to use information from the conditioning to reconstruct an audio sample. Our goal is to understand the impact of each component and to find the right combination of the negative log-likelihood (NLL) and the reconstruction loss when training normalizing flows with coupling blocks. To that end, we compare a flow-based mapping model trained with: (i) the NLL loss only, (ii) both the NLL and reconstruction losses, and (iii) the reconstruction loss only. Additionally, we compare our model with a SOTA VC baseline. The models are evaluated in terms of naturalness, speaker similarity, and intelligibility in many-to-many and many-to-any VC settings. The results show that the proposed training paradigm systematically improves speaker similarity and naturalness compared to regular training methods of normalizing flows. Furthermore, we show that our method improves speaker similarity and intelligibility over the state of the art.
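The three training regimes compared above can be illustrated with a toy sketch. It is not the implementation used in this work: the single affine coupling block, the hand-written `coupling_params` function (a stand-in for a learned network), the L1 reconstruction metric, the weights `w_nll` and `w_rec`, and the choice to decode from the prior mode (an all-zero latent) so that the reconstruction depends on the conditioning are all assumptions made for illustration.

```python
import math

def coupling_params(xa, cond):
    # Hypothetical stand-in for a learned network: the log-scale s and shift t
    # applied to the second half depend on the first half and the conditioning.
    s = [math.tanh(0.5 * a + 0.1 * c) for a, c in zip(xa, cond)]
    t = [0.3 * a + c for a, c in zip(xa, cond)]
    return s, t

def forward(x, cond):
    """Map data x to latent z; also return log|det J| of the mapping."""
    h = len(x) // 2
    xa, xb = x[:h], x[h:]
    s, t = coupling_params(xa, cond)
    zb = [b * math.exp(si) + ti for b, si, ti in zip(xb, s, t)]
    return xa + zb, sum(s)

def inverse(z, cond):
    """Map latent z back to data space (exact inverse of forward)."""
    h = len(z) // 2
    za, zb = z[:h], z[h:]
    s, t = coupling_params(za, cond)
    xb = [(b - ti) * math.exp(-si) for b, si, ti in zip(zb, s, t)]
    return za + xb

def nll_loss(x, cond):
    # Negative log-likelihood under a standard normal prior on z.
    z, log_det = forward(x, cond)
    log_pz = -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2 * math.pi)
    return -(log_pz + log_det)

def reconstruction_loss(x, cond):
    # Assumption: decode from the prior mode (all zeros), so the output is
    # determined by the conditioning alone, then compare with x using L1.
    x_hat = inverse([0.0] * len(x), cond)
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def total_loss(x, cond, w_nll, w_rec):
    # (i) w_nll=1, w_rec=0   (ii) both non-zero   (iii) w_nll=0, w_rec=1
    return w_nll * nll_loss(x, cond) + w_rec * reconstruction_loss(x, cond)

x = [0.5, -1.0, 2.0, 0.1]   # toy "audio features" (made-up values)
cond = [0.2, -0.3]          # toy conditioning vector (made-up values)
losses = {
    "nll_only": total_loss(x, cond, 1.0, 0.0),
    "nll_plus_rec": total_loss(x, cond, 1.0, 1.0),
    "rec_only": total_loss(x, cond, 0.0, 1.0),
}
```

In this sketch the coupling block stays exactly invertible regardless of how the two loss terms are weighted; only the training objective changes between the three regimes.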