In this paper, we propose VoiceGrad, a non-parallel any-to-many voice conversion (VC) method. Inspired by WaveGrad, a recently introduced waveform generation method, VoiceGrad is based on the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator: a fully convolutional network with a U-Net structure that predicts the gradient of the log density of the speech feature sequences of multiple speakers. VC is then performed with annealed Langevin dynamics, which uses the trained score approximator to iteratively update an input feature sequence towards the nearest stationary point of the target distribution. By the nature of this formulation, VoiceGrad enables any-to-many VC, a scenario in which the speaker of the input speech can be arbitrary. It also allows for non-parallel training, which requires no parallel utterances or transcriptions.
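To make the iterative update concrete, the following is a minimal sketch of generic annealed Langevin dynamics, not the VoiceGrad implementation. The abstract does not give the exact update rule, noise schedule, or network, so everything here is an assumption for illustration: `toy_score` is a hypothetical closed-form stand-in for the trained U-Net score approximator, `SPEAKER_MEANS` holds made-up two-dimensional "speaker" distributions, and the step-size rule `eps * (sigma / sigmas[-1]) ** 2` follows the form commonly used with annealed Langevin sampling.

```python
import numpy as np

# Hypothetical stand-in for a trained score approximator: each target speaker
# is modelled as an isotropic Gaussian over a 2-dim feature (toy values only).
SPEAKER_MEANS = np.array([[-2.0, 0.0], [3.0, 1.0]])  # (num_speakers, feature_dim)

def toy_score(x, sigma, speaker):
    """Gradient w.r.t. x of log N(x; mean_speaker, (1 + sigma^2) I)."""
    return (SPEAKER_MEANS[speaker] - x) / (1.0 + sigma ** 2)

def annealed_langevin(x0, score_fn, speaker, sigmas,
                      steps_per_level=50, eps=0.01, seed=0):
    """Run Langevin updates at each noise level, from largest to smallest."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for sigma in sigmas:
        # Step size shrinks together with the noise level (assumed schedule).
        alpha = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * alpha * score_fn(x, sigma, speaker) + np.sqrt(alpha) * z
    return x

# Usage: "convert" 640 frames of an arbitrary input towards speaker index 1.
sigmas = np.geomspace(1.0, 0.1, num=10)  # decreasing noise levels
source = np.zeros((640, 2))              # input feature sequence (toy)
converted = annealed_langevin(source, toy_score, speaker=1, sigmas=sigmas)
```

In this toy setting the input frames drift towards the region of high density for the chosen target speaker, regardless of where they start. That mirrors, in a very simplified form, why the input speaker can be arbitrary: only the score of the target distribution enters the update.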