Voice conversion is typically treated as an engineering problem in which training data are limited. Deep learning approaches, which have been studied extensively in recent years, rely on large amounts of data, and this limits their practical applicability. Statistical methods, on the other hand, are effective with limited data but have difficulty modelling complex mapping functions. This paper proposes a voice conversion method that works with limited data and is based on stochastic variational deep kernel learning (SVDKL). SVDKL combines the expressive capability of deep neural networks with the flexibility of the Gaussian process as a Bayesian, non-parametric method. Combining a conventional kernel with a deep neural network makes it possible to estimate non-smooth and more complex functions. Furthermore, the sparse variational Gaussian process in the model addresses the scalability problem and, unlike the exact Gaussian process, allows a global mapping function to be learned for the entire acoustic space. A key aspect of the proposed scheme is that the model parameters are trained by marginal likelihood optimization, which accounts for both data fit and model complexity. Accounting for model complexity increases resistance to overfitting and thereby reduces the amount of training data required. To evaluate the proposed scheme, we examined the model's performance with approximately 80 seconds of training data. The results indicate that our method obtained a higher mean opinion score, smaller spectral distortion, and better preference test results than the compared methods.
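To make the trade-off between data fit and model complexity concrete, the following is a minimal sketch of a deep kernel and the Gaussian process log marginal likelihood it induces. It is not the paper's implementation: it uses the exact Gaussian process marginal likelihood rather than the sparse variational bound that SVDKL optimizes, and the tiny tanh network, the RBF base kernel, the noise variance, and all function names are assumptions chosen for illustration.

```python
import math

def feature_map(x, weights):
    """Hypothetical one-layer network g(x; w): tanh of an affine map.
    Each row of `weights` holds the input weights followed by a bias."""
    return [math.tanh(sum(w * xi for w, xi in zip(row[:-1], x)) + row[-1])
            for row in weights]

def rbf(u, v, lengthscale=1.0):
    """Conventional RBF base kernel applied to two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)

def deep_kernel_matrix(X, weights, noise=0.1):
    """Deep kernel k(x, x') = rbf(g(x), g(x')), plus noise variance on the diagonal."""
    feats = [feature_map(x, weights) for x in X]
    n = len(X)
    return [[rbf(feats[i], feats[j]) + (noise if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def log_marginal_likelihood_terms(X, y, weights, noise=0.1):
    """Return (data_fit, complexity, constant); their sum is log p(y | X)."""
    K = deep_kernel_matrix(X, weights, noise)
    L = cholesky(K)
    n = len(y)
    # Forward substitution solves L z = y, so that z^T z = y^T K^{-1} y.
    z = []
    for i in range(n):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    data_fit = -0.5 * sum(v * v for v in z)                   # -1/2 y^T K^{-1} y
    complexity = -sum(math.log(L[i][i]) for i in range(n))    # -1/2 log|K|
    constant = -0.5 * n * math.log(2.0 * math.pi)
    return data_fit, complexity, constant

# Toy usage with made-up data: three scalar inputs, two hidden units.
X = [[0.0], [0.5], [1.0]]
y = [0.1, 0.4, 0.9]
weights = [[1.0, 0.0], [-0.5, 0.2]]
data_fit, complexity, constant = log_marginal_likelihood_terms(X, y, weights)
log_evidence = data_fit + complexity + constant
```

The first term rewards kernels that explain the targets, while the second penalizes the log-determinant of the kernel matrix. Maximizing their sum with respect to the network weights and kernel hyperparameters is what allows the fit to be balanced against complexity without a separate validation set.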