Voice conversion (VC) models have demonstrated impressive few-shot conversion quality on the clean, native speech populations they are trained on. However, when the accent, background noise conditions, or microphone characteristics of the source or target speech differ from those seen in training, conversion quality is not guaranteed. These problems are often left unexamined in VC research, which frustrates users who try to apply pretrained VC models to their own data. We are interested in accent-preserving voice conversion for name pronunciation from self-recorded examples, a domain in which all three of these conditions are present, and we posit that higher performance in this domain correlates with VC models that are more usable by otherwise frustrated users. We demonstrate that existing SOTA encoder-decoder VC models can be made robust to these variations and endowed with natural denoising capabilities by using more diverse data and simple data augmentation techniques in pretraining.
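The abstract does not say which augmentations are used. As an illustration only, the sketch below shows one common choice, mixing noise into clean speech at a randomly drawn signal-to-noise ratio (SNR). The function names (`rms`, `mix_at_snr`, `augment`), the white Gaussian noise source, and the 5 to 30 dB range are all assumptions made for this example and are not taken from the paper.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mix has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        # Silent noise cannot be scaled; return the speech unchanged.
        return list(speech)
    # Target noise RMS follows from SNR_dB = 20 * log10(rms_speech / rms_noise).
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(speech, rng, snr_range_db=(5.0, 30.0)):
    """Hypothetical augmentation: white Gaussian noise at a random SNR.

    The SNR range is an illustrative assumption, not a value from the paper.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in speech]
    snr_db = rng.uniform(*snr_range_db)
    return mix_at_snr(speech, noise, snr_db), snr_db


if __name__ == "__main__":
    rng = random.Random(0)
    # A 440 Hz tone sampled at 16 kHz stands in for a clean utterance.
    clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    noisy, snr_db = augment(clean, rng)
    print(f"mixed at {snr_db:.1f} dB SNR, {len(noisy)} samples")
```

In a pretraining pipeline, a function like `augment` would typically be applied on the fly to each training utterance, with recorded background noise in place of the synthetic Gaussian noise used here.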