Recent progress in self-supervised representation learning has opened up new opportunities for training from unlabeled data and is a growing trend in voice conversion. However, unsupervised training of voice cloning systems remains a challenging task. In this paper we propose a semi-supervised zero-shot voice cloning approach that adapts a HuBERT-based voice conversion system to the voice cloning task. We show that such a system is robust to noise both in the training data (we add noise that lowers the signal-to-noise ratio to as little as 0 dB in 35% of the training data, with no significant degradation of evaluation metrics) and in the target speaker reference audio at inference. Moreover, the method requires no denoising or noise labeling of the training data. Finally, we introduce a novel multi-task approach that incorporates a self-supervised DINO loss into the joint training of a CAM++-based speaker verification system and a unit-based VITS cloning system. We show that it significantly improves the quality of generated audio over baselines, especially for noisy target speaker references.
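The noise condition described above (a fixed fraction of training utterances mixed with noise down to 0 dB SNR) can be illustrated with a minimal sketch. This is not the authors' data pipeline: the function names `mix_at_snr` and `corrupt_fraction`, the uniform sampling of the SNR, and the 20 dB upper bound are all assumptions made for illustration. Only the 35% fraction and the 0 dB lower bound come from the abstract.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add noise to a clean waveform, scaled to reach the requested SNR in dB."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Tile and trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    if p_clean == 0.0 or p_noise == 0.0:
        return clean.copy()  # SNR is undefined for silent inputs
    # Solve snr_db = 10 * log10(p_clean / (gain**2 * p_noise)) for gain.
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


def corrupt_fraction(utterances, noises, fraction=0.35,
                     min_snr_db=0.0, max_snr_db=20.0, seed=0):
    """Mix noise into a random `fraction` of utterances; leave the rest untouched.

    The SNR range and its uniform sampling are assumptions, not taken
    from the paper.
    """
    rng = np.random.default_rng(seed)
    n = len(utterances)
    n_noisy = int(round(fraction * n))
    noisy_idx = set(rng.choice(n, size=n_noisy, replace=False).tolist())
    out = []
    for i, utt in enumerate(utterances):
        if i in noisy_idx:
            noise = noises[rng.integers(len(noises))]
            snr_db = rng.uniform(min_snr_db, max_snr_db)
            out.append(mix_at_snr(utt, noise, snr_db))
        else:
            out.append(np.asarray(utt, dtype=np.float64))
    return out, sorted(noisy_idx)
```

At 0 dB the scaled noise carries the same average power as the speech, which is why this setting is a demanding test of robustness. Note that no record of which utterances were corrupted needs to reach the model; the returned index list is only for inspection.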