Voice cloning is a prominent feature of personalized speech interfaces. A neural voice cloning system can mimic a person's voice from only a few audio samples. Two approaches are studied in voice cloning research: speaker adaptation and speaker encoding. Speaker adaptation fine-tunes a multi-speaker generative model on the new speaker's samples, whereas speaker encoding trains a separate model to infer a new speaker embedding directly, which is then supplied to the multi-speaker generative model. Both approaches can perform well in terms of the naturalness of the speech and its similarity to the original speaker, even with a small number of cloning samples. Speaker encoding is better suited to low-resource deployment because it requires significantly less memory and clones faster than speaker adaptation, while speaker adaptation can offer slightly higher naturalness and similarity. The main goal of this work is to build a voice cloning system whose output speech has a Nepali accent or sounds like Nepali. Transfer learning was used to address several issues encountered while developing the system, including poor audio quality and the scarcity of available data, with a view to further advancing text-to-speech (TTS).