This research presents a few-shot voice cloning system for Nepali speakers that synthesizes speech in a target speaker's voice from Devanagari text, using only a small amount of that speaker's audio. Voice cloning for Nepali remains largely unexplored because the language is low-resource. To address this, we constructed two separate datasets: untranscribed audio for training a speaker encoder, and paired text-audio data for training a Tacotron2-based synthesizer. The speaker encoder, optimized with the Generalized End-to-End (GE2E) loss, produces embeddings that capture a speaker's vocal identity; we validate these embeddings by projecting them to low dimensions with Uniform Manifold Approximation and Projection (UMAP) and inspecting the resulting visualizations. The embeddings are fused with Tacotron2's text embeddings to produce mel-spectrograms, which a WaveRNN vocoder then converts into audio. Audio data were collected from various sources, including self-recordings, and were preprocessed to ensure audio quality and text-audio alignment. The synthesizer was trained with mel and gate loss functions under multiple hyperparameter settings. The system clones speaker characteristics even for speakers not seen during training, demonstrating the feasibility of few-shot voice cloning for the Nepali language and establishing a foundation for personalized speech synthesis in low-resource scenarios.
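To make the speaker encoder objective concrete, the following is a minimal sketch of the softmax variant of the GE2E loss in plain Python. It is an illustration, not the implementation used in this work. The function names, the toy two-dimensional embeddings, and the fixed scale and offset (`w=10.0`, `b=-5.0`) are assumptions; in practice `w` and `b` are learned together with the encoder. The sketch also averages over utterances rather than summing.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def ge2e_softmax_loss(batch, w=10.0, b=-5.0):
    """Average GE2E softmax loss over a batch.

    batch[j][i] is the embedding of utterance i from speaker j.
    Each speaker must contribute at least two utterances.
    w and b are fixed here (assumption); they are normally trainable.
    """
    centroids = [mean_vector(utts) for utts in batch]
    total, count = 0.0, 0
    for j, utts in enumerate(batch):
        for i, emb in enumerate(utts):
            # For the true speaker, leave the utterance itself out of
            # the centroid so it cannot trivially match its own mean.
            own_centroid = mean_vector(utts[:i] + utts[i + 1:])
            sims = []
            for k, centroid in enumerate(centroids):
                target = own_centroid if k == j else centroid
                sims.append(w * cosine(emb, target) + b)
            # Softmax loss: -S_jj + log(sum_k exp(S_jk)), computed stably.
            m = max(sims)
            log_sum_exp = m + math.log(sum(math.exp(s - m) for s in sims))
            total += log_sum_exp - sims[j]
            count += 1
    return total / count


# Toy batches: 2 speakers x 3 utterances, 2-dimensional embeddings.
separated = [
    [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]],
    [[0.0, 1.0], [0.1, 0.9], [0.1, 1.0]],
]
mixed = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]],
    [[0.0, 1.0], [1.0, 0.0], [0.1, 1.0]],
]
loss_separated = ge2e_softmax_loss(separated)
loss_mixed = ge2e_softmax_loss(mixed)
print(loss_separated, loss_mixed)
```

The loss is small when each speaker's embeddings cluster tightly around their own centroid and away from other speakers' centroids, which is the same structure the UMAP visualizations are meant to reveal.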