In this paper, we describe the text-to-speech (TTS) models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we use RAD-MMM to perform few-shot TTS by additionally training on 5 minutes of target-speaker data. In Track 3, we use P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.