This paper presents the development of a speech synthesis system for the LIMMITS'24 Challenge, focusing primarily on Track 2. The challenge aims to establish a multi-speaker, multi-lingual Indic Text-to-Speech system with voice cloning capabilities, covering seven Indian languages with both male and female speakers. The system was trained on the challenge data and fine-tuned for few-shot voice cloning on target speakers. Evaluation covered both mono-lingual and cross-lingual synthesis across all seven languages, with subjective tests assessing naturalness and speaker similarity. Our system uses the VITS2 architecture, augmented with a multi-lingual ID and a BERT model to enhance contextual language comprehension. In Track 1, where no additional data was permitted, our model achieved a Speaker Similarity score of 4.02. In Track 2, which allowed the use of extra data, it attained a Speaker Similarity score of 4.17.