Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced considerably for English; however, progress for other languages still lags behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore whether pre-defined dialect labels improve the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune the XTTS\footnote{https://docs.coqui.ai/en/latest/models/xtts.html}\footnote{https://medium.com/machine-learns/xtts-v2-new-version-of-the-open-source-text-to-speech-model-af73914db81f}\footnote{https://medium.com/@erogol/xtts-v1-techincal-notes-eb83ff05bdc} model, an open-source architecture. We then evaluate our models on a dataset comprising 31 unseen speakers and on an in-house dialectal dataset. Our automated and human evaluations show convincing performance, and the models are capable of generating dialectal speech. Our study highlights significant potential for improvement in this emerging area of Arabic speech research.