Arabic Text-to-Speech (TTS) research has been hindered by the scarcity of publicly available training data and the lack of accurate Arabic diacritization models. In this paper, we address this limitation by exploring Arabic TTS training on large amounts of automatically annotated data. Specifically, we built a robust pipeline for collecting Arabic recordings and processing them automatically with voice activity detection, speech recognition, automatic diacritization, and noise filtering, yielding around 4,000 hours of Arabic TTS training data. We then trained several robust TTS models with voice cloning on varying amounts of data, namely 100, 1,000, and 4,000 hours, with and without diacritization. We show that although models trained on diacritized data are generally better, larger amounts of training data compensate for the lack of diacritics to a significant degree. We plan to release a public Arabic TTS model that works without the need for diacritization.
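The data-collection pipeline described above chains four stages: voice activity detection to segment recordings, speech recognition to transcribe them, automatic diacritization of the transcripts, and noise filtering to discard low-quality segments. The following is a minimal sketch of such a stage chain; all function names, data fields, and thresholds here are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a VAD -> ASR -> diacritization -> noise-filtering
# pipeline. The Segment fields, the min_dur and min_snr_db thresholds, and
# the asr/diacritize callables are placeholders for whatever components
# (e.g. an ASR model and a diacritizer) a real system would plug in.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    audio_id: str
    start: float          # segment start time in seconds (from VAD)
    end: float            # segment end time in seconds (from VAD)
    snr_db: float = 0.0   # estimated signal-to-noise ratio for noise filtering
    text: str = ""        # transcript, filled in by ASR + diacritization


def run_pipeline(
    segments: List[Segment],
    asr: Callable[[Segment], str],
    diacritize: Callable[[str], str],
    min_dur: float = 1.0,     # drop clips too short to be useful TTS targets
    min_snr_db: float = 15.0, # drop noisy clips
) -> List[Segment]:
    """Keep only sufficiently long, clean segments; transcribe and diacritize them."""
    kept: List[Segment] = []
    for seg in segments:
        if seg.end - seg.start < min_dur:
            continue  # too short after VAD segmentation
        if seg.snr_db < min_snr_db:
            continue  # noise filtering
        seg.text = diacritize(asr(seg))  # ASR, then automatic diacritization
        kept.append(seg)
    return kept
```

In a real system each callable would wrap a model inference call, and the noise filter might also use ASR confidence scores rather than SNR alone; the sketch only shows the stage ordering.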