Despite advances in neural text-to-speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects. This leaves Egyptian Arabic, the most widely understood Arabic dialect, severely under-resourced. We address this gap by introducing NileTTS, a dataset of 38 hours of transcribed speech from two speakers across diverse domains, including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLMs) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate it against the baseline model trained on other Arabic dialects. Our contributions include: (1) the first publicly available Egyptian Arabic TTS dataset, (2) a reproducible synthetic data generation pipeline for dialectal TTS, and (3) an open-source fine-tuned model. All resources are released to advance Egyptian Arabic speech synthesis research.
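To make the final stages of the pipeline concrete, the following is a minimal sketch of how diarized, transcribed segments might be filtered after manual verification and then tallied per speaker. The `Segment` schema, its field names, the speaker labels, and the helpers `keep_verified` and `hours_per_speaker` are all assumptions introduced for illustration; they are not taken from the NileTTS release, and the demo values are placeholders rather than real data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized, transcribed utterance (hypothetical schema)."""
    speaker: str    # diarization label, e.g. "spk0" (assumed naming)
    start: float    # start time in seconds
    end: float      # end time in seconds
    text: str       # automatic transcript
    verified: bool  # True once a human reviewer has accepted the segment


def keep_verified(segments):
    """Keep segments that passed manual review and have usable content."""
    return [
        s for s in segments
        if s.verified and s.text.strip() and s.end > s.start
    ]


def hours_per_speaker(segments):
    """Sum segment durations per speaker and convert seconds to hours."""
    totals = defaultdict(float)
    for s in segments:
        totals[s.speaker] += (s.end - s.start) / 3600.0
    return dict(totals)


if __name__ == "__main__":
    # Placeholder segments, not real NileTTS data.
    demo = [
        Segment("spk0", 0.0, 1800.0, "placeholder transcript A", True),
        Segment("spk1", 1800.0, 5400.0, "placeholder transcript B", True),
        Segment("spk0", 5400.0, 5460.0, "placeholder transcript C", False),
        Segment("spk1", 5460.0, 5470.0, "   ", True),
    ]
    kept = keep_verified(demo)
    print(len(kept), hours_per_speaker(kept))
```

Keeping verification as an explicit flag on each segment, rather than deleting rejected segments outright, would let the reported hour totals be recomputed if the review criteria change.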