A notable gap persists in speech synthesis research and development for Arabic dialects, particularly from a unified modeling perspective. Although the task has high practical value, the inherent linguistic complexity of Arabic dialects, compounded by a lack of standardized data, benchmarks, and evaluation guidelines, steers researchers toward safer ground. To bridge this divide, we present Habibi, a suite of specialized and unified text-to-speech models that harnesses existing open-source ASR corpora to support a wide range of high- to low-resource Arabic dialects through linguistically informed curriculum learning. Our approach outperforms the leading commercial service in generation quality while remaining extensible through effective in-context learning, and it does not require text diacritization. We are committed to open-sourcing the model and to creating the first systematic benchmark for multi-dialect Arabic speech synthesis. Furthermore, by identifying the key challenges of the task and establishing evaluation standards for it, we aim to provide solid groundwork for subsequent research. Resources are available at https://SWivid.github.io/Habibi/