Ramsa is a 41-hour speech corpus of Emirati Arabic, still under development, designed to support sociolinguistic research and low-resource language technologies. It contains recordings of structured interviews with native speakers and episodes of national television shows. The corpus features 157 speakers (59 female, 98 male), spans subdialects such as Urban, Bedouin, and Mountain/Shihhi, and covers topics such as cultural heritage, agriculture and sustainability, daily life, professional trajectories, and architecture. It consists of 91 monologic and 79 dialogic recordings that vary in length and recording conditions. To establish initial baselines, a 10% subset was used to evaluate commercial and open-source models for automatic speech recognition (ASR) and text-to-speech (TTS) in a zero-shot setting. Whisper-large-v3-turbo achieved the best ASR performance, with average word and character error rates of 0.268 and 0.144, respectively. For TTS, MMS-TTS-Ara achieved the best mean word and character error rates, 0.285 and 0.081, respectively. These baselines are competitive but leave substantial room for improvement. The paper highlights the challenges encountered and provides directions for future work.
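
The word and character error rates quoted above follow the usual edit-distance definition: the number of substitutions, deletions, and insertions needed to turn the system output into the reference, divided by the reference length. The sketch below is a minimal illustration of that definition, not the paper's evaluation code. The function names (`edit_distance`, `wer`, `cer`, `mean_rate`), the whitespace tokenization, the removal of spaces before computing the character rate, and the per-recording averaging are all assumptions; the abstract does not specify text normalization or how transcripts of synthesized speech were obtained for the TTS scores.

```python
from typing import Callable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences, unit costs."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; tokens are whitespace-separated (assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces are dropped first (assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def mean_rate(pairs: Sequence[Tuple[str, str]],
              metric: Callable[[str, str], float]) -> float:
    """Unweighted mean of a metric over (reference, hypothesis) pairs."""
    return sum(metric(r, h) for r, h in pairs) / len(pairs)
```

Under these assumptions, a score of 0.268 corresponds to roughly one word-level error for every four reference words. Averaging per recording, as `mean_rate` does, weights short and long recordings equally; pooling all errors over the total reference length is a common alternative and can give a different figure.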