Although many Automatic Speech Recognition (ASR) systems have been developed for Modern Standard Arabic (MSA) and Dialectal Arabic (DA), few studies have focused on dialect-specific implementations, particularly for low-resource Arabic dialects such as Sudanese. This paper presents a comprehensive study of data augmentation techniques for fine-tuning OpenAI Whisper models and establishes the first benchmark for the Sudanese dialect. Two augmentation strategies are investigated: (1) self-training with pseudo-labels generated from unlabeled speech, and (2) TTS-based augmentation using synthetic speech from the Klaam TTS system. The best-performing model, Whisper-Medium fine-tuned with combined self-training and TTS augmentation (28.4 hours of training data), achieves a Word Error Rate (WER) of 57.1% on the evaluation set and 51.6% on an out-of-domain holdout set, substantially outperforming zero-shot multilingual Whisper (78.8% WER) and MSA-specialized Arabic models (73.8-123% WER). All experiments used low-cost resources (the Kaggle free tier and a Lightning.ai trial), demonstrating that strategic data augmentation can overcome resource limitations for low-resource dialects and providing a practical roadmap for developing ASR systems for low-resource Arabic dialects and other marginalized language varieties. The models, evaluation benchmarks, and reproducible training pipelines are publicly released to facilitate future research on low-resource Arabic ASR.