Automatic speech recognition (ASR) plays a vital role in enabling natural human-machine interaction across applications such as virtual assistants, industrial automation, customer support, and real-time transcription. However, developing accurate ASR systems for low-resource languages like Arabic remains a significant challenge due to limited labeled data and the linguistic complexity introduced by diverse dialects. In this work, we present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. In the first stage, we pretrain the model on 15,000 hours of weakly labeled speech covering both Modern Standard Arabic (MSA) and a range of Dialectal Arabic (DA) varieties. In the second stage, we perform continual supervised fine-tuning on a mixture of filtered weakly labeled data and a small, high-quality annotated dataset. Our approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge. These findings highlight the effectiveness of weak supervision paired with fine-tuning in overcoming data scarcity and delivering high-quality ASR for low-resource, dialect-rich languages.
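To make the second-stage data recipe concrete, the sketch below illustrates one plausible way to filter weakly labeled utterances by transcript confidence and mix them with a small curated set before fine-tuning. It is a minimal, self-contained illustration only: the `Utterance` fields, the confidence threshold, and the mixing fraction are assumptions for demonstration and are not values or interfaces taken from the paper.

```python
import random
from dataclasses import dataclass


# Illustrative record for one training utterance; field names are assumptions,
# not taken from the paper.
@dataclass
class Utterance:
    audio_path: str
    transcript: str
    confidence: float   # confidence score produced by the weak-labeling stage
    source: str         # "weak" or "curated"


def filter_weak(utterances, min_confidence=0.9):
    """Drop weakly labeled utterances below a confidence threshold.
    The threshold value is a placeholder, not a figure from the paper."""
    return [u for u in utterances if u.confidence >= min_confidence]


def build_finetuning_mixture(weak, curated, curated_fraction=0.3,
                             n_samples=1000, seed=0):
    """Sample a fine-tuning set from filtered weak data plus the small
    curated set; the mixing fraction is illustrative."""
    rng = random.Random(seed)
    n_curated = int(n_samples * curated_fraction)
    n_weak = n_samples - n_curated
    # Sample with replacement so the small curated set can be over-represented.
    mixture = rng.choices(curated, k=n_curated) + rng.choices(weak, k=n_weak)
    rng.shuffle(mixture)
    return mixture


if __name__ == "__main__":
    rng = random.Random(1)
    weak = [Utterance(f"weak_{i:05d}.wav", "<weak transcript>", rng.random(), "weak")
            for i in range(5000)]
    curated = [Utterance(f"gold_{i:04d}.wav", "<verified transcript>", 1.0, "curated")
               for i in range(200)]
    mixture = build_finetuning_mixture(filter_weak(weak), curated)
    share = sum(u.source == "curated" for u in mixture) / len(mixture)
    print(f"fine-tuning set: {len(mixture)} utterances, curated share = {share:.0%}")
```

In this sketch the curated set is over-sampled relative to its raw size, reflecting the general idea of letting a small, high-quality dataset anchor continual fine-tuning on much larger weakly labeled data; the actual filtering criteria and mixing proportions used in the paper are described in the body of the work.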