We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech and combining it with publicly available datasets, we pre-train Conformer-based BEST-RQ models with up to 600M parameters. We evaluate the models on dialect identification (DID) and automatic speech recognition (ASR), achieving state-of-the-art performance on DID while using fewer parameters than competing models. We show that family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual models or monolingual models trained on non-Arabic data. All models, code, and pre-processed datasets will be publicly released to support reproducibility and further research in Arabic speech technologies.