Dialectal Arabic (DA) speech data vary widely in domain coverage, dialect labeling practices, and recording conditions, complicating cross-dataset comparison and model evaluation. To characterize this landscape, we conduct a computational analysis of linguistic "dialectness" alongside objective proxies of audio quality on the training splits of widely used DA corpora. We find substantial heterogeneity both in acoustic conditions and in the strength and consistency of dialectal signals across datasets, underscoring the need for standardized characterization beyond coarse labels. To reduce fragmentation and support reproducible evaluation, we introduce Arab Voices, a standardized framework for DA ASR. Arab Voices provides unified access to 31 datasets spanning 14 dialects, with harmonized metadata and evaluation utilities. We further benchmark a range of recent ASR systems, establishing strong baselines for modern DA ASR.