We evaluate whether compact, domain-specialized automatic speech recognition (ASR) models can outperform massively multilingual foundation models on conversational African speech across 19 languages in the WAXAL corpus. Fine-tuned edge models achieve a macro-averaged word error rate (WER) of $38.0\%$, compared to $64.9\%$ for the best zero-shot baseline: a $26.9$ percentage-point reduction using models $3$ to $40\times$ smaller. These results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution (OOD) speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution. A distributed native-speaker audit across all surveyed languages produces a linguistically grounded error taxonomy, showing that CTC and autoregressive architectures behave differently across language families. We further show that WER alone misrepresents performance for syllabary-script languages, where CER/WER ratios (CER: character error rate) reveal substantially higher character-level accuracy than the headline WER suggests. Finally, to support future African ASR research, we release all model weights, fine-tuning and evaluation scripts, and a cleaned WAXAL subset covering all $19$ languages.
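The metrics named in the abstract (WER, CER, their ratio, and macro-averaging across languages) can be illustrated with a minimal sketch. This is not the paper's evaluation script: the function names, the whitespace tokenization, the absence of text normalization, and the toy per-language scores are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference word count.
    Assumes whitespace tokenization and a non-empty reference."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edit distance over reference length.
    Assumes a non-empty reference; spaces count as characters here."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


def macro_average(per_language: Dict[str, float]) -> float:
    """Unweighted mean over languages, so each language counts equally."""
    return sum(per_language.values()) / len(per_language)


if __name__ == "__main__":
    ref, hyp = "the cat sat", "the cat sit"
    print("WER:", wer(ref, hyp))
    print("CER:", cer(ref, hyp))
    print("CER/WER:", cer(ref, hyp) / wer(ref, hyp))
    # Hypothetical per-language scores, not results from the paper.
    print("macro WER:", macro_average({"lang_a": 0.30, "lang_b": 0.50}))
```

In the toy pair above, a single wrong character makes one of three words wrong, so WER is far higher than CER. A low CER/WER ratio signals that most word errors come from small character-level deviations, which is the kind of gap the abstract reports for syllabary-script languages. Real evaluations usually pool errors over all utterances in a language before macro-averaging, rather than scoring single sentences.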