This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. We analyze the performance of three ASR model families (Whisper, MMS, and Seamless-M4T) using Word Error Rate (WER), together with a detailed examination of the most frequently misrecognized words and of the error types: insertions, deletions, and substitutions. The analysis covers two types of datasets, read speech and conversational speech. Notably, we present the first conversational speech dataset designed for benchmarking Urdu ASR models. We find that seamless-large outperforms the other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset. The evaluation also highlights how difficult it is to assess ASR models for low-resource languages like Urdu using quantitative metrics alone, and it emphasizes the need for a robust Urdu text normalization system. Our findings offer insights for developing robust ASR systems for low-resource languages like Urdu.
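As an illustration of the metric named above, the following is a minimal sketch of how WER and its insertion, deletion, and substitution counts can be derived from a word-level Levenshtein alignment. It is not the evaluation code used in this paper: the function names (`align`, `wer`, `frequent_substitutions`) are hypothetical, tokenization is plain whitespace splitting, and no Urdu text normalization is applied, which is an assumption that would inflate error counts on real transcripts.

```python
from collections import Counter
from typing import List, Tuple


def align(ref: List[str], hyp: List[str]) -> List[Tuple[str, str, str]]:
    """Word-level Levenshtein alignment.

    Returns (op, ref_word, hyp_word) tuples, where op is one of
    "ok", "sub", "del", "ins". Hypothetical helper for illustration.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append(("ok" if cost == 0 else "sub", ref[i - 1], hyp[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def wer(ref_text: str, hyp_text: str) -> Tuple[float, Counter]:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = ref_text.split(), hyp_text.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    counts = Counter(op for op, _, _ in align(ref, hyp))
    errors = counts["sub"] + counts["del"] + counts["ins"]
    return errors / len(ref), counts


def frequent_substitutions(pairs: List[Tuple[str, str]], top: int = 10):
    """Tally the most common (reference word, hypothesis word) confusions."""
    tally: Counter = Counter()
    for ref_text, hyp_text in pairs:
        for op, r, h in align(ref_text.split(), hyp_text.split()):
            if op == "sub":
                tally[(r, h)] += 1
    return tally.most_common(top)


# Placeholder tokens stand in for Urdu words: one substitution, one insertion.
rate, counts = wer("a b c d", "a x c d e")
print(rate, dict(counts))
```

In this sketch the placeholder example contains one substitution and one insertion against a four-word reference, so the WER is 2/4. Because the comparison is exact string matching, spelling or diacritic variants of the same Urdu word would be counted as substitutions unless the transcripts are normalized first.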