Improving ASR systems is necessary to make new LLM-based use cases accessible to people across the globe. In this paper, we focus on Indian languages and make the case that diverse benchmarks are required to evaluate and improve ASR systems for these languages. To address this, we collate Vistaar, a set of 59 benchmarks across various language and domain combinations, on which we evaluate 3 publicly available ASR systems and 2 commercial systems. We also train IndicWhisper models by fine-tuning the Whisper models on publicly available training datasets across 12 Indian languages, totalling 10.7K hours. We show that IndicWhisper significantly improves on the considered ASR systems on the Vistaar benchmark. Indeed, IndicWhisper has the lowest WER in 39 out of the 59 benchmarks, with an average reduction of 4.1 WER. We open-source all datasets, code and models.
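For readers unfamiliar with the metric behind these numbers, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used for Vistaar: the function name `word_error_rate` is hypothetical, and plain whitespace tokenization with no text normalization is an assumption that may differ from the paper's setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for one reference/hypothesis pair.

    Assumption: words are separated by whitespace and no normalization
    (casing, punctuation, numerals) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the count of errors is divided by the reference length, WER can exceed 100 when the hypothesis contains many inserted words. Benchmark-level figures are usually aggregated over many utterances, which this per-utterance sketch does not cover.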