Improving ASR systems is necessary to make new LLM-based use cases accessible to people across the globe. In this paper, we focus on Indian languages and make the case that diverse benchmarks are required to evaluate and improve ASR systems for them. To this end, we collate Vistaar, a set of 59 benchmarks across various language and domain combinations, on which we evaluate 3 publicly available ASR systems and 2 commercial systems. We also train IndicWhisper models by fine-tuning the Whisper models on publicly available training datasets across 12 Indian languages, totalling 10.7K hours. We show that IndicWhisper significantly improves on the considered ASR systems on the Vistaar benchmark. Indeed, IndicWhisper has the lowest WER in 39 out of the 59 benchmarks, with an average reduction of 4.1 WER. We open-source all datasets, code and models.