Existing Indic ASR benchmarks often rely on scripted, clean speech and on leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including the non-standardized spellings of code-mixed, English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances totaling 536 hours of speech from 36,691 speakers, with transcripts that account for spelling variations. We also analyze performance at the district level, revealing geographic disparities. Finally, we provide a detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.
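To illustrate why strict single-reference WER penalizes spelling variation, the following is a minimal sketch of a variant-aware WER. It is not the benchmark's actual scoring procedure, which this abstract does not describe. It assumes each reference position is stored as a set of acceptable spellings, and the function name `variant_aware_wer` and the romanized example words are hypothetical stand-ins.

```python
from typing import List, Set


def variant_aware_wer(reference: List[Set[str]], hypothesis: List[str]) -> float:
    """Word error rate where each reference slot accepts any spelling in its set.

    Assumption: the reference is pre-aligned into word slots, each holding the
    set of spellings treated as correct for that slot.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")

    # Standard Levenshtein dynamic programme over words, kept to two rows.
    prev = list(range(m + 1))  # cost of inserting the first j hypothesis words
    for i in range(1, n + 1):
        curr = [i] + [0] * m  # cost of deleting the first i reference words
        for j in range(1, m + 1):
            # A hypothesis word matches if it equals ANY accepted variant.
            is_match = hypothesis[j - 1] in reference[i - 1]
            substitution = prev[j - 1] + (0 if is_match else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[m] / n


# Hypothetical example: an ASR output using alternative spellings.
hypothesis = ["mera", "mobail", "nambar"]

# Strict scoring: exactly one accepted spelling per slot.
strict_reference = [{"mera"}, {"mobile"}, {"number"}]

# Variant-aware scoring: alternative spellings are also accepted.
variant_reference = [{"mera"}, {"mobile", "mobail"}, {"number", "nambar"}]

strict_score = variant_aware_wer(strict_reference, hypothesis)
variant_score = variant_aware_wer(variant_reference, hypothesis)
```

With a single accepted spelling per slot, the function reduces to ordinary WER, so the two variant spellings in the hypothesis count as substitutions. With the variant sets, the same hypothesis is scored as fully correct.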