Word Error Rate (WER) is the dominant metric for automatic speech recognition (ASR), but it can overestimate errors when references and hypotheses encode the same words in different scripts. This issue is common in multilingual settings, where ASR models may emit romanized text. We propose Script-Normalized WER (SN-WER), a training-free, evaluation-only scoring method that transliterates both reference and hypothesis text into a language-specific canonical script before computing WER. We evaluate SN-WER on 5 Indic languages, 2 datasets, and 3 ASR models. On curated FLEURS data, SN-WER reduces inflated model gaps by up to 12%, while on noisier Common Voice data the reductions are smaller or inconsistent, suggesting that the remaining errors reflect genuine recognition weaknesses rather than script mismatch alone. Controlled stress tests show a 67% attenuation of artificial romanization-induced WER inflation, while lexical-substitution controls show that SN-WER remains nearly as sensitive to semantic errors as WER, with a ΔSN-WER/ΔWER ratio of approximately 1.09. SN-WER is robust to transliterator choice and normalization changes, and its token-collision rate stays below 0.1% in the evaluated Indic setting. We argue that SN-WER should be reported alongside WER and CER as a companion metric for script-insensitive ASR evaluation, especially when transcripts feed downstream search, indexing, or multilingual LLM pipelines.
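To make the scoring procedure concrete, the following is a minimal sketch of the idea described above: map every token of the reference and the hypothesis into one canonical script, then compute ordinary word-level WER. It is an illustration under stated assumptions, not the evaluated implementation. The function names (`toy_transliterate`, `wer`, `sn_wer`) and the three-word romanized-Hindi lexicon are hypothetical stand-ins; a real evaluation would plug in a full transliterator for each language.

```python
import unicodedata
from typing import Callable, Dict, List

# Hypothetical toy lexicon (romanized Hindi -> Devanagari), for illustration only.
# A real setup would replace this with a proper transliteration system.
TOY_ROMAN_TO_DEVANAGARI: Dict[str, str] = {
    "mera": "मेरा",
    "naam": "नाम",
    "ghar": "घर",
}


def toy_transliterate(token: str) -> str:
    """Map a romanized token to the canonical script; leave other tokens unchanged."""
    return TOY_ROMAN_TO_DEVANAGARI.get(token.lower(), token)


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Standard WER: word edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def sn_wer(reference: str, hypothesis: str,
           transliterate: Callable[[str], str] = toy_transliterate) -> float:
    """Script-normalized WER: transliterate both sides, then score with WER."""
    def normalize(text: str) -> str:
        # NFC keeps visually identical strings byte-identical before comparison.
        text = unicodedata.normalize("NFC", text)
        return " ".join(transliterate(token) for token in text.split())

    return wer(normalize(reference), normalize(hypothesis))
```

With the reference `"मेरा नाम राम है"` and the hypothesis `"mera naam राम हूँ"`, plain `wer` counts three substitutions out of four words (0.75), whereas `sn_wer` counts only the genuine lexical error on the last word (0.25). The two romanized tokens no longer contribute, while the real recognition error is still penalized.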