Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: https://github.com/aitrics-ronaldo/Interspeech_MultiClin.
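As a rough illustration of what multiscript-aware evaluation could look like, the sketch below scores a hypothesis against every valid orthographic form of the reference and keeps the lowest character error rate. This is a minimal sketch under stated assumptions, not the MultiClin scoring code: the function names (`cer`, `expand_references`, `multiscript_cer`), the segment-based reference format, and the Korean/Latin example are all hypothetical and do not come from the source.

```python
from itertools import product


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Conventional single-reference character error rate."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def expand_references(segments: list[list[str]]) -> list[str]:
    """Build every full reference string from per-segment variant lists.

    Assumed format (hypothetical): each segment lists its valid spellings,
    e.g. [["fixed text "], ["variant A", "variant B"]].
    """
    return ["".join(choice) for choice in product(*segments)]


def multiscript_cer(segments: list[list[str]], hyp: str) -> float:
    """Lowest CER over all valid orthographic forms of the reference."""
    return min(cer(ref, hyp) for ref in expand_references(segments))


# Made-up example: the same term written in Latin script or in Hangul.
segments = [["환자 "], ["CT", "씨티"], [" 촬영"]]
hypothesis = "환자 씨티 촬영"

single = cer("환자 CT 촬영", hypothesis)      # penalizes the script variant
multi = multiscript_cer(segments, hypothesis)  # accepts either spelling
```

With the single Latin-script reference, the two substituted characters count as errors even though the hypothesis is a valid transcription; taking the minimum over all variants removes that penalty. Note that the number of expanded references grows multiplicatively with the number of variable segments, so a real implementation might need a lattice-based alignment instead.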
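The observation that a balanced 50% mapping ratio produces the highest entropy is consistent with the behavior of binary Shannon entropy, which peaks when both outcomes are equally likely. The sketch below assumes, purely for illustration, that each occurrence of a term is written in one of two scripts with probability `p` and `1 - p`; the source does not specify how entropy is measured, and `mapping_entropy` is a hypothetical helper name.

```python
import math


def mapping_entropy(p: float) -> float:
    """Binary Shannon entropy (in bits) of a two-script mapping ratio p.

    Assumption: a term is rendered in script A with probability p and in
    script B with probability 1 - p.
    """
    if p in (0.0, 1.0):
        return 0.0  # fully unified script: no orthographic uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Sweep mapping ratios from 0% to 100% in 10% steps.
ratios = [i / 10 for i in range(11)]
most_uncertain = max(ratios, key=mapping_entropy)
```

Under this assumption the entropy is zero at either extreme, which corresponds to script unification, and reaches its maximum of one bit at `p = 0.5`.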