As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA through DSPy to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's kappa of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
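To make the two quantities named in the abstract concrete, the following is a minimal sketch of how WER and Cohen's kappa are conventionally computed. It is not the paper's evaluation code: the function names `wer` and `cohens_kappa` and the sentences and labels in the usage note are hypothetical, and whitespace tokenization with no text normalization is an assumption made for brevity.

```python
from collections import Counter


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length.

    Assumes plain whitespace tokenization and no normalization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_o - p_e) / (1 - p_e)
```

As an invented illustration, `wer("the patient has hypertension", "the patient has hypotension")` evaluates to 0.25, the same score any single-word substitution in a four-word utterance would receive, even though this particular substitution reverses the clinical meaning. Likewise, `cohens_kappa(["No", "No", "Minimal", "Significant"], ["No", "No", "Minimal", "Minimal"])` evaluates to 0.6 on these toy labels, which use the abstract's three impact categories but are not drawn from the paper's data.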