Although speech recognition systems achieve low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. We study this failure mode in one such task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions in terms of geographic location and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for speakers whose primary language is not English as for speakers whose primary language is English. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with fewer than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for speakers whose primary language is not English. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
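To make the two headline quantities concrete, the following is a minimal sketch of how a street-name transcription error rate and a relative accuracy improvement could be computed. It rests on assumptions not stated in the abstract: that a transcription counts as correct only when it matches the reference exactly after simple normalization, and that "relative to base models" means the accuracy gain divided by the base accuracy. The function names and the normalization rule are hypothetical illustrations, not the authors' evaluation code.

```python
import re


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rule)."""
    name = re.sub(r"[^a-z0-9\s]", "", name.lower())
    return " ".join(name.split())


def error_rate(references: list[str], hypotheses: list[str]) -> float:
    """Fraction of utterances whose normalized transcription differs
    from the normalized reference street name (assumed exact-match metric)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    if not references:
        raise ValueError("at least one utterance is required")
    errors = sum(
        normalize(ref) != normalize(hyp)
        for ref, hyp in zip(references, hypotheses)
    )
    return errors / len(references)


def relative_improvement(base_accuracy: float, tuned_accuracy: float) -> float:
    """Accuracy gain expressed as a fraction of the base accuracy
    (assumed reading of 'relative to base models')."""
    if base_accuracy <= 0:
        raise ValueError("base accuracy must be positive")
    return (tuned_accuracy - base_accuracy) / base_accuracy
```

Under this reading, a relative improvement of 0.6 means the fine-tuned accuracy is 1.6 times the base accuracy, which is a different statement from a gain of 60 percentage points.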