Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream geographic impact of failed transcriptions and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers as for English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with fewer than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.