PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech

Source: arXiv. Benchmark dataset and evaluation suite. Data: https://huggingface.co/datasets/prdeepakbabu/ProfASR-Bench Code: https://github.com/prdeepakbabu/ProfASR-Bench

Automatic Speech Recognition (ASR) in professional settings faces challenges that existing benchmarks underplay: dense domain terminology, formal register variation, and near-zero tolerance for critical entity errors. We present ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications across finance, medicine, law, and technology. Each example pairs a natural-language prompt (a domain cue and/or a speaker profile) with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition. The corpus supports conventional ASR metrics alongside entity-aware scores and slice-wise reporting by accent and gender. We evaluate two representative model families, Whisper (encoder-decoder ASR) and Qwen-Omni (audio language models), under matched no-context, profile, domain+profile, oracle, and adversarial conditions, and find a consistent pattern: lightweight textual context produces little to no change in average word error rate (WER), even with oracle prompts, and adversarial prompts do not reliably degrade performance. We term this the context-utilization gap (CUG): current systems are nominally promptable yet underuse readily available side information. ProfASR-Bench provides a standardized context ladder, entity- and slice-aware reporting with confidence intervals, and a reproducible testbed for comparing fusion strategies across model families. Dataset: https://huggingface.co/datasets/prdeepakbabu/ProfASR-Bench Code: https://github.com/prdeepakbabu/ProfASR-Bench
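The comparison described in the abstract, WER measured under each rung of the context ladder and compared against the no-context baseline, can be illustrated with a minimal sketch. This is not the benchmark's released evaluation code: the function names, the condition keys (`no_context`, `oracle`), and the toy transcripts are assumptions made up for illustration, and the per-condition WER difference shown here is only one plausible way to quantify how much a system uses its context.

```python
from typing import Dict, List, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Standard Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return errors / words if words else 0.0


def wer_by_condition(
    references: List[str], outputs: Dict[str, List[str]]
) -> Dict[str, float]:
    """Compute WER separately for each context condition."""
    return {
        condition: corpus_wer(list(zip(references, hypotheses)))
        for condition, hypotheses in outputs.items()
    }


def wer_delta(wers: Dict[str, float], baseline: str = "no_context") -> Dict[str, float]:
    """WER change of each condition relative to the baseline (negative = improvement)."""
    return {c: w - wers[baseline] for c, w in wers.items() if c != baseline}


# Toy data, invented for illustration only (not taken from the benchmark).
references = ["the patient received metoprolol fifty milligrams"]
outputs = {
    "no_context": ["the patient received metropolol fifty milligrams"],
    "oracle": ["the patient received metropolol fifty milligrams"],
}

wers = wer_by_condition(references, outputs)
deltas = wer_delta(wers)
print(wers)
print(deltas)
```

In this toy case the oracle prompt leaves the misrecognized drug name unchanged, so the delta against the no-context baseline is zero. That is the kind of pattern the abstract describes as underused side information. A fuller evaluation would add the entity-aware scores, the accent and gender slices, and the confidence intervals that the abstract mentions.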
