Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech. In matched comparisons across nine models, we find that current models do not meaningfully use this context: diagnosis-informed and clinically detailed prompts yield negligible improvements and often worsen word error rate (WER). We complement the prompting analysis with context-dependent fine-tuning, showing that LoRA adaptation with a mixture of clinical prompt formats achieves a WER of 0.066, a 52% reduction relative to the frozen baseline, while preserving performance when context is unavailable. Subgroup analyses reveal significant gains for speakers with Down syndrome and for mild-severity speakers. These results clarify where current models fall short and provide a testbed for measuring progress toward more inclusive ASR.
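For readers less familiar with the metric, the following is a minimal sketch of how a per-utterance WER and a relative WER reduction are commonly computed. It assumes whitespace tokenization and no text normalization; the function names `word_error_rate` and `relative_reduction` are illustrative, and the scoring pipeline actually used for the benchmark is not specified here and may differ (for example, by aggregating errors over the whole corpus).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: plain whitespace tokenization, no casing or punctuation
    normalization. Real evaluation pipelines usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction by which adapted_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative strings and numbers only; not taken from the benchmark.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(relative_reduction(0.2, 0.1))
```

Under this definition, a relative reduction of 0.52 means the adapted model's WER is 48% of the baseline's WER, which is distinct from an absolute drop of 52 percentage points.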