While Automatic Speech Recognition (ASR) is typically benchmarked by word error rate (WER), real-world applications ultimately hinge on semantic fidelity. This mismatch is particularly problematic for dysarthric speech, where articulatory imprecision and disfluencies can cause severe semantic distortions. To bridge this gap, we introduce a Large Language Model (LLM)-based agent for post-ASR correction: a Judge-Editor over the top-k ASR hypotheses that keeps high-confidence spans, rewrites uncertain segments, and operates in both zero-shot and fine-tuned modes. In parallel, we release SAP-Hypo5, the largest benchmark for dysarthric speech correction, to enable reproducibility and future exploration. Under multi-perspective evaluation, our agent achieves a 14.51% WER reduction alongside substantial semantic gains, including a +7.59 pp improvement in MENLI and +7.66 pp in Slot Micro F1 on challenging samples. Our analysis further reveals that WER is highly sensitive to domain shift, whereas semantic metrics correlate more closely with downstream task performance.