Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addresses this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, in which the fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, which reduces complexity and enhances generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and of frame-level versus utterance-level OA further validate the proposed design.
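As a rough illustration of utterance-level OA, the fusion can be sketched as a convex combination of the noisy and SE-enhanced waveforms. The abstract does not specify how intelligibility estimates are mapped to fusion weights, so the normalization rule in `fusion_weight` below is purely an assumption, as are the function names and the scalar scores, which stand in for whatever intelligibility estimate the backend ASR provides.

```python
def fusion_weight(score_noisy, score_enhanced, eps=1e-8):
    """Weight placed on the enhanced signal.

    ASSUMPTION: simple normalization of two non-negative intelligibility
    scores. This is a hypothetical rule, not the paper's formula.
    """
    total = score_noisy + score_enhanced
    if total < eps:
        return 0.5  # no evidence either way: weight both inputs equally
    return score_enhanced / total


def observation_addition(noisy, enhanced, weight_enhanced):
    """Fuse two equal-length waveforms sample by sample.

    fused[t] = w * enhanced[t] + (1 - w) * noisy[t]
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced signals must have the same length")
    if not 0.0 <= weight_enhanced <= 1.0:
        raise ValueError("weight_enhanced must lie in [0, 1]")
    w = weight_enhanced
    return [w * e + (1.0 - w) * n for n, e in zip(noisy, enhanced)]


# Toy example with made-up scores and samples (not real data).
w = fusion_weight(score_noisy=0.25, score_enhanced=0.75)
fused = observation_addition([1.0, 1.0], [0.0, 0.0], w)
```

Because the fusion operates only on the signals fed to the recognizer, neither the SE nor the ASR parameters are touched, which matches the setting described above. A frame-level variant would replace the single scalar weight with one weight per frame.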