Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addresses this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, in which fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods that rely on trained neural predictors, the proposed method is training-free, reducing complexity and enhancing generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and consistent improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and of frame-level versus utterance-level OA further validate the proposed design.
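The core fusion step can be illustrated with a minimal sketch. This is not the paper's exact formulation: the weighting rule (normalizing two intelligibility scores) and the function names are illustrative assumptions; the paper derives its weights from backend-ASR intelligibility estimates whose precise form is defined in the method section.

```python
import numpy as np

def observation_addition(noisy, enhanced, weight):
    """Fuse noisy and SE-enhanced waveforms; weight in [0, 1] favors the noisy signal."""
    weight = float(np.clip(weight, 0.0, 1.0))
    return weight * noisy + (1.0 - weight) * enhanced

def fusion_weight_from_intelligibility(score_noisy, score_enhanced):
    """Hypothetical weighting rule: normalize two intelligibility scores
    (e.g., backend-ASR confidence on each signal) into a fusion weight."""
    total = score_noisy + score_enhanced
    if total == 0:
        return 0.5  # no evidence either way; average the two signals
    return score_noisy / total

# Illustrative usage with synthetic signals (an SE output stand-in).
rng = np.random.default_rng(0)
noisy = rng.standard_normal(16000)
enhanced = 0.8 * noisy
w = fusion_weight_from_intelligibility(0.4, 0.6)
fused = observation_addition(noisy, enhanced, w)
```

Because the weight is computed per signal pair at inference time, no predictor is trained, which is what makes the approach training-free; the same fusion can in principle be applied per frame instead of per utterance.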