Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. However, heavily corrupted audio inputs tend to introduce adverse interference into the feature fusion process. To mitigate this, recent AVSR methods often adopt mask-based strategies that filter audio noise during feature interaction and fusion, yet such methods risk discarding semantically relevant information along with the noise. In this work, we propose an end-to-end noise-robust AVSR framework coupled with speech enhancement that eliminates the need for explicit noise mask generation. The framework leverages a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance. By reducing modality redundancy and enhancing inter-modal interaction, our method preserves the semantic integrity of the speech and achieves robust recognition performance. Experiments on the public LRS3 benchmark show that our method outperforms strong prior mask-based baselines under noisy conditions.
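To make the fusion idea concrete, the following is a minimal sketch of attention-bottleneck fusion between audio and video feature streams, in the spirit of the Conformer-based bottleneck fusion module described above. It is not the authors' implementation: the module names, dimensions, and the substitution of standard Transformer encoder layers for full Conformer blocks are all illustrative assumptions made for brevity.

```python
# Hedged sketch: a small set of learnable bottleneck tokens mediates the
# exchange between noisy audio features and video (lip) features, so the
# audio stream is refined with video assistance without an explicit noise mask.
import torch
import torch.nn as nn


class BottleneckFusion(nn.Module):
    def __init__(self, dim=256, num_bottlenecks=4, num_layers=2, nhead=4):
        super().__init__()
        # Learnable bottleneck tokens shared across modalities (assumed size).
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottlenecks, dim) * 0.02)
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=nhead, dim_feedforward=4 * dim, batch_first=True
        )
        # Per-modality stacks; each layer attends over its modality plus the bottleneck.
        # (Standard encoder layers stand in for Conformer blocks in this sketch.)
        self.audio_layers = nn.ModuleList(make_layer() for _ in range(num_layers))
        self.video_layers = nn.ModuleList(make_layer() for _ in range(num_layers))

    def forward(self, audio, video):
        # audio: (B, Ta, dim) noisy audio features; video: (B, Tv, dim) lip features.
        B = audio.size(0)
        z = self.bottleneck.expand(B, -1, -1)
        n = z.size(1)
        for a_layer, v_layer in zip(self.audio_layers, self.video_layers):
            # Audio stream attends to its own frames plus the shared bottleneck tokens.
            a_out = a_layer(torch.cat([audio, z], dim=1))
            audio, z_a = a_out[:, :-n], a_out[:, -n:]
            # Video stream sees the bottleneck just updated by the audio stream.
            v_out = v_layer(torch.cat([video, z_a], dim=1))
            video, z = v_out[:, :-n], v_out[:, -n:]
        # Cross-modal information flows only through the narrow bottleneck,
        # which limits modality redundancy while refining the audio features.
        return audio, video


if __name__ == "__main__":
    fuse = BottleneckFusion()
    a = torch.randn(2, 100, 256)  # example audio features (batch of 2, 100 frames)
    v = torch.randn(2, 100, 256)  # example video features
    a_ref, v_ref = fuse(a, v)
    print(a_ref.shape, v_ref.shape)  # torch.Size([2, 100, 256]) for both streams
```

The design choice illustrated here is that neither modality attends directly to the other; both read from and write to a handful of shared bottleneck tokens, which is one common way to keep inter-modal interaction focused while avoiding an explicit audio noise mask.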