Audio-visual speech recognition (AVSR) offers a promising way to improve the noise robustness of audio-only speech recognition by exploiting visual information. However, because audio is the dominant modality in the AVSR task, most existing work still improves robustness on the audio side, using noise adaptation techniques such as front-end denoising. Though effective, these methods usually face two practical challenges: 1) a lack of sufficient labeled noisy audio-visual training data in some real-world scenarios and 2) suboptimal generalization to unseen test noise. In this work, we investigate the noise-invariant visual modality to strengthen the robustness of AVSR, which can adapt to any test noise without depending on noisy training data, i.e., unsupervised noise adaptation. Inspired by the human perception mechanism, we propose a universal viseme-phoneme mapping (UniVPM) approach to implement modality transfer, which can restore clean audio from visual signals to enable speech recognition under any noisy conditions. Extensive experiments on the public benchmarks LRS3 and LRS2 show that our approach achieves state-of-the-art performance under various noisy conditions as well as clean conditions. In addition, we outperform the previous state of the art on the visual speech recognition task.