Audio-visual speech recognition (AVSR) research has recently achieved great success by improving the noise robustness of audio-only automatic speech recognition (ASR) with noise-invariant visual information. However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in sub-optimal multimodal representations for the downstream speech recognition task. In this paper, we propose a cross-modal global interaction and local alignment (GILA) approach for AVSR, which captures the deep audio-visual (A-V) correlations from both global and local perspectives. Specifically, we design a global interaction model to capture the A-V complementary relationship at the modality level, as well as a local alignment approach to model the A-V temporal consistency at the frame level. Such a holistic view of cross-modal correlations enables better multimodal representations for AVSR. Experiments on the public benchmarks LRS3 and LRS2 show that GILA outperforms the supervised-learning state of the art.
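The contrast between plain concatenation and frame-level alignment can be illustrated with a minimal sketch. This is not the GILA implementation: the abstract does not specify the alignment objective, so the InfoNCE-style contrastive loss below is an assumption, and the function names `concat_fusion` and `frame_alignment_loss`, the temperature value, and the toy feature sizes are all hypothetical. The sketch only shows that a frame-level objective rewards temporally matched audio and visual frames, which simple concatenation does not check.

```python
import math
import random


def concat_fusion(audio, visual):
    """Baseline fusion: join each audio frame vector with its visual frame vector."""
    return [a + v for a, v in zip(audio, visual)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def frame_alignment_loss(audio, visual, temperature=0.1):
    """Assumed InfoNCE-style loss (not taken from the paper).

    Audio frame t is the query, visual frame t is the positive, and all
    other visual frames in the sequence act as negatives.
    """
    total = 0.0
    for t, a in enumerate(audio):
        logits = [cosine(a, v) / temperature for v in visual]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[t]
    return total / len(audio)


# Toy data: 6 frames of 8-dimensional features (sizes are arbitrary).
random.seed(0)
num_frames, dim = 6, 8
audio = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]
# "Visual" features that track the audio frame by frame, plus small noise.
aligned_visual = [[x + 0.05 * random.gauss(0, 1) for x in frame] for frame in audio]
# The same visual frames shifted by one step, i.e. temporally misaligned.
shifted_visual = aligned_visual[1:] + aligned_visual[:1]

fused = concat_fusion(audio, aligned_visual)
loss_aligned = frame_alignment_loss(audio, aligned_visual)
loss_shifted = frame_alignment_loss(audio, shifted_visual)

print("fused shape:", len(fused), "x", len(fused[0]))
print("aligned loss is lower:", loss_aligned < loss_shifted)
```

Concatenation produces a fused feature of the same shape whether or not the two streams are in sync, whereas the assumed alignment loss is lower for the synchronized pair than for the shifted one. That difference is the kind of frame-level temporal consistency signal the abstract refers to.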