While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing well-learned modality features, such as the outputs of modality-specific encoders, without considering contextual relationships during modality feature learning. In this study, we propose a multi-layer cross-attention fusion-based AVSR (MLCA-AVSR) approach that promotes the representation learning of each modality by fusing them at different levels of the audio and visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of our proposed system, which achieves a concatenated minimum-permutation character error rate (cpCER) of 30.57% on the Eval set and yields up to 3.17% relative improvement over our previous system, which ranked second in the challenge. After fusing multiple systems, our proposed approach surpasses the first-place system, establishing a new state-of-the-art (SOTA) cpCER of 29.13% on this dataset.
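For readers unfamiliar with the metric, the following is a minimal sketch of how cpCER is commonly computed: each speaker's utterances are concatenated, the character-level edit distance is evaluated for every assignment of hypothesis speakers to reference speakers, and the lowest total is divided by the number of reference characters. The abstract does not spell out this procedure, so the sketch is an assumption based on the usual definition of the metric, not the challenge's official scoring script; the function names `edit_distance` and `cp_cer` are hypothetical.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def cp_cer(ref_by_speaker, hyp_by_speaker):
    """Sketch of cpCER (assumed definition, not the official scorer).

    Each argument is a list with one entry per speaker; each entry is a
    list of that speaker's utterances as strings.
    """
    # Concatenate every speaker's utterances into one string.
    refs = ["".join(utts) for utts in ref_by_speaker]
    hyps = ["".join(utts) for utts in hyp_by_speaker]

    # Pad with empty transcripts so both sides have the same speaker count.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))

    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference transcripts are empty")

    # Brute-force search over speaker assignments; fine for a handful of speakers.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars
```

The brute-force permutation search grows factorially with the number of speakers, so a real scorer for larger sessions would more likely solve the speaker assignment with a dedicated matching algorithm.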