The performance of audio-only keyword spotting (KWS) systems, commonly measured in false alarms and false rejects, degrades significantly under far-field and noisy conditions. Audio-visual keyword spotting, which exploits the complementary relationship between modalities, has therefore gained much attention recently. However, current studies mainly combine representations that were learned separately for each modality, rather than exploring inter-modal relationships while each modality is being modeled. In this paper, we propose a novel visual modality enhanced end-to-end KWS framework (VE-KWS), which fuses the audio and visual modalities in two ways. First, speaker location information obtained from the lip region in the video is used to assist the training of a multi-channel audio beamformer. With the beamformer acting as an audio enhancement module, the acoustic distortions caused by far-field or noisy environments can be significantly suppressed. Second, cross-attention is applied between the modalities to capture inter-modal relationships and to help the representation learning of each modality. Experiments on the MISP challenge corpus show that our proposed model achieves a 2.79% false rejection rate and a 2.95% false alarm rate on the Eval set, setting a new state of the art (SOTA) relative to the top-ranking systems in the ICASSP 2022 MISP challenge.
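To make the two reported metrics concrete, the following is a minimal sketch of how a false rejection rate and a false alarm rate are commonly computed from binary decisions. It assumes the usual definitions (false rejects divided by the number of keyword samples, false alarms divided by the number of non-keyword samples); the exact scoring protocol of the challenge is not stated here, and the function name `kws_error_rates` and the toy labels are hypothetical.

```python
def kws_error_rates(labels, predictions):
    """Return (FRR, FAR) for binary keyword decisions.

    labels / predictions: sequences of 0 or 1, where 1 means "keyword present".
    Assumed definitions (not taken from the paper):
      FRR = missed keywords / all keyword samples
      FAR = false triggers  / all non-keyword samples
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives

    false_rejects = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    false_alarms = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)

    # Guard against empty classes so the function never divides by zero.
    frr = false_rejects / positives if positives else 0.0
    far = false_alarms / negatives if negatives else 0.0
    return frr, far


# Toy data: 4 keyword samples (one missed), 6 non-keyword samples (one false trigger).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
frr, far = kws_error_rates(labels, predictions)
```

In this toy case one of four keywords is missed and one of six non-keyword samples triggers a detection, so the two rates trade off against each other as the decision threshold moves.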
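The cross-attention step described above can be illustrated with a short sketch. This is generic single-head scaled dot-product cross-attention in NumPy, in which audio frames act as queries and video frames supply keys and values; it is not the VE-KWS implementation. The feature dimensions, the random projection matrices, and the names `cross_attention`, `w_q`, `w_k`, and `w_v` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    query_feats:   (T_q, d_q) frames of the modality being enhanced
    context_feats: (T_c, d_c) frames of the other modality
    Returns the attended features (T_q, d_model) and the weights (T_q, T_c).
    """
    q = query_feats @ w_q      # (T_q, d_model)
    k = context_feats @ w_k    # (T_c, d_model)
    v = context_feats @ w_v    # (T_c, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical sizes: 10 audio frames, 5 video frames.
rng = np.random.default_rng(0)
d_audio, d_video, d_model = 8, 6, 4
audio = rng.standard_normal((10, d_audio))
video = rng.standard_normal((5, d_video))

# Random projections stand in for learned parameters.
w_q = rng.standard_normal((d_audio, d_model))
w_k = rng.standard_normal((d_video, d_model))
w_v = rng.standard_normal((d_video, d_model))

# Audio queries attend over video keys and values.
audio_attended, attn = cross_attention(audio, video, w_q, w_k, w_v)
```

Swapping the roles of `audio` and `video` (with correspondingly shaped projections) gives the video-to-audio direction, so each modality's representation can be informed by the other.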