This paper extends our previous wake word spotting system, which ranked second in Track 1 of the MISP Challenge 2021. First, we investigate a robust unimodal approach based on 3D and 2D convolution and adopt the simple attention module (SimAM) to improve performance. Second, we explore different combinations of data augmentation methods. Finally, we study fusion strategies, including score-level, cascaded, and neural fusion. The proposed multimodal system uses complementary visual information to mitigate the performance degradation that audio-only systems suffer in complex acoustic scenarios. On the evaluation set of the competition database, our system obtains a false reject rate of 2.15% and a false alarm rate of 3.44%, a 21% relative improvement over previous systems and a new state of the art.
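To make the score-level fusion and the two reported error rates concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the equal fusion weight, the 0.5 decision threshold, and the toy scores are all assumptions chosen for illustration. It shows how averaging audio and visual posteriors can correct errors that an audio-only detector makes.

```python
from typing import List, Sequence, Tuple


def fuse_scores(audio: Sequence[float], video: Sequence[float],
                audio_weight: float = 0.5) -> List[float]:
    """Score-level fusion as a weighted average of per-sample posteriors.

    audio_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    if len(audio) != len(video):
        raise ValueError("audio and video score lists must have equal length")
    return [audio_weight * a + (1.0 - audio_weight) * v
            for a, v in zip(audio, video)]


def frr_far(scores: Sequence[float], labels: Sequence[int],
            threshold: float = 0.5) -> Tuple[float, float]:
    """Return (false reject rate, false alarm rate) at a fixed threshold.

    labels: 1 = wake word present, 0 = wake word absent.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    # A false reject is a positive sample scored below the threshold.
    false_rejects = sum(1 for s, y in zip(scores, labels)
                        if y == 1 and s < threshold)
    # A false alarm is a negative sample scored at or above the threshold.
    false_alarms = sum(1 for s, y in zip(scores, labels)
                       if y == 0 and s >= threshold)
    frr = false_rejects / positives if positives else 0.0
    far = false_alarms / negatives if negatives else 0.0
    return frr, far


# Toy data (made up): the audio model misses sample 1 and fires on sample 2.
labels = [1, 1, 0, 0]
audio = [0.9, 0.2, 0.7, 0.1]
video = [0.8, 0.9, 0.1, 0.2]

fused = fuse_scores(audio, video)
print("audio-only (FRR, FAR):", frr_far(audio, labels))
print("fused      (FRR, FAR):", frr_far(fused, labels))
```

On this toy data the audio-only scores give one false reject and one false alarm, while the fused scores give neither. Cascaded and neural fusion, also studied in the paper, would replace `fuse_scores` with a staged decision or a learned layer, respectively.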