In this paper, we aim to improve the robustness of keyword spotting (KWS) systems in noisy environments while keeping a small memory footprint. We propose a new convolutional neural network (CNN), FCA-Net, which combines mixer unit-based feature interaction with a two-dimensional convolution-based attention module. First, we introduce and compare lightweight attention methods for improving the noise robustness of CNNs. We then propose an attention module that generates fine-grained attention weights capturing channel- and frequency-specific information, which improves the model's ability to handle noisy conditions. Combining the mixer unit-based feature interaction with this attention module further improves performance. In addition, we adopt a curriculum-based multi-condition training strategy. Our experiments show that the proposed system outperforms current state-of-the-art small-footprint KWS solutions in noisy environments, making it more reliable for real-world use.
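To make the idea of channel- and frequency-specific attention weights more concrete, the following is a minimal NumPy sketch and not the FCA-Net implementation. It assumes a feature map of shape (channels, frequency bins, time frames), averages over time, applies a single two-dimensional convolution over the resulting channel-frequency plane, and squashes the result with a sigmoid. The function names `conv2d_same` and `channel_frequency_attention`, the time-averaging step, the 3x3 kernel, and the random inputs are all illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Single-channel 2D cross-correlation with zero padding.

    Assumes an odd-sized kernel so the output has the same shape as x.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw)))  # zero padding by default
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


def channel_frequency_attention(features, kernel):
    """Reweight a (channels, freq_bins, time_frames) feature map.

    Hypothetical design: one attention weight per (channel, frequency)
    pair, shared across all time frames.
    """
    descriptor = features.mean(axis=2)           # (C, F): average over time
    logits = conv2d_same(descriptor, kernel)     # mix neighbouring channels/bins
    weights = 1.0 / (1.0 + np.exp(-logits))      # sigmoid, values in (0, 1)
    return features * weights[:, :, None], weights


# Illustrative random data; in a trained model the kernel would be learned.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 10, 25))
kernel = rng.standard_normal((3, 3)) * 0.1
attended, weights = channel_frequency_attention(features, kernel)
print(attended.shape, weights.shape)
```

Under these assumptions the attention map has one weight per channel-frequency pair, which is finer-grained than a purely channel-wise gate, while the only parameters are those of the small convolution kernel.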