Real-world complex acoustic environments, especially those with a low signal-to-noise ratio (SNR), pose tremendous challenges to a keyword spotting (KWS) system. Inspired by recent advances in neural speech enhancement and in context bias for speech recognition, we propose a robust DCCRN-KWS model based on audio context bias to address this challenge. We form the whole architecture as a multi-task learning framework for both denoising and keyword spotting, where the DCCRN encoder is connected to the KWS model. Aided by the denoising task, we further introduce an audio context bias module that leverages real keyword samples and biases the network to better discriminate keywords in noisy conditions. Feature merge and complex context linear modules are also introduced to strengthen such discrimination and to effectively leverage contextual information, respectively. Experiments on a challenging internal dataset and the HIMIYA public dataset show that our DCCRN-KWS system achieves superior performance, while an ablation study supports the design of the whole model.
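As an illustration of the multi-task formulation, the joint objective for denoising and keyword spotting can be sketched as a weighted sum of two losses. This is a minimal sketch under stated assumptions, not the training objective of the paper: the abstract does not specify the loss functions or their weighting, so binary cross-entropy for the KWS branch, mean squared error as a stand-in for the denoising branch, the weight `alpha`, and all function names are hypothetical.

```python
import math


def binary_cross_entropy(logit, label):
    """Numerically stable binary cross-entropy for one logit and a 0/1 label."""
    # Equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mean_squared_error(estimate, target):
    """Mean squared error between two equal-length sequences of samples."""
    if len(estimate) != len(target) or len(target) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def multitask_loss(kws_logit, is_keyword, enhanced, clean, alpha=0.5):
    """Weighted sum of a KWS loss and a denoising loss.

    Assumptions (not from the abstract): BCE for keyword detection,
    MSE as a stand-in denoising loss, and a scalar weight `alpha`.
    """
    kws_term = binary_cross_entropy(kws_logit, is_keyword)
    denoise_term = mean_squared_error(enhanced, clean)
    return kws_term + alpha * denoise_term


if __name__ == "__main__":
    # Toy values: one utterance-level logit and a two-sample "waveform".
    loss = multitask_loss(0.0, 1, [1.0, 3.0], [1.0, 1.0], alpha=0.5)
    print(f"joint loss: {loss:.4f}")
```

In such a setup, both terms would backpropagate through the shared encoder, so the denoising task acts as an auxiliary signal for the keyword spotting branch; the appropriate value of `alpha` would have to be tuned.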