Real-world acoustic environments, especially those with a low signal-to-noise ratio (SNR), pose significant challenges for keyword spotting (KWS) systems. Inspired by recent advances in neural speech enhancement and context biasing in speech recognition, we propose a robust DCCRN-KWS model based on audio context bias to address this challenge. We formulate the whole architecture as a multi-task learning framework for both denoising and keyword spotting, in which the DCCRN encoder is connected to the KWS model. Aided by the denoising task, we further introduce an audio context bias module that leverages real keyword samples to bias the network toward better discriminating keywords in noisy conditions. Feature merge and complex context linear modules are also introduced, to strengthen this discrimination and to effectively leverage contextual information, respectively. Experiments on a challenging internal dataset and the public HIMIYA dataset show that our DCCRN-KWS system achieves superior performance, while an ablation study validates the design of the whole model.
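As an illustration of the multi-task formulation, the joint objective can be sketched as a weighted sum of a keyword-spotting loss and a denoising loss. The abstract does not specify the loss functions or their weighting, so everything below is an assumption made for illustration: binary cross-entropy stands in for the KWS loss, mean squared error stands in for the denoising loss, and the names `multitask_loss` and `denoise_weight` are hypothetical.

```python
import math


def binary_cross_entropy(p_keyword, label, eps=1e-7):
    """KWS loss stand-in (assumed): BCE on the predicted keyword probability."""
    # Clamp the probability so that log() never receives 0.
    p = min(max(p_keyword, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def mean_squared_error(enhanced, clean):
    """Denoising loss stand-in (assumed): MSE between enhanced and clean features."""
    if len(enhanced) != len(clean):
        raise ValueError("enhanced and clean must have the same length")
    return sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)


def multitask_loss(p_keyword, label, enhanced, clean, denoise_weight=0.5):
    """Weighted sum of the two task losses.

    denoise_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    kws_loss = binary_cross_entropy(p_keyword, label)
    denoise_loss = mean_squared_error(enhanced, clean)
    return kws_loss + denoise_weight * denoise_loss


if __name__ == "__main__":
    # Toy values: one positive keyword example and a two-element feature vector.
    print(multitask_loss(0.5, 1, [1.0, 0.0], [0.0, 0.0]))
```

In a sketch like this, the denoising term acts as an auxiliary signal on the shared encoder, while the KWS term drives the final detection decision.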