Customizable keyword spotting (KWS) in continuous speech has attracted increasing attention due to its potential for real-world applications. While contrastive learning (CL) has been widely used to extract keyword representations, previous CL approaches all operate on pre-segmented isolated words and employ only an audio-text representation matching strategy. However, for KWS in continuous speech, co-articulation and streaming word segmentation can easily yield similar audio patterns for different texts, which may in turn trigger false alarms. To address this issue, we propose a novel CL with Audio Discrimination (CLAD) approach that learns keyword representations with both audio-text matching and audio-audio discrimination ability. During training, an InfoNCE loss that considers both audio-audio and audio-text CL data pairs is applied to each sliding window. Evaluations on the open-source LibriPhrase dataset show that the sliding-window-level InfoNCE loss yields performance comparable to previous CL approaches. Furthermore, experiments on the continuous speech dataset LibriSpeech demonstrate that, by incorporating audio discrimination, CLAD achieves a significant performance gain over CL without audio discrimination. Meanwhile, compared to two-stage KWS approaches, end-to-end KWS with CLAD achieves not only better performance but also a significant speed-up.
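The per-window loss described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the cosine similarity, the temperature value of 0.1, the use of a single positive per term, and the plain sum of the audio-text and audio-audio terms are all assumptions, and the names `info_nce` and `window_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax score of the positive.

    Assumption: similarities are cosine scores scaled by a temperature.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


def window_loss(window_audio, text_pos, text_negs, audio_pos, audio_negs,
                temperature=0.1):
    """Loss for one sliding window (hypothetical combination).

    Assumption: the audio-text matching term and the audio-audio
    discrimination term are simply added with equal weight.
    """
    audio_text = info_nce(window_audio, text_pos, text_negs, temperature)
    audio_audio = info_nce(window_audio, audio_pos, audio_negs, temperature)
    return audio_text + audio_audio


# Toy 2-D embeddings: the window matches its keyword text and a second
# audio example of the same keyword, and differs from the negatives.
loss = window_loss(
    window_audio=[1.0, 0.0],
    text_pos=[1.0, 0.0], text_negs=[[0.0, 1.0]],
    audio_pos=[0.9, 0.1], audio_negs=[[0.1, 0.9]],
)
print(loss)
```

In this sketch the audio-audio term is what penalizes a window whose embedding sits close to audio of a different text, which is the false-alarm case the abstract attributes to co-articulation and streaming segmentation.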