Keyword spotting systems continuously process audio streams to detect keywords. One of the most challenging tasks in designing such systems is reducing false alarms (FAs), which occur when the system registers a keyword that was never uttered. In this paper, we propose a simple yet elegant solution to this problem that follows from the law of total probability. We show that existing deep keyword spotting mechanisms can be improved by Successive Refinement, in which the system first classifies whether the input audio is speech, then whether the input is keyword-like, and finally which keyword was uttered. We show that, across multiple models ranging in size from 13K to 2.41M parameters, the successive refinement technique reduces FAs by up to a factor of 8 on in-domain held-out FA data and by up to a factor of 7 on out-of-domain (OOD) FA data. Further, our proposed approach is "plug-and-play" and can be applied to any deep keyword spotting model.
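As a rough illustration of the three-stage idea, the sketch below chains the stages as a product of probabilities: P(speech), then P(keyword-like given speech), then P(keyword k given keyword-like). The abstract does not state the exact factorization, so this product form is an assumption, as are the function names `successive_refinement` and `detect`, the use of sigmoid and softmax heads, and the 0.5 threshold; none of these are taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def successive_refinement(speech_logit, keyword_like_logit, keyword_logits):
    """Combine three hypothetical model heads into per-keyword scores.

    Assumed factorization (not confirmed by the source):
        score[k] = P(speech) * P(keyword-like | speech) * P(k | keyword-like)
    """
    p_speech = sigmoid(speech_logit)
    p_keyword_like = sigmoid(keyword_like_logit)
    p_which = softmax(keyword_logits)
    return [p_speech * p_keyword_like * p for p in p_which]


def detect(scores, threshold=0.5):
    """Return the index of the top-scoring keyword, or None if below threshold."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


# Non-speech input: the keyword head is confident, but the speech gate is not.
noise_scores = successive_refinement(-4.0, 3.0, [5.0, 0.0, 0.0])
# Clear keyword utterance: all three stages agree.
keyword_scores = successive_refinement(4.0, 4.0, [5.0, 0.0, 0.0])
```

In this sketch, a confident keyword head alone is not enough to trigger a detection: a low speech or keyword-like probability scales every keyword score down, which is the intuition behind the reduction in false alarms.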