Connectionist Temporal Classification (CTC), a non-autoregressive training criterion, is widely used in online keyword spotting (KWS). However, existing CTC-based KWS decoding strategies rely either on Automatic Speech Recognition (ASR) decoding, which performs suboptimally because it searches broadly over the acoustic space without keyword-specific optimization, or on KWS-specific decoding graphs, which are complex to implement and maintain. In this work, we propose a streaming decoding algorithm enhanced by Cross-layer Discrimination Consistency (CDC), tailored for CTC-based KWS. Specifically, we introduce a streamlined yet effective decoding algorithm that can detect a keyword starting at an arbitrary position in the audio stream. Furthermore, we leverage discrimination consistency information across layers to better distinguish positive samples from false alarms. Our experiments on both clean and noisy Hey Snips datasets show that the proposed streaming decoding strategy outperforms ASR-based and graph-based KWS baselines. The CDC-boosted decoding further improves performance, yielding an average absolute recall improvement of 6.8% and a 46.3% relative reduction in the miss rate compared to the graph-based KWS baseline, at a very low false alarm rate of 0.05 per hour.
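To make the idea of detecting a keyword that starts at an arbitrary position more concrete, the following is a minimal sketch of a frame-synchronous, Viterbi-style CTC keyword scorer. It is not the decoding algorithm proposed in this work, whose details are not given in the abstract, and it does not implement CDC. The class `StreamingKeywordScorer`, the helpers `detect` and `toy_frames`, the blank index `0`, and the threshold `0.5` are all hypothetical choices made for illustration. At every frame a fresh hypothesis may enter the first two states of the blank-extended keyword sequence, so no fixed start frame is required.

```python
import math

NEG_INF = float("-inf")
BLANK = 0  # assumption: index 0 is the CTC blank symbol


class StreamingKeywordScorer:
    """Best-path (max) CTC scorer for one keyword, updated frame by frame."""

    def __init__(self, keyword_tokens):
        if not keyword_tokens:
            raise ValueError("keyword_tokens must be non-empty")
        # Blank-extended state sequence: blank, y1, blank, y2, ..., yL, blank.
        self.states = [BLANK]
        for tok in keyword_tokens:
            self.states += [tok, BLANK]
        self.score = [NEG_INF] * len(self.states)  # best log-prob per state
        self.start = [-1] * len(self.states)       # start frame of that path
        self.t = 0                                 # index of the next frame

    def step(self, log_probs):
        """Consume one frame of log-posteriors; return (log_score, start, end)."""
        z = self.states
        new_score = [NEG_INF] * len(z)
        new_start = [-1] * len(z)
        for s, tok in enumerate(z):
            cands = [(self.score[s], self.start[s])]           # stay in state
            if s >= 1:
                cands.append((self.score[s - 1], self.start[s - 1]))
            if s >= 2 and tok != BLANK and tok != z[s - 2]:
                cands.append((self.score[s - 2], self.start[s - 2]))  # skip blank
            if s <= 1:
                cands.append((0.0, self.t))  # free start: new path begins here
            best, begin = max(cands, key=lambda c: c[0])
            new_score[s] = best + log_probs[tok]
            new_start[s] = begin
        self.score, self.start = new_score, new_start
        # A complete keyword path ends in the last token or the trailing blank.
        last = len(z) - 1
        end_state = last if new_score[last] >= new_score[last - 1] else last - 1
        frame = self.t
        self.t += 1
        return new_score[end_state], new_start[end_state], frame


def detect(frames, keyword_tokens, threshold=0.5):
    """Return (start, end, path_prob) of the first detection, or None."""
    scorer = StreamingKeywordScorer(keyword_tokens)
    for log_probs in frames:
        log_score, start, end = scorer.step(log_probs)
        if math.exp(log_score) >= threshold:
            return start, end, math.exp(log_score)
    return None


def toy_frames():
    """Hand-made posteriors over [blank, tok1, tok2, tok3] for 7 frames."""
    blank = [0.90, 0.03, 0.03, 0.04]
    probs = [
        blank,
        [0.10, 0.05, 0.05, 0.80],  # unrelated token 3
        blank,
        [0.05, 0.90, 0.03, 0.02],  # keyword token 1
        blank,
        [0.05, 0.03, 0.90, 0.02],  # keyword token 2
        blank,
    ]
    return [[math.log(p) for p in frame] for frame in probs]


if __name__ == "__main__":
    print(detect(toy_frames(), [1, 2]))
    print(detect(toy_frames(), [2, 1]))
```

On the toy input, the keyword `[1, 2]` is flagged from frame 3 to frame 5, while the reversed order `[2, 1]` never reaches the threshold. Because the sketch maximizes an unnormalized path probability, it favors short paths and tends to report a late start frame. A practical system would likely normalize the score and add a second verification stage, which is the role CDC plays in the proposed method.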