To segment a signal into blocks for analysis, few-shot keyword spotting (KWS) systems often use a sliding window of fixed size. Because keywords and their spoken instances vary in length, choosing the window size is difficult: a window should be long enough to contain all the information needed to recognize a keyword, but a longer window may also contain irrelevant information, such as multiple words or noise, which makes it difficult to reliably detect the onsets and offsets of keywords. We propose TACos, a novel angular margin loss for deriving two-dimensional embeddings that retain temporal properties of the underlying speech signal. In experiments conducted on KWS-DailyTalk, a few-shot KWS dataset presented in this work, using these embeddings as templates for dynamic time warping is shown to outperform using other representations or a sliding window, and using time-reversed segments of the keywords during training is shown to improve performance.
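To illustrate how a two-dimensional (time by feature) keyword embedding can serve as a template for dynamic time warping, the following is a minimal sketch and not the authors' implementation. It assumes a subsequence variant of DTW with a frame-wise cosine distance, neither of which is specified in the abstract. The function names `cosine_distance_matrix` and `subsequence_dtw` are hypothetical, and the embeddings are random stand-ins rather than outputs of a model trained with TACos.

```python
import numpy as np


def cosine_distance_matrix(template, query):
    """Pairwise cosine distances between frames of two (time, dim) arrays."""
    t = template / np.linalg.norm(template, axis=1, keepdims=True)
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    return 1.0 - t @ q.T


def subsequence_dtw(template, query):
    """Find the span of `query` that aligns best with `template`.

    Returns (cost per template frame, onset frame, offset frame),
    with both frame indices inclusive.
    """
    dist = cosine_distance_matrix(template, query)
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]  # a match may start at any query frame
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    offset = int(np.argmin(acc[-1]))  # a match may end at any query frame

    # Backtrack along the cheapest path to recover the onset frame.
    i, j = n - 1, offset
    while i > 0:
        if j == 0:
            i -= 1
            continue
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(acc[-1, offset] / n), j, offset


# Toy data (assumption): a 5-frame template embedded in unrelated frames.
rng = np.random.default_rng(0)
template = rng.normal(size=(5, 16))
query = np.concatenate(
    [rng.normal(size=(10, 16)), template, rng.normal(size=(7, 16))]
)
cost, onset, offset = subsequence_dtw(template, query)
print(f"cost={cost:.3f}, onset={onset}, offset={offset}")
```

Unlike a fixed-size sliding window, the alignment lets the matched span stretch or shrink with the spoken instance, so the onset and offset come directly from the warping path. In a detection setting, the normalized cost would typically be compared against a threshold chosen on held-out data.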