Few-shot keyword spotting (KWS) systems often use a fixed-size sliding window to segment a signal into blocks for analysis. Because keywords and their spoken instances vary in length, choosing the window size is difficult: a window should be long enough to contain all the information needed to recognize a keyword, but a longer window may contain irrelevant information such as multiple words or noise, making it difficult to reliably detect keyword onsets and offsets. We propose TACos, a novel angular margin loss for deriving two-dimensional embeddings that retain temporal properties of the underlying speech signal. In experiments on KWS-DailyTalk, a few-shot KWS dataset presented in this work, using these embeddings as templates for dynamic time warping outperforms using other representations or a sliding window, and using time-reversed segments of the keywords during training improves performance.
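To make the template-matching idea concrete, the following is a minimal sketch of subsequence dynamic time warping over frame-wise embeddings. It is not the implementation used in this work: the function names, the cosine distance, the three-direction step pattern, and the normalization by template length are all assumptions chosen for illustration, and the embeddings are taken to be arrays of shape (frames, dimensions).

```python
import numpy as np


def cosine_distance_matrix(template, query):
    """Pairwise cosine distances; rows index template frames, columns query frames."""
    t = template / np.linalg.norm(template, axis=1, keepdims=True)
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    return 1.0 - t @ q.T


def subsequence_dtw(template, query):
    """Find the span of `query` that best aligns with `template`.

    Returns (cost, start, end): the accumulated distance divided by the
    template length (an illustrative normalization), and the inclusive
    frame indices of the matched span in `query`.
    """
    dist = cosine_distance_matrix(template, query)
    n, m = dist.shape
    acc = np.full((n, m), np.inf)        # accumulated alignment cost
    start = np.zeros((n, m), dtype=int)  # query frame where each path began

    # The match may begin at any query frame without penalty.
    acc[0] = dist[0]
    start[0] = np.arange(m)

    for i in range(1, n):
        for j in range(m):
            # Allowed predecessors: vertical, horizontal, diagonal.
            candidates = [(acc[i - 1, j], start[i - 1, j])]
            if j > 0:
                candidates.append((acc[i, j - 1], start[i, j - 1]))
                candidates.append((acc[i - 1, j - 1], start[i - 1, j - 1]))
            best_cost, best_start = min(candidates, key=lambda c: c[0])
            acc[i, j] = dist[i, j] + best_cost
            start[i, j] = best_start

    # The match may also end at any query frame.
    end = int(np.argmin(acc[-1]))
    return float(acc[-1, end]) / n, int(start[-1, end]), end
```

Because the alignment path is free to start and end anywhere in the query, the matched span adapts to the length of each spoken instance, and its two endpoints serve as onset and offset estimates. In a detector built on this sketch, a keyword would be reported when the returned cost falls below a threshold, which would have to be tuned on held-out data.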