Keyword Spotting (KWS) models on embedded devices should adapt quickly to new user-defined words without forgetting previous ones. Because embedded devices have limited storage and computational resources, they can neither store samples nor update large models. We consider the setup of embedded online continual learning (EOCL), where KWS models with a frozen backbone are trained to incrementally recognize new words from a non-repeated stream of samples, seen one at a time. To this end, we propose Temporal Aware Pooling (TAP), which constructs an enriched feature space by computing high-order moments of the speech features extracted by a pre-trained backbone. Our method, TAP-SLDA, updates a Gaussian model for each class on the enriched feature space to make effective use of audio representations. In experimental analyses, TAP-SLDA outperforms competitors across several setups, backbones, and baselines, bringing a relative average gain of 11.3% on the GSC dataset.
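The pipeline described above can be illustrated with a minimal sketch. It assumes the frozen backbone returns a frame-level feature map of shape (T, D), that the high-order moments are the temporal mean, standard deviation, and standardized central moments, and that the per-class Gaussian model shares one covariance matrix across classes. The names `temporal_moment_pooling` and `StreamingGaussianClassifier`, the `shrinkage` parameter, and the Welford-style scatter update are all illustrative assumptions, not the exact TAP-SLDA formulation.

```python
import numpy as np


def temporal_moment_pooling(features, num_moments=3, eps=1e-8):
    """Pool a (T, D) feature map into a vector of length num_moments * D.

    Moment 1 is the temporal mean, moment 2 the temporal standard deviation,
    and moments k >= 3 are standardized central moments (skewness, ...).
    This choice of moments is an assumption made for illustration.
    """
    features = np.asarray(features, dtype=np.float64)
    mean = features.mean(axis=0)
    centered = features - mean
    std = np.sqrt((centered ** 2).mean(axis=0))
    pooled = [mean]
    if num_moments >= 2:
        pooled.append(std)
    for k in range(3, num_moments + 1):
        pooled.append(((centered / (std + eps)) ** k).mean(axis=0))
    return np.concatenate(pooled)


class StreamingGaussianClassifier:
    """Running class means plus one shared covariance (hypothetical helper).

    Only means, counts, and a (dim, dim) scatter matrix are kept, so no
    sample is stored after it has been seen once.
    """

    def __init__(self, dim, shrinkage=1e-2):
        self.dim = dim
        self.shrinkage = shrinkage
        self.means = {}   # label -> running mean of pooled features
        self.counts = {}  # label -> number of samples seen for that label
        self.scatter = np.zeros((dim, dim))  # pooled within-class scatter
        self.total = 0

    def update(self, x, label):
        """Consume one (pooled feature, label) pair from the stream."""
        x = np.asarray(x, dtype=np.float64)
        if label not in self.means:  # a new word appears in the stream
            self.means[label] = np.zeros(self.dim)
            self.counts[label] = 0
        n = self.counts[label]
        delta = x - self.means[label]
        # Welford-style update: the first sample of a class adds no scatter.
        self.scatter += (n / (n + 1)) * np.outer(delta, delta)
        self.means[label] = self.means[label] + delta / (n + 1)
        self.counts[label] = n + 1
        self.total += 1

    def predict(self, x):
        """Return the label with the highest linear discriminant score."""
        x = np.asarray(x, dtype=np.float64)
        cov = self.scatter / max(self.total, 1)
        # Shrink towards the identity so the matrix is always invertible.
        reg = (1.0 - self.shrinkage) * cov + self.shrinkage * np.eye(self.dim)
        precision = np.linalg.inv(reg)
        best_label, best_score = None, -np.inf
        for label, mean in self.means.items():
            w = precision @ mean
            score = x @ w - 0.5 * (mean @ w)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In use, each incoming utterance would be passed through the frozen backbone, pooled with `temporal_moment_pooling`, and fed once to `update`; a new word is added simply by calling `update` with a label that has not been seen before. Memory grows with the number of classes and the square of the pooled dimension, not with the number of samples.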