We introduce EdgeSpot, an efficient few-shot keyword spotting model for edge devices that pairs an optimized BC-ResNet-based acoustic backbone with a trainable Per-Channel Energy Normalization (PCEN) frontend and lightweight temporal self-attention. Training uses knowledge distillation from a self-supervised teacher model, optimized with Sub-center ArcFace loss. We show that EdgeSpot consistently achieves higher accuracy at a fixed false-alarm rate (FAR) than strong BC-ResNet baselines. The largest variant, EdgeSpot-4, improves 10-shot accuracy at 1% FAR from 73.7% to 82.0% while requiring only 29.4M MACs and 128k parameters.
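To illustrate the kind of frontend the abstract refers to, the following is a minimal NumPy sketch of the standard Per-Channel Energy Normalization forward pass: a first-order smoother per channel, followed by gain normalization and root compression. It is not EdgeSpot's implementation. The function name `pcen`, the fixed parameter values, and the choice to initialize the smoother with the first frame are all assumptions made for illustration; in a trainable frontend, `s`, `alpha`, `delta`, and `r` would be learned (typically per channel) rather than fixed.

```python
import numpy as np


def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to non-negative filterbank energies of shape (frames, channels).

    All default values are illustrative assumptions, not taken from this work.
    """
    energy = np.asarray(energy, dtype=np.float64)
    smoothed = np.empty_like(energy)
    # Assumption: start the smoother from the first frame.
    smoothed[0] = energy[0]
    # First-order IIR smoother over time, run independently for each channel.
    for t in range(1, energy.shape[0]):
        smoothed[t] = (1.0 - s) * smoothed[t - 1] + s * energy[t]
    # Adaptive gain control: divide each energy by its smoothed version.
    gain_normalised = energy / (eps + smoothed) ** alpha
    # Root compression with an offset, shifted so that silence maps to zero.
    return (gain_normalised + delta) ** r - delta ** r


# Usage on random stand-in data: 100 frames of 40 filterbank channels.
rng = np.random.default_rng(0)
features = pcen(rng.random((100, 40)))
```

The output has the same shape as the input, so a block like this can sit in front of a convolutional backbone in place of the usual log compression.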