This paper proposes a self-learning method to incrementally train (fine-tune) a personalized Keyword Spotting (KWS) model after deployment on ultra-low-power smart audio sensors. We address the fundamental problem of the absence of labeled training data by assigning pseudo-labels to newly recorded audio frames based on a similarity score with respect to a few user recordings. Experimenting with multiple KWS models of up to 0.5M parameters on two public datasets, we show accuracy improvements of up to +19.2% and +16.0% over the initial models pretrained on a large set of generic keywords. The labeling task is demonstrated on a sensor system composed of a low-power microphone and an energy-efficient Microcontroller (MCU). By efficiently exploiting the heterogeneous processing engines of the MCU, the always-on labeling task runs in real time with an average power cost of up to 8.2 mW. On the same platform, we estimate the energy cost of on-device training to be 10x lower than the labeling energy when a new utterance is sampled every 6.1 s (DS-CNN-S) or 18.8 s (DS-CNN-M). Our empirical results pave the way to self-adaptive personalized KWS sensors at the extreme edge.
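As a rough illustration of the pseudo-labeling step described above, the sketch below assigns a keyword pseudo-label to a new audio frame by comparing its embedding against embeddings of the few user recordings. The cosine metric, the per-keyword max over enrollment utterances, and the acceptance threshold are illustrative assumptions; the abstract does not specify the exact similarity score used.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def assign_pseudo_label(frame_emb: np.ndarray,
                        enrollment_embs: dict[str, list[np.ndarray]],
                        threshold: float = 0.7) -> str | None:
    """Assign a keyword pseudo-label to a new audio frame embedding.

    enrollment_embs maps each keyword to the embeddings computed from the
    few user recordings (hypothetical structure for illustration).
    Returns the best-matching keyword, or None if no score clears the
    threshold, in which case the frame is discarded rather than mislabeled.
    """
    best_kw, best_score = None, -1.0
    for kw, embs in enrollment_embs.items():
        # Score against each user recording of this keyword; keep the best.
        score = max(cosine_similarity(frame_emb, e) for e in embs)
        if score > best_score:
            best_kw, best_score = kw, score
    return best_kw if best_score >= threshold else None
```

In an on-device setting, `frame_emb` would come from the KWS model's feature extractor running on the MCU, and accepted (frame, pseudo-label) pairs would be queued for the incremental fine-tuning step.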