This paper proposes a self-learning framework to incrementally train (fine-tune) a personalized Keyword Spotting (KWS) model after deployment on ultra-low-power smart audio sensors. We address the fundamental problem of the absence of labeled training data by assigning pseudo-labels to newly recorded audio frames based on a similarity score with respect to a few user recordings. Experimenting with multiple KWS models of up to 0.5M parameters on two public datasets, we show accuracy improvements of up to +19.2% and +16.0% over the initial models pretrained on a large set of generic keywords. The labeling task is demonstrated on a sensor system composed of a low-power microphone and an energy-efficient Microcontroller (MCU). By efficiently exploiting the heterogeneous processing engines of the MCU, the always-on labeling task runs in real time at an average power cost of up to 8.2 mW. On the same platform, we estimate an on-device training energy cost 10x lower than the labeling energy when sampling a new utterance every 5 s (DS-CNN-S) or every 16.4 s (DS-CNN-M). Our empirical results pave the way to self-adaptive personalized KWS sensors at the extreme edge.
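The similarity-based pseudo-labeling described above can be sketched as follows. This is a minimal illustration, not the paper's exact method: it assumes each keyword is represented by a prototype embedding averaged over the few user enrollment recordings, uses cosine similarity as the score, and rejects frames below a hypothetical threshold; the paper's actual embedding model, score, and threshold are not specified here.

```python
import numpy as np

def pseudo_label(frame_emb, prototypes, threshold=0.7):
    """Assign a pseudo-label to a new audio frame embedding.

    prototypes: dict mapping keyword -> mean embedding of the few user
    enrollment recordings (a hypothetical setup for illustration).
    Returns the best-matching keyword, or None when no cosine
    similarity reaches the threshold (the frame is then discarded).
    """
    best_kw, best_sim = None, -1.0
    for kw, proto in prototypes.items():
        # Cosine similarity between the new frame and the keyword prototype.
        sim = float(np.dot(frame_emb, proto) /
                    (np.linalg.norm(frame_emb) * np.linalg.norm(proto)))
        if sim > best_sim:
            best_kw, best_sim = kw, sim
    return best_kw if best_sim >= threshold else None

# Toy usage: two keyword prototypes stand in for enrollment embeddings.
rng = np.random.default_rng(0)
protos = {"hey": rng.normal(size=8), "stop": rng.normal(size=8)}
frame = protos["hey"] + 0.1 * rng.normal(size=8)  # noisy "hey" frame
print(pseudo_label(frame, protos))  # prints "hey"
```

A frame accepted this way can then be added to the on-device fine-tuning buffer with its pseudo-label, while rejected frames are dropped, which keeps the incremental training set conservative.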