This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, combining on-device execution with cloud offloading in a novel way. We exploit temporal locality in the speech inputs to a device and reuse recent SLU inference results accordingly. The idea is simple: the device matches each incoming input against cached results and offloads to the cloud, for full inference, only those inputs that match no cached entry. Realizing this idea, however, is non-trivial: the device must compare acoustic features robustly yet at low cost. To this end, we present SpeechCache (SC), a speech cache for tiny devices. It matches speech inputs at two levels of representation: first as sequences of clustered raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary tradeoffs between cost and efficiency. To boost accuracy further, the cache learns to personalize: using the inputs that missed the cache and were therefore offloaded, it continuously finetunes the device's feature extractors with the assistance of the cloud. We implement SC on an off-the-shelf STM32 microcontroller; the complete implementation has a small memory footprint of 2 MB. Evaluated on challenging speech benchmarks, our system resolves 45% to 90% of inputs on device, reducing average latency by up to 80% compared with offloading to popular cloud speech recognition services. The benefit of SC remains notable even in adversarial settings: noisy environments, a cold cache, or one device shared by a number of users.
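The two-level lookup with cloud fallback described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how sequences are compared, so normalized edit distance is an assumption made here, and the names `Entry`, `TwoLevelCache`, `resolve`, and the threshold values are hypothetical. The sketch also takes both representations as ready-made integer ID sequences, whereas a real device would derive them from audio and would likely compute phonemes only after a miss at the first level.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Edit distance scaled to [0, 1] by the longer sequence length."""
    return edit_distance(a, b) / max(len(a), len(b), 1)


@dataclass
class Entry:
    units: List[int]      # level 1: clustered sound-unit IDs
    phonemes: List[int]   # level 2: phoneme IDs
    result: str           # cached SLU inference result


@dataclass
class TwoLevelCache:
    # Hypothetical cutoffs; not values taken from the paper.
    unit_threshold: float = 0.2
    phoneme_threshold: float = 0.2
    entries: List[Entry] = field(default_factory=list)

    def lookup(self, units: Sequence[int],
               phonemes: Sequence[int]) -> Optional[Tuple[str, str]]:
        """Try the sound-unit level first, then the phoneme level."""
        for level, query, threshold in (
            ("units", units, self.unit_threshold),
            ("phonemes", phonemes, self.phoneme_threshold),
        ):
            best = min(
                self.entries,
                key=lambda e: normalized_distance(query, getattr(e, level)),
                default=None,
            )
            if best is not None and \
                    normalized_distance(query, getattr(best, level)) <= threshold:
                return best.result, level
        return None


def resolve(cache: TwoLevelCache, units: Sequence[int],
            phonemes: Sequence[int],
            offload: Callable[[], str]) -> Tuple[str, str]:
    """Return (result, source); offload to the cloud only on a cache miss."""
    hit = cache.lookup(units, phonemes)
    if hit is not None:
        return hit
    result = offload()  # stand-in for full cloud inference
    cache.entries.append(Entry(list(units), list(phonemes), result))
    return result, "cloud"
```

In this sketch, the first occurrence of an input is offloaded and then cached; a later input whose unit sequence is close enough resolves at the first level, and one whose units differ but whose phonemes match resolves at the second.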