Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technology revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power Micro-Controller Units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic by exploiting a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy consumption of 0.64 mJ, a latency of 0.54 ms per inference, a super-linear speedup of 26.1×, and an Energy Delay Product (EDP) improvement of 27.2×, compared to a single-chip system. On MobileBERT, the distributed system's runtime is 38.8 ms, with a super-linear 4.7× speedup when using 4 MCUs compared to a single-chip system.