On-device training of DNNs allows models to adapt and fine-tune to newly collected data or changing domains while deployed on microcontroller units (MCUs). However, DNN training is a resource-intensive task, and the low processor speeds, constrained throughput, limited floating-point support, and tight memory budgets of MCUs make implementing and executing training algorithms on them challenging. In this work, we explore on-device training of DNNs for Cortex-M MCUs. We present a method that enables efficient training of DNNs entirely in place on the MCU using fully quantized training (FQT) and dynamic partial gradient updates. We demonstrate the feasibility of our approach on multiple vision and time-series datasets and provide insights into the tradeoffs between training accuracy, memory overhead, energy, and latency on real hardware.
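The abstract names two techniques: fully quantized training, where weights and gradients stay in integer formats so no hardware float support is required, and dynamic partial gradient updates, where only a subset of layers is updated in a given step. The following is a minimal sketch of how such an update step could look, not the paper's actual method; the int8 weight format, int32 gradient accumulators, power-of-two learning rate, and all function and variable names are illustrative assumptions.

```c
#include <stdint.h>

/* Learning rate as a power of two (1/64 here), so the weight update is a
   shift instead of a multiply -- a common trick on integer-only MCUs.
   The value is arbitrary for this sketch. */
#define LR_SHIFT 6

/* Saturate a 32-bit value into the int8 weight range. */
static int8_t clamp_i8(int32_t v) {
    if (v > 127)  return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* Apply one integer SGD step, but only to layers flagged in `trainable`.
   Skipping frozen layers is the essence of a partial gradient update:
   no gradient buffers or update work are needed for them.
   Note: `>>` on negative int32 is an arithmetic shift on Cortex-M
   compilers, which is what this sketch assumes. */
void partial_sgd_step(int8_t *const *weights, const int32_t *const *grads,
                      const int *sizes, const uint8_t *trainable,
                      int n_layers) {
    for (int l = 0; l < n_layers; ++l) {
        if (!trainable[l]) continue;      /* frozen layer: untouched */
        for (int i = 0; i < sizes[l]; ++i) {
            int32_t w = weights[l][i];
            weights[l][i] = clamp_i8(w - (grads[l][i] >> LR_SHIFT));
        }
    }
}
```

In a real FQT scheme the gradients would additionally carry per-tensor scales from quantized backpropagation, and the `trainable` mask would be chosen dynamically (e.g. per step or per epoch) to trade accuracy against memory and latency, as the abstract describes.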