On-device tuning of deep neural networks enables long-term adaptation at the edge while preserving data privacy. However, the high computational and memory demands of backpropagation pose significant challenges for ultra-low-power, memory-constrained extreme-edge devices. These challenges are further amplified for attention-based models due to their architectural complexity and computational scale. We present TrainDeeploy, a framework that unifies efficient inference and on-device training on heterogeneous ultra-low-power Systems-on-Chip (SoCs). TrainDeeploy provides the first complete on-device training pipeline for extreme-edge SoCs that supports both Convolutional Neural Networks (CNNs) and Transformer models, together with multiple training strategies such as selective layer-wise fine-tuning and Low-Rank Adaptation (LoRA). On a RISC-V-based heterogeneous SoC, we demonstrate the first end-to-end on-device fine-tuning of a Compact Convolutional Transformer (CCT), achieving a training throughput of up to 11 images per second. We show that, compared to full backpropagation, LoRA reduces dynamic memory usage by 23%, decreases the number of trainable parameters and gradients by 15x, and reduces memory transfer volume by 1.6x. TrainDeeploy achieves up to 4.6 FLOP/cycle on CCT (0.28M parameters, 71-126M FLOPs) and up to 13.4 FLOP/cycle on Deep-AE (0.27M parameters, 0.8M FLOPs), while expanding the scope of prior frameworks to support both CNN and Transformer models with parameter-efficient tuning on extreme-edge platforms.
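The order-of-magnitude reduction in trainable parameters reported for LoRA follows from a simple count: instead of updating every entry of a frozen d x k weight matrix W, LoRA trains two low-rank factors B (d x r) and A (r x k), so only r * (d + k) parameters carry gradients. A minimal sketch of this arithmetic (the layer dimensions and rank below are hypothetical, not the CCT model's actual configuration):

```python
def full_trainable_params(d: int, k: int) -> int:
    """Full fine-tuning updates every entry of the d x k weight matrix."""
    return d * k


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """LoRA freezes W and trains a low-rank update B @ A,
    with B of shape (d, r) and A of shape (r, k):
    only r * (d + k) parameters receive gradients."""
    return r * (d + k)


# Hypothetical projection-layer dimensions for illustration only.
d, k, r = 256, 256, 8
full = full_trainable_params(d, k)      # 65536
lora = lora_trainable_params(d, k, r)   # 4096
print(f"reduction: {full / lora:.1f}x")
```

Because gradient and optimizer-state buffers scale with the trainable-parameter count, the same ratio drives the dynamic-memory and memory-transfer savings the abstract reports.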