On-device training is essential for user personalisation and privacy. With the pervasiveness of IoT devices and microcontroller units (MCUs), this task becomes more challenging due to constrained memory and compute resources and the limited availability of labelled user data. However, prior work either neglects the data scarcity issue, requires excessively long training time (e.g. a few hours), or incurs substantial accuracy loss ($\geq$10\%). We propose TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity. TinyTrain introduces a task-adaptive sparse-update method that dynamically selects which layers and channels to update based on a multi-objective criterion that jointly captures the user data and the memory and compute capabilities of the target device, leading to high accuracy on unseen tasks with reduced computation and memory footprint. TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6--5.0\% in accuracy, while reducing the backward-pass memory and computation cost by up to 2,286$\times$ and 7.68$\times$, respectively. Targeting broadly used real-world edge devices, TinyTrain achieves 9.5$\times$ faster and 3.5$\times$ more energy-efficient training than status-quo approaches, and a 2.8$\times$ smaller memory footprint than state-of-the-art (SOTA) approaches, while remaining within the 1 MB memory envelope of MCU-grade platforms.
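To make the idea of a budget-aware sparse update concrete, the following is a minimal sketch, not TinyTrain's actual criterion or implementation. It assumes each layer already has a hypothetical importance value estimated from user data, plus hypothetical backward-pass memory and compute costs, and it greedily picks the layers with the best importance-to-cost ratio that still fit both device budgets. The names `LayerStats` and `select_layers`, the scoring formula, and the greedy strategy are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerStats:
    """Hypothetical per-layer statistics (not taken from the paper)."""
    name: str
    importance: float  # assumed importance estimated from user data
    memory_kb: float   # assumed backward-pass memory cost, must be > 0
    macs_m: float      # assumed backward-pass compute cost (millions of MACs), must be > 0


def select_layers(layers, memory_budget_kb, macs_budget_m):
    """Greedily choose layers to update under memory and compute budgets.

    Illustrative scoring: importance divided by the product of the
    normalised memory and compute costs, so cheap but useful layers rank first.
    """
    if not layers:
        return []

    max_mem = max(layer.memory_kb for layer in layers)
    max_macs = max(layer.macs_m for layer in layers)

    def score(layer):
        cost = (layer.memory_kb / max_mem) * (layer.macs_m / max_macs)
        return layer.importance / cost

    chosen, used_mem, used_macs = [], 0.0, 0.0
    for layer in sorted(layers, key=score, reverse=True):
        fits_memory = used_mem + layer.memory_kb <= memory_budget_kb
        fits_compute = used_macs + layer.macs_m <= macs_budget_m
        if fits_memory and fits_compute:
            chosen.append(layer.name)
            used_mem += layer.memory_kb
            used_macs += layer.macs_m
    return chosen
```

Called with a list of `LayerStats` and the two device budgets, `select_layers` returns the names of the layers to update in descending score order; all remaining layers would stay frozen. A per-channel variant could apply the same ranking within each selected layer, again only as an assumption about how such a scheme might be organised.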