On-device Deep Neural Network (DNN) training has been recognized as crucial for privacy-preserving machine learning at the edge. However, the intensive training workload and limited onboard computing resources pose significant challenges to the availability and efficiency of model training. While existing works address these challenges through native resource management optimization, we instead leverage the observation that edge environments usually comprise a rich set of nearby trusted edge devices with idle resources beyond a single terminal. We propose Asteroid, a distributed edge training system that breaks the resource walls across heterogeneous edge devices for efficient model training acceleration. Asteroid adopts hybrid pipeline parallelism to orchestrate distributed training, along with judicious parallelism planning that maximizes throughput under given resource constraints. Furthermore, a fault-tolerant yet lightweight pipeline replay mechanism is developed to tame device-level dynamics, ensuring training robustness and performance stability. We implement Asteroid on heterogeneous edge devices with both vision and language models; evaluations demonstrate up to 12.2x faster training than conventional parallelism methods and up to 2.1x faster training than state-of-the-art hybrid parallelism methods. Moreover, Asteroid can recover the training pipeline 14x faster than baseline methods while preserving comparable throughput despite unexpected device exits and failures.