Distributed deep learning (DDL) training systems are designed for cloud and data-center environments that assume homogeneous compute resources, high network bandwidth, sufficient memory and storage, and independent and identically distributed (IID) data across all nodes. However, these assumptions do not necessarily hold at the edge, especially when training neural networks on streaming data in an online manner. Computing at the edge suffers from both systems and statistical heterogeneity. Systems heterogeneity stems from differences in the compute resources and bandwidth of each device, while statistical heterogeneity comes from unbalanced and skewed data at the edge. With streaming data, differing streaming rates among devices are a further source of heterogeneity. If a device's stream delivers fewer samples per unit time than the training batch size requires, the device must wait until enough samples have streamed in before performing a single iteration of stochastic gradient descent (SGD). Thus, in synchronous training, devices with low-volume streams act as stragglers that slow down devices with high-volume streams. Conversely, if the streaming rate is too high and a device cannot train at line rate, data accumulates quickly in its buffer. In this paper, we introduce ScaDLES to efficiently train on streaming data at the edge in an online fashion, while also addressing the challenges of limited bandwidth and training with non-IID data. We empirically show that ScaDLES converges up to 3.29 times faster than conventional distributed SGD.
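The straggler and buffer-accumulation effects described above can be illustrated with some simple arithmetic. The sketch below is a deliberately simplified model and not the ScaDLES method: it assumes constant streaming rates, a fixed per-iteration compute time, and that data arrival overlaps with computation. All function names and numbers are hypothetical.

```python
def device_step_time(batch_size, streaming_rate, compute_time):
    """Steady-state time per SGD iteration for one device (seconds).

    Assumption: the device is limited by whichever is slower, collecting
    batch_size samples from its stream or computing on one batch.
    """
    return max(batch_size / streaming_rate, compute_time)


def synchronous_step_time(batch_size, streaming_rates, compute_time):
    """In synchronous training every device waits for the slowest one."""
    return max(
        device_step_time(batch_size, rate, compute_time)
        for rate in streaming_rates
    )


def buffer_growth(batch_size, streaming_rate, step_time):
    """Samples per second piling up in a device's buffer.

    The device consumes batch_size samples every step_time seconds;
    anything arriving faster than that accumulates.
    """
    consumed_per_second = batch_size / step_time
    return max(0.0, streaming_rate - consumed_per_second)


# Hypothetical setup: three devices streaming at 16, 64 and 256 samples/s,
# a batch size of 64, and 0.5 s of compute per iteration.
rates = [16, 64, 256]
step = synchronous_step_time(64, rates, 0.5)
growth = [buffer_growth(64, rate, step) for rate in rates]
```

Under these assumed numbers, the slowest stream needs 4 s to deliver one batch, so every device is held to a 4 s step even though the fastest could finish in 0.5 s. Meanwhile the fastest device consumes only 16 samples/s while receiving 256, so its buffer grows by 240 samples every second.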