Driven by the ever-growing volume and decentralized nature of data, coupled with the escalating size of modern models, distributed deep learning (DDL) has become entrenched as the preferred paradigm for training. However, frequent synchronization of DL models, which encompass millions to many billions of parameters, creates a communication bottleneck that severely hinders scalability. Worse yet, by relying on overly simplistic, periodic, and rigid synchronization schedules, DDL algorithms typically waste valuable bandwidth and become less practical in bandwidth-constrained federated settings. To address these shortcomings, we propose Federated Dynamic Averaging (FDA), a communication-efficient DDL strategy that dynamically triggers synchronization based on the value of the model variance. Through extensive experiments across a wide range of learning tasks, we demonstrate that FDA reduces communication cost by orders of magnitude compared to both traditional and cutting-edge communication-efficient algorithms. Remarkably, FDA achieves this without sacrificing convergence speed, in stark contrast to the trade-offs commonly encountered in the field. Additionally, we show that FDA maintains robust performance across diverse data heterogeneity settings.
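To make the core idea concrete, the following is a minimal sketch, assuming PyTorch, of synchronization triggered by the variance of the workers' models. The function name `maybe_synchronize`, the `threshold` parameter, and the centralized variance computation are illustrative assumptions for exposition only; the actual FDA strategy monitors the model variance with limited communication rather than by gathering all local models.

```python
import torch

# Hypothetical sketch (not the paper's exact FDA rule): average the local
# models only when the variance among them exceeds a given threshold.
def maybe_synchronize(local_models, threshold):
    """Synchronize the workers' models if their variance exceeds `threshold`."""
    # Stack each worker's flattened parameters: shape (num_workers, num_params).
    stacked = torch.stack([
        torch.nn.utils.parameters_to_vector(m.parameters()) for m in local_models
    ])
    mean = stacked.mean(dim=0)
    # Model variance: average squared distance of the local models from their mean.
    variance = ((stacked - mean) ** 2).sum(dim=1).mean()
    if variance > threshold:
        # Synchronize: overwrite every local model with the averaged model.
        for m in local_models:
            torch.nn.utils.vector_to_parameters(mean.clone(), m.parameters())
        return True   # a (costly) communication round took place
    return False      # communication skipped for this step
```

In this sketch, synchronization frequency adapts to how far the local models have drifted apart, in contrast to a fixed periodic schedule that communicates every K local steps regardless of drift.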