Synchronous federated learning scales poorly due to the straggler effect. Asynchronous algorithms increase the update throughput by processing updates upon arrival, but they introduce two fundamental challenges: gradient staleness, which degrades convergence, and bias toward faster clients under heterogeneous data distributions. Although algorithms such as AsyncSGD and Generalized AsyncSGD mitigate this bias via client-side task queues, most existing analyses neglect the underlying queueing dynamics and lack closed-form characterizations of the update throughput and gradient staleness. To close this gap, we develop a stochastic queueing-network framework for Generalized AsyncSGD that jointly models random computation times at the clients and the central server, as well as random uplink and downlink communication delays. Leveraging product-form network theory, we derive a closed-form expression for the update throughput, alongside closed-form upper bounds for both the communication round complexity and the expected wall-clock time required to reach an $\varepsilon$-stationary point. These results formally characterize the trade-off between gradient staleness and wall-clock convergence speed. We further extend the framework to quantify energy consumption under stochastic timing, revealing an additional trade-off between convergence speed and energy efficiency. Building on these analytical results, we propose gradient-based optimization strategies to jointly optimize routing and concurrency. Experiments on EMNIST demonstrate reductions of 29%--46% in convergence time and 36%--49% in energy consumption compared to AsyncSGD.
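To illustrate why product-form theory can yield a closed-form throughput, the sketch below computes the throughput of a generic closed Gordon-Newell network with Buzen's convolution algorithm. It is not the model from this work. It assumes single-server exponential stations, a fixed number of circulating tasks (the concurrency), and throughput measured at a reference station whose visit ratio is 1. The function name `closed_network_throughput` and its parameters are hypothetical and chosen for illustration only.

```python
def closed_network_throughput(visit_ratios, service_rates, num_tasks):
    """Throughput of a closed product-form network (illustrative sketch).

    Assumptions (not taken from the abstract): every station is a
    single-server exponential queue, `num_tasks` tasks circulate forever,
    and throughput is measured at a station with visit ratio 1.
    """
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    if len(visit_ratios) != len(service_rates):
        raise ValueError("one service rate is required per station")

    # Relative service demand of each station: visit ratio / service rate.
    demands = [v / mu for v, mu in zip(visit_ratios, service_rates)]

    # Buzen's convolution: g[n] becomes the normalization constant G(n)
    # once every station has been folded in.
    g = [1.0] + [0.0] * num_tasks
    for d in demands:
        for n in range(1, num_tasks + 1):
            g[n] += d * g[n - 1]

    # Product-form result: X(N) = G(N - 1) / G(N).
    return g[num_tasks - 1] / g[num_tasks]


if __name__ == "__main__":
    # Hypothetical three-station cycle with equal visit ratios.
    rates = [2.0, 5.0, 1.0]
    for concurrency in (1, 2, 4, 8):
        x = closed_network_throughput([1.0, 1.0, 1.0], rates, concurrency)
        print(f"concurrency={concurrency}: throughput={x:.4f}")
```

In this toy setting, throughput grows with concurrency but saturates at the rate of the slowest station, while more tasks in flight also mean more queueing delay per task. That is the same qualitative tension between throughput and staleness that the abstract describes.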