Real-time machine learning (ML) has recently attracted significant interest due to its potential to support instantaneous learning, adaptation, and decision making in a wide range of application domains, including self-driving vehicles, intelligent transportation, and industrial automation. We investigate real-time ML in a federated edge intelligence (FEI) system, an edge computing system that implements federated learning (FL) solutions based on data samples collected and uploaded from decentralized data networks. FEI systems often exhibit heterogeneous distributions of communication and computational resources, as well as non-i.i.d. data samples, resulting in long model training times and inefficient resource utilization. Motivated by this observation, we propose a time-sensitive federated learning (TS-FL) framework to minimize the overall run-time for collaboratively training a shared ML model. Training acceleration solutions are investigated for TS-FL with both synchronous coordination (TS-FL-SC) and asynchronous coordination (TS-FL-ASC). To address the straggler effect in TS-FL-SC, we develop an analytical solution to characterize the impact of selecting different subsets of edge servers on the overall model training time. A server dropping-based solution is proposed that allows slow edge servers to be removed from model training if their impact on the resulting model accuracy is limited. A joint optimization algorithm is proposed to minimize the overall time consumption of model training by jointly selecting the participating edge servers and the number of local epochs. For TS-FL-ASC, we develop an analytical expression to characterize the impact of the staleness effect of asynchronous coordination and the straggler effect of FL on the time consumption. Experimental results show that TS-FL-SC and TS-FL-ASC can reduce the overall model training time by up to 63% and 28%, respectively.
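As a rough illustration of the straggler effect under synchronous coordination, the sketch below models the time of one training round as the time of the slowest participating edge server, and shows how dropping a slow server shortens the round. This is a toy model and not the joint optimization algorithm of the paper: the `EdgeServer` fields, the timing values, and the `drop_slowest` rule are all hypothetical, and the impact of dropping a server on model accuracy, which the proposed solution takes into account, is ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeServer:
    name: str
    epoch_time_s: float   # hypothetical time for one local epoch
    upload_time_s: float  # hypothetical time to upload a model update


def server_time(server, local_epochs):
    """Time one server needs to finish a round: local training plus upload."""
    return server.epoch_time_s * local_epochs + server.upload_time_s


def round_time(servers, local_epochs):
    """Synchronous round time: the coordinator waits for the slowest server."""
    return max(server_time(s, local_epochs) for s in servers)


def drop_slowest(servers, local_epochs, num_to_drop):
    """Toy selection rule: keep all but the `num_to_drop` slowest servers."""
    ranked = sorted(servers, key=lambda s: server_time(s, local_epochs))
    return ranked[: len(ranked) - num_to_drop]


# Hypothetical servers; "c" plays the role of the straggler.
servers = [
    EdgeServer("a", 1.0, 0.5),
    EdgeServer("b", 1.5, 0.5),
    EdgeServer("c", 6.0, 2.0),
]

full = round_time(servers, local_epochs=2)
kept = drop_slowest(servers, local_epochs=2, num_to_drop=1)
reduced = round_time(kept, local_epochs=2)
print(f"all servers: {full} s per round, after dropping one: {reduced} s per round")
```

In this toy setting the round time is dominated by a single slow server, which is why the choice of participating servers and the number of local epochs both affect the overall training time.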