With advancements in AI infrastructure and Trusted Execution Environment (TEE) technology, Federated Learning as a Service (FLaaS) based on JointCloud Computing (JCC) promises to break through the resource constraints imposed by heterogeneous edge devices in the traditional Federated Learning (FL) paradigm. Specifically, under the protection of TEEs, data owners can achieve efficient model training with high-performance AI services in the cloud, while cloud service providers can enable collaborative learning among data owners by offering additional FL services. However, FLaaS still faces three challenges: i) low training performance caused by data heterogeneity among data owners, ii) high communication overhead among different clouds (i.e., data centers), and iii) the lack of efficient resource scheduling strategies to balance training time and cost. To address these challenges, this paper presents NebulaFL, a novel asynchronous FL approach for collaborative model training across multiple clouds. To mitigate data heterogeneity, NebulaFL adopts a version control-based asynchronous training scheme within each data center to balance training time among data owners. To reduce communication overhead, NebulaFL employs a decentralized model rotation mechanism that enables effective knowledge sharing among data centers. To balance training time and cost, NebulaFL integrates a reward-guided strategy for data owner selection and resource scheduling. Experimental results demonstrate that, compared to state-of-the-art FL methods, NebulaFL achieves up to 5.71% higher accuracy, while reducing communication overhead by up to 50% and costs by up to 61.94% under a target accuracy.