Federated Learning (FL) is a distributed Machine Learning (ML) technique that can benefit from cloud environments while preserving data privacy. We propose Multi-FedLS, a framework that manages multi-cloud resources to reduce the execution time and financial cost of Cross-Silo Federated Learning applications by using preemptible VMs, which are cheaper than on-demand VMs but can be revoked at any time. The framework comprises four modules: Pre-Scheduling, Initial Mapping, Fault Tolerance, and Dynamic Scheduler. This paper extends our previous work \cite{brum2022sbac} by formally describing the Multi-FedLS resource manager framework and its modules. We conducted experiments with three Cross-Silo FL applications on CloudLab, and a proof of concept confirms that Multi-FedLS can run on a multi-cloud composed of AWS and GCP, two commercial cloud providers. The results show that the problem of executing Cross-Silo FL applications in multi-cloud environments with preemptible VMs can be solved efficiently by combining a mathematical formulation, fault tolerance techniques, and a simple heuristic that chooses a new VM when a revocation occurs.