Apodotiko: Enabling Efficient Serverless Federated Learning in Heterogeneous Environments

Federated Learning (FL) is an emerging machine learning paradigm that enables the collaborative training of a shared global model across distributed clients while keeping the data decentralized. Recent work on designing systems for efficient FL has shown that serverless computing technologies, particularly Function-as-a-Service (FaaS), can enhance resource efficiency, reduce training costs, and relieve data holders of the burden of managing complex infrastructure. However, current serverless FL systems still suffer from stragglers, i.e., slow clients that impede the collaborative training process. While strategies for mitigating stragglers in these systems have been proposed, they overlook the diverse hardware resource configurations among FL clients. To this end, we present Apodotiko, a novel asynchronous training strategy designed for serverless FL. Our strategy incorporates a scoring mechanism that evaluates each client's hardware capacity and dataset size to prioritize and select clients for each training round, thereby minimizing the effect of stragglers on system performance. We comprehensively evaluate Apodotiko across diverse datasets, considering a mix of CPU and GPU clients, and compare its performance against five other FL training strategies. Results from our experiments demonstrate that Apodotiko outperforms the other FL training strategies, achieving an average speedup of 2.75x and a maximum speedup of 7.03x. Furthermore, our strategy reduces cold starts by a factor of four on average, demonstrating its suitability for serverless environments.
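
The score-based client selection described above can be illustrated with a minimal Python sketch. This is not Apodotiko's actual scoring function, which the abstract does not specify. It assumes, purely for illustration, that a client's score is its observed training throughput (local samples divided by the duration of its last round), as a rough proxy that combines hardware capacity and dataset size. The names `Client`, `score`, and `select_clients` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Client:
    client_id: str
    dataset_size: int          # number of local training samples
    last_train_seconds: float  # observed duration of the client's last local round


def score(client: Client) -> float:
    """Hypothetical score: samples processed per second in the last round.

    Assumption for illustration only; the abstract does not give the formula.
    """
    if client.last_train_seconds <= 0:
        return 0.0  # no usable measurement yet
    return client.dataset_size / client.last_train_seconds


def select_clients(clients: List[Client], k: int) -> List[Client]:
    """Return the k highest-scoring clients (ties keep their input order)."""
    return sorted(clients, key=score, reverse=True)[:k]


# Example: one GPU client and two slower CPU clients.
clients = [
    Client("gpu-0", dataset_size=6000, last_train_seconds=20.0),
    Client("cpu-0", dataset_size=6000, last_train_seconds=120.0),
    Client("cpu-1", dataset_size=3000, last_train_seconds=45.0),
]
selected = select_clients(clients, k=2)
selected_ids = [c.client_id for c in selected]
```

Under this assumed score, the slowest client (`cpu-0`) is left out of the round, which is the kind of straggler avoidance the strategy aims for. A real system would also need to make sure that low-scoring clients still contribute over time.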
