Federated Learning (FL) is a privacy-focused machine learning paradigm that collaboratively trains models directly on edge devices. Simulated environments are crucial for large-scale FL research, allowing scientists to quickly test new ideas without acquiring millions of devices. However, current simulators cannot reach the scale needed to emulate production systems or to push the boundaries of research in a time-efficient manner. This work proposes \emph{Pollen}, a novel resource-aware system for speeding up FL simulations. \emph{Pollen} addresses two limiting factors of previous systems: (a) the communication inefficiency of pull-based client execution and (b) the system inefficiencies caused by ignoring heterogeneous simulation hardware. \emph{Pollen} executes high-throughput FL simulations at scale by (a) using a push-based client placement system and (b) balancing clients across servers and their GPUs with a novel online machine learning model. Furthermore, \emph{Pollen}'s placement model provides accurate training-time predictions that reduce GPU idle time by up to 50\%, allowing researchers to run extensive experiments sampling from millions of clients. Our experiments evaluate \emph{Pollen} on four representative FL tasks. We compare \emph{Pollen} to the ad-hoc FL frameworks \emph{Flower}, \emph{Flute}, \emph{FedScale}, and \emph{Parrot}, and demonstrate speed-ups that shorten experiments by days or weeks.
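To make the push-based placement idea concrete, the following is a minimal, illustrative Python sketch, not \emph{Pollen}'s actual implementation: a server-side loop pushes each sampled client onto the worker GPU with the earliest predicted finish time, using a toy online model that keeps running means of observed training times per (worker, task) pair. All names here (\texttt{OnlinePredictor}, \texttt{push\_place}, the worker and task labels) are hypothetical.

\begin{verbatim}
import heapq

class OnlinePredictor:
    """Toy stand-in for Pollen's online placement model (hypothetical):
    keeps a running mean of observed training times per (worker, task)."""
    def __init__(self, prior=1.0):
        self.stats = {}     # (worker, task) -> (total_time, count)
        self.prior = prior  # optimistic default for unseen pairs

    def predict(self, worker, task):
        total, count = self.stats.get((worker, task), (self.prior, 1))
        return total / count

    def update(self, worker, task, observed_time):
        total, count = self.stats.get((worker, task), (0.0, 0))
        self.stats[(worker, task)] = (total + observed_time, count + 1)

def push_place(clients, workers, predictor):
    """Push-based placement: the server assigns every sampled client up
    front, greedily picking the worker with the earliest predicted finish
    time (classic list scheduling; not Pollen's exact algorithm)."""
    queue = [(0.0, w) for w in workers]  # (predicted finish time, worker)
    heapq.heapify(queue)
    plan = []
    for client, task in clients:
        finish, worker = heapq.heappop(queue)
        plan.append((client, worker))
        heapq.heappush(queue,
                       (finish + predictor.predict(worker, task), worker))
    return plan

# Example: six clients of one task, two GPUs of unequal speed.
predictor = OnlinePredictor()
predictor.update("gpu-fast", "cifar10", 1.0)
predictor.update("gpu-slow", "cifar10", 3.0)
clients = [(f"client-{i}", "cifar10") for i in range(6)]
print(push_place(clients, ["gpu-fast", "gpu-slow"], predictor))
\end{verbatim}

In this sketch, the faster GPU receives proportionally more clients, which is the intuition behind reducing GPU idle time: no worker waits while another still has a backlog of predicted work.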