Federated Learning (FL) is an emerging machine learning technique that enables distributed model training across data silos or edge devices without sharing data. However, FL is inherently less efficient than centralized training, which will further increase the already high energy usage and associated carbon emissions of machine learning. One way to reduce FL's carbon footprint is to schedule training jobs according to the availability of renewable excess energy, which occurs at certain times and places in the grid. With such volatile and unreliable resources, however, existing FL schedulers cannot always ensure fast, efficient, and fair training. We propose FedZero, an FL system that operates exclusively on renewable excess energy and spare capacity of compute infrastructure, effectively reducing a training's operational carbon emissions to zero. Using energy and load forecasts, FedZero exploits the spatio-temporal availability of excess resources by selecting clients for fast convergence and fair participation. Our evaluation, based on real solar and load traces, shows that FedZero converges significantly faster than existing approaches under these constraints while consuming less energy. Furthermore, it is robust to forecasting errors and scales to tens of thousands of clients.
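To make the idea of forecast-driven client selection concrete, the following is a minimal sketch and not FedZero's actual algorithm. It assumes each client reports a per-timestep forecast of excess energy and of spare compute capacity. A client is eligible if, using only those excess resources, it can finish a minimum number of batches within the round. Eligible clients that have participated least so far are preferred, as a crude stand-in for fairness. All names (`Client`, `feasible_batches`, `select_clients`) and the energy-per-batch cost model are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Client:
    # All fields are illustrative assumptions, not taken from the paper.
    name: str
    excess_energy_wh: List[float]  # forecast excess energy per timestep (Wh)
    spare_batches: List[int]       # forecast spare capacity per timestep (batches)
    wh_per_batch: float            # assumed energy cost of one batch (Wh)
    past_rounds: int = 0           # rounds this client has already joined


def feasible_batches(client: Client, horizon: int) -> int:
    """Batches the client can compute within `horizon` timesteps
    using only forecast excess energy and spare capacity."""
    total = 0
    pairs = zip(client.excess_energy_wh[:horizon], client.spare_batches[:horizon])
    for energy, capacity in pairs:
        # Each timestep is limited by both energy and compute capacity.
        by_energy = int(energy // client.wh_per_batch)
        total += min(capacity, by_energy)
    return total


def select_clients(clients: List[Client], n: int,
                   min_batches: int, horizon: int) -> List[Client]:
    """Greedy selection: keep clients that can reach `min_batches`,
    then prefer fewer past rounds, breaking ties by more feasible batches."""
    eligible = [c for c in clients if feasible_batches(c, horizon) >= min_batches]
    eligible.sort(key=lambda c: (c.past_rounds, -feasible_batches(c, horizon)))
    return eligible[:n]
```

In this sketch, a client with plenty of spare compute but no excess energy in the forecast window is never selected, and the reverse also holds. A real scheduler would additionally need to handle forecast errors and energy budgets shared between clients, both of which are ignored here.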