Deploying a Hierarchical Federated Learning (HFL) pipeline across the computing continuum (CC) requires careful organization of participants into a hierarchical structure, with intermediate aggregation nodes placed between the FL clients and the global FL server. This is challenging because of (i) cost constraints, (ii) varying data distributions, and (iii) the volatile operating environment of the CC. In response to these challenges, we present a framework for the adaptive orchestration of HFL pipelines, designed to react to client churn and infrastructure-level events while balancing communication cost and ML model accuracy. Our mechanisms identify and react to events that trigger HFL reconfiguration actions at runtime, building on multi-level monitoring information (model accuracy, resource availability, resource cost). Moreover, our framework introduces a generic methodology for estimating reconfiguration costs, which is used to continuously re-evaluate the quality of adaptation actions and is extensible to optimize for various HFL performance criteria. Built as an extension of the Kubernetes ecosystem, our framework demonstrates the ability to react promptly and effectively to changes in the operating environment, making the most of the available communication cost budget and balancing costs and ML performance at runtime.