Cross-device federated learning (FL) has been well studied from algorithmic, system scalability, and training speed perspectives. Nonetheless, moving from centralized training to cross-device FL for millions or billions of devices presents many risks, including performance loss, developer inertia, poor user experience, and unexpected application failures. In addition, the corresponding infrastructure, development costs, and return on investment are difficult to estimate. In this paper, we present a device-cloud collaborative FL platform that integrates with an existing machine learning platform and provides tools to measure real-world constraints, assess infrastructure capabilities, evaluate model training performance, and estimate system resource requirements, so that FL can be brought into production responsibly. We also present a decision workflow that leverages the FL-integrated platform to comprehensively evaluate the trade-offs of cross-device FL, and we share our empirical evaluations of business-critical machine learning applications that impact hundreds of millions of users.