Cross-device federated learning (FL) has been well studied from algorithmic, system scalability, and training speed perspectives. Nonetheless, moving from centralized training to cross-device FL for millions or billions of devices presents many risks, including performance loss, developer inertia, poor user experience, and unexpected application failures. In addition, the corresponding infrastructure, development costs, and return on investment are difficult to estimate. In this paper, we present a device-cloud collaborative FL platform that integrates with an existing machine learning platform, providing tools to measure real-world constraints, assess infrastructure capabilities, evaluate model training performance, and estimate system resource requirements so that FL can be brought into production responsibly. We also present a decision workflow that leverages the FL-integrated platform to comprehensively evaluate the trade-offs of cross-device FL, and we share our empirical evaluations of business-critical machine learning applications that impact hundreds of millions of users.