Federated learning (FL) aggregation on serverless platforms faces a hard scalability ceiling: existing architectures (lambda-FL, LIFL) partition clients across aggregators, but every aggregator must still hold the complete model gradient in memory. When the gradient exceeds the per-function memory limit (e.g., 10 GB on AWS Lambda), aggregation becomes infeasible regardless of tree depth or branching factor. We propose GradsSharding, which instead partitions the gradient tensor into M shards, each averaged independently by a serverless function that receives that shard's contributions from all clients. Because FedAvg averaging is element-wise, this produces results bit-identical to those of tree-based approaches, so model accuracy is invariant by construction. Per-function memory is bounded at O(|θ|/M), where |θ| is the size of the model gradient, independent of client count, so a model too large for any single function can still be aggregated by increasing M. We evaluate GradsSharding against lambda-FL and LIFL through HPC experiments and real AWS Lambda deployments across model sizes from 43 MB to 5 GB. Results show a cost crossover at a gradient size of approximately 500 MB and a 2.7× cost reduction at VGG-16 scale, and GradsSharding is the only architecture that remains deployable beyond the serverless memory ceiling.
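The shard-wise averaging described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes flat gradients stored as Python lists, equal client weights, and contiguous shards, and the function names (`split_into_shards`, `average_shard`, `sharded_fedavg`, `full_fedavg`) are hypothetical. Each call to `average_shard` stands in for one serverless function, which only ever sees a slice of roughly |θ|/M elements per client.

```python
from typing import List


def split_into_shards(grad: List[float], num_shards: int) -> List[List[float]]:
    """Split a flat gradient into num_shards contiguous slices.

    Slice lengths differ by at most one element (assumed layout).
    """
    base, extra = divmod(len(grad), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        shards.append(grad[start:start + size])
        start += size
    return shards


def average_shard(client_shards: List[List[float]]) -> List[float]:
    """Average one shard across all clients, element by element.

    Stands in for a single serverless function; equal client
    weights are an assumption of this sketch.
    """
    n = len(client_shards)
    return [sum(values) / n for values in zip(*client_shards)]


def sharded_fedavg(client_grads: List[List[float]], num_shards: int) -> List[float]:
    """Average each shard independently, then concatenate the results."""
    per_client = [split_into_shards(g, num_shards) for g in client_grads]
    averaged = [
        average_shard([shards[m] for shards in per_client])
        for m in range(num_shards)
    ]
    return [x for shard in averaged for x in shard]


def full_fedavg(client_grads: List[List[float]]) -> List[float]:
    """Reference: one aggregator averages the whole gradient."""
    n = len(client_grads)
    return [sum(values) / n for values in zip(*client_grads)]


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 4.0, 5.0, 6.0, 7.0]]
    print(sharded_fedavg(grads, num_shards=2) == full_fedavg(grads))
```

In this sketch the sharded and unsharded results match exactly because both sum each element's client contributions in the same client order; an aggregator that combines clients in a different order is not covered by this comparison.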