We study practical data characteristics underlying federated learning, in which non-i.i.d. client data have sparse features and each client's local data normally involves only a small part of the full model, called a submodel. Under such data sparsity, the classical federated averaging (FedAvg) algorithm and its variants are severely slowed down, because when the global model is updated, each client's zero update for the part of the full model outside its submodel is inaccurately aggregated. We therefore propose federated submodel averaging (FedSubAvg), which ensures that the expected global update of each model parameter equals the average of the local updates of the clients that involve it. We prove the convergence rate of FedSubAvg by deriving an upper bound under a new metric called the element-wise gradient norm. In particular, this new metric can characterize the convergence of federated optimization over sparse data, whereas the conventional squared gradient norm used for FedAvg and its variants cannot. We extensively evaluate FedSubAvg on both public and industrial datasets. The evaluation results demonstrate that FedSubAvg significantly outperforms FedAvg and its variants.
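The difference between the two aggregation rules can be illustrated with a toy example. The following is only a minimal sketch of the property stated in the abstract, not the paper's algorithm: it assumes full client participation and a single aggregation round, and it leaves out client sampling and whatever correction FedSubAvg applies under it. The function names, the dict-based representation of a sparse update, and the toy values are all hypothetical choices made for illustration.

```python
def fedavg_aggregate(updates, num_params):
    """Plain averaging: every client counts toward every coordinate, so a
    client's implicit zero update outside its submodel dilutes the result."""
    total = [0.0] * num_params
    for update in updates:
        for idx, value in update.items():
            total[idx] += value
    return [t / len(updates) for t in total]


def submodel_aggregate(updates, num_params):
    """Per-coordinate averaging over only the clients that involve that
    coordinate (an assumed simplification of the property in the abstract)."""
    total = [0.0] * num_params
    count = [0] * num_params
    for update in updates:
        for idx, value in update.items():
            total[idx] += value
            count[idx] += 1
    # Coordinates that no client involves receive a zero update.
    return [t / c if c else 0.0 for t, c in zip(total, count)]


# Hypothetical toy data: 4 clients, 3 parameters. Each dict maps the
# parameter indices in a client's submodel to that client's local update.
client_updates = [
    {0: 1.0, 1: 2.0},
    {0: 1.0},
    {0: 1.0},
    {0: 1.0, 2: 4.0},
]

plain = fedavg_aggregate(client_updates, 3)   # [1.0, 0.5, 1.0]
sub = submodel_aggregate(client_updates, 3)   # [1.0, 2.0, 4.0]
print(plain, sub)
```

In this toy setting, parameter 0 is involved by all four clients and both rules agree. Parameters 1 and 2 are each involved by a single client, and plain averaging shrinks their updates by a factor of four, which is the kind of slowdown on sparse coordinates that the abstract attributes to FedAvg.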