Cross-device federated learning (FL) trains a model on data distributed across edge devices, typically millions of them, without the data leaving the devices. SGD is the standard client optimizer for on-device training in cross-device FL, favored for its memory and computational efficiency. In centralized training of neural language models, however, adaptive optimizers are preferred because they offer improved stability and performance. In light of this, we ask whether language models can be modified so that they can be trained efficiently with SGD client optimizers, and we answer this affirmatively. We propose a scale-invariant Coupled Input Forget Gate (SI CIFG) recurrent network, obtained by modifying the sigmoid and tanh activations in the recurrent cell, and show in large-scale experiments that this new model converges faster and achieves better utility than the standard CIFG recurrent model in cross-device FL. We further show that the proposed scale-invariant modification also helps in federated learning of larger transformer models. Finally, we demonstrate that the scale-invariant modification is compatible with other non-adaptive algorithms. In particular, our results suggest an improved privacy-utility trade-off in federated learning with differential privacy.
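To make the notion of scale invariance concrete, the following is a minimal sketch and not the SI CIFG cell itself: the abstract does not give the exact form of the modified sigmoid and tanh activations. The sketch assumes a hypothetical gate, `si_gate`, that divides the pre-activation by the norm of the weight vector before applying the sigmoid. All function names here are illustrative. It checks numerically two general properties of any scale-invariant function f, that is, one with f(cw) = f(w) for c > 0: the gradient is orthogonal to w, and the gradient at cw is the gradient at w divided by c.

```python
import math


def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def plain_gate(w, x):
    """Ordinary gate: sigmoid of the raw pre-activation w . x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def si_gate(w, x):
    """Hypothetical scale-invariant gate (an assumption for illustration,
    not the paper's definition): sigmoid of w . x divided by ||w||."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return sigmoid(dot / norm)


def numerical_grad(f, w, h=1e-6):
    """Central-difference estimate of the gradient of f at w."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def scaled(w, c):
    """Return the weight vector w multiplied by the scalar c."""
    return [c * wi for wi in w]


if __name__ == "__main__":
    w = [0.5, -1.2, 2.0]
    x = [1.0, 0.3, -0.7]

    # Rescaling the weights changes the plain gate but not the SI gate.
    print("plain:", plain_gate(w, x), plain_gate(scaled(w, 3.0), x))
    print("si:   ", si_gate(w, x), si_gate(scaled(w, 3.0), x))

    # The gradient of the SI gate has no component along w.
    g = numerical_grad(lambda v: si_gate(v, x), w)
    print("<grad, w> =", sum(gi * wi for gi, wi in zip(g, w)))
```

Because the gradient shrinks as the weight norm grows, a fixed SGD step has a smaller relative effect on larger weights. The sketch only demonstrates this property for the assumed gate; it makes no claim about how the reported results were obtained.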