Federated Optimization of Smooth Loss Functions

In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server minimizes an ERM objective function using training data that is stored across $m$ clients. In this setting, the Federated Averaging (FedAve) algorithm is the staple for determining $\epsilon$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, the convergence analysis of FedAve only relies on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness, we propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm. Since smoothness in data induces an approximate low rank structure on the loss function, our method first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients. Then, our method solves the ERM problem at the server using inexact gradient descent. To show that FedLRGD can have superior performance to FedAve, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity. Under some assumptions on the loss function, e.g., strong convexity in parameter, $\eta$-Hölder smoothness in data, etc., we prove that the federated oracle complexity of FedLRGD scales like $\phi m(p/\epsilon)^{\Theta(d/\eta)}$ and that of FedAve scales like $\phi m(p/\epsilon)^{3/4}$ (neglecting sub-dominant factors), where $\phi\gg 1$ is a "communication-to-computation ratio," $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small and the loss function is sufficiently smooth in the data, FedLRGD beats FedAve in federated oracle complexity. Finally, in the course of analyzing FedLRGD, we also establish a result on low rank approximation of latent variable models.
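The two-phase idea described above (learn weights from a few communication rounds, then run inexact gradient descent at the server) can be illustrated with a toy sketch. This is not the FedLRGD algorithm itself, whose details the abstract does not give. Everything in the sketch is an assumption made for illustration: the scalar loss $f(x;\theta) = \theta^2/2 + \log(1+e^{\theta x})$ (strongly convex in $\theta$, smooth in $x$), the evenly spaced pivot points, the probe parameters, and the least-squares fit used to learn the weights.

```python
import numpy as np

def grad(x, theta):
    """Gradient in theta of the toy loss 0.5*theta**2 + log(1 + exp(theta*x))."""
    return theta + x / (1.0 + np.exp(-theta * x))

rng = np.random.default_rng(0)
m, n_per_client = 5, 200                      # hypothetical client count and shard size
clients = [rng.uniform(0.0, 1.0, size=n_per_client) for _ in range(m)]

# Smoothness in the data makes the gradient "matrix" (probes x data points)
# approximately low rank: its singular values decay quickly.
probes = np.linspace(-2.0, 2.0, 50)           # probe parameters (assumed)
all_x = np.concatenate(clients)
G = grad(all_x[None, :], probes[:, None])     # shape (50, 1000)
sv = np.linalg.svd(G, compute_uv=False)
print("leading singular values (normalized):", sv[:8] / sv[0])

# Communication phase (sketch): each client reports its average gradient
# at every probe parameter; the server averages the reports.
reports = [grad(x[None, :], probes[:, None]).mean(axis=1) for x in clients]
target = np.mean(reports, axis=0)             # equal shard sizes, so a plain mean

# The server fits weights so that a weighted sum of gradients at a few
# pivot data points matches the reported average gradients.
pivots = np.linspace(0.0, 1.0, 6)             # pivot data points (assumed)
A = grad(pivots[None, :], probes[:, None])    # shape (50, 6)
weights, *_ = np.linalg.lstsq(A, target, rcond=None)

def approx_grad(theta):
    """Inexact gradient computed at the server from pivots and learned weights."""
    return float(weights @ grad(pivots, theta))

def exact_grad(theta):
    """Exact average gradient, which would need every client's data."""
    return float(grad(all_x, theta).mean())

def gradient_descent(g, theta0=1.0, step=0.5, iters=100):
    theta = theta0
    for _ in range(iters):
        theta -= step * g(theta)
    return theta

theta_lr = gradient_descent(approx_grad)      # no further communication needed
theta_exact = gradient_descent(exact_grad)
print("inexact GD:", theta_lr, "exact GD:", theta_exact)
```

In this toy setting the server-side descent uses only six pivot gradients per step, and the communication cost is confined to the initial probing phase. How the weights and pivots are actually chosen in FedLRGD, and how the approximation error enters the convergence guarantee, are matters for the paper itself.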


