Bilevel optimization, crucial for hyperparameter tuning, meta-learning, and reinforcement learning, remains underexplored in decentralized learning paradigms such as decentralized federated learning (DFL). Typically, decentralized bilevel methods rely on both gradients and Hessian matrices to approximate the hypergradient of the upper-level model. However, acquiring and sharing this second-order oracle is both compute- and communication-intensive. To overcome these challenges, this paper introduces $\text{C}^2$DFB, a fully first-order method for decentralized bilevel optimization that is both compute- and communication-efficient. In $\text{C}^2$DFB, each learning node optimizes a min-min-max problem to approximate the hypergradient using only gradient information. To reduce the traffic load in the inner loop that solves the lower-level problem, $\text{C}^2$DFB incorporates a lightweight communication protocol that efficiently transmits compressed residuals of the local parameters. A rigorous theoretical analysis establishes the convergence of the algorithm with a first-order oracle complexity of $\tilde{\mathcal{O}}(\epsilon^{-4})$. Experiments on hyperparameter tuning and hyper-representation tasks validate the superiority of $\text{C}^2$DFB across various topologies and heterogeneous data distributions.
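For intuition, fully first-order hypergradient approximations of this kind are typically derived from a penalty reformulation of the bilevel problem; a representative min-min-max form (illustrative, and not necessarily the exact objective used by $\text{C}^2$DFB) is
\[
\min_{x}\;\min_{y}\;\max_{z}\;\; f(x, y) + \lambda \bigl( g(x, y) - g(x, z) \bigr),
\]
where $f$ and $g$ denote the upper- and lower-level objectives and $\lambda > 0$ is a penalty parameter. Every quantity needed to optimize this objective is a plain gradient of $f$ or $g$, so no Hessian or Hessian-vector oracle is required.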
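As a rough illustration of the compressed-residual protocol, the sketch below follows a CHOCO-style gossip scheme in which each node transmits only a compressed residual between its local parameter and a publicly synchronized copy. The compressor \texttt{topk\_compress} and the consensus step size $\gamma$ are assumptions for illustration; the actual compressor and update rule in $\text{C}^2$DFB may differ.
\begin{verbatim}
import numpy as np

def topk_compress(v, k):
    # Illustrative contractive compressor: keep the k largest-magnitude
    # entries of v and zero out the rest.
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def communication_round(x, x_hat, W, gamma, k):
    # x:     list of local parameter vectors, one per node
    # x_hat: publicly reproducible copies (identical views at all nodes)
    # W:     mixing matrix of the communication topology
    n = len(x)
    # Each node broadcasts only the compressed residual q_i,
    # rather than the full parameter vector.
    q = [topk_compress(x[i] - x_hat[i], k) for i in range(n)]
    # Every node reconstructs the updated public copies from the
    # received residuals.
    for i in range(n):
        x_hat[i] = x_hat[i] + q[i]
    # Gossip/consensus step on the public copies.
    for i in range(n):
        x[i] = x[i] + gamma * sum(W[i][j] * (x_hat[j] - x_hat[i])
                                  for j in range(n))
    return x, x_hat
\end{verbatim}
Because only the compressed residuals cross the network, the per-round traffic of the inner loop scales with the compression budget $k$ rather than with the full model dimension.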