A Neural Tangent Kernel View on Federated Averaging for Deep Linear Neural Network

Federated averaging (FedAvg) is a widely employed paradigm for collaboratively training models from distributed clients without sharing data. Nowadays, the neural network has achieved remarkable success due to its extraordinary performance, which makes it a preferred choice as the model in FedAvg. However, the optimization problem of the neural network is often non-convex even non-smooth. Furthermore, FedAvg always involves multiple clients and local updates, which results in an inaccurate updating direction. These properties bring difficulties in analyzing the convergence of FedAvg in training neural networks. Recently, neural tangent kernel (NTK) theory has been proposed towards understanding the convergence of first-order methods in tackling the non-convex problem of neural networks. The deep linear neural network is a classical model in theoretical subject due to its simple formulation. Nevertheless, there exists no theoretical result for the convergence of FedAvg in training the deep linear neural network. By applying NTK theory, we make a further step to provide the first theoretical guarantee for the global convergence of FedAvg in training deep linear neural networks. Specifically, we prove FedAvg converges to the global minimum at a linear rate $\mathcal{O}\big((1-\eta K /N)^t\big)$, where $t$ is the number of iterations, $\eta$ is the learning rate, $N$ is the number of clients and $K$ is the number of local updates. Finally, experimental evaluations on two benchmark datasets are conducted to empirically validate the correctness of our theoretical findings.

翻译：联邦平均（FedAvg）是一种广泛使用的范式，用于在分布式客户端之间协同训练模型而无需共享数据。如今，神经网络凭借其卓越性能取得了显著成功，这使其成为FedAvg中模型的首选。然而，神经网络的优化问题通常是非凸甚至非光滑的。此外，FedAvg总是涉及多个客户端和本地更新，这可能导致更新方向不准确。这些特性为分析FedAvg训练神经网络时的收敛性带来了困难。近年来，神经正切核（NTK）理论被提出，用于理解一阶方法解决神经网络非凸问题时的收敛性。深度线性神经网络因其简洁的公式而成为理论课题中的经典模型。然而，目前尚未有关于FedAvg训练深度线性神经网络收敛性的理论结果。通过应用NTK理论，我们进一步迈出一步，首次为FedAvg训练深度线性神经网络的全局收敛性提供了理论保证。具体来说，我们证明了FedAvg以线性速率$\mathcal{O}\big((1-\eta K /N)^t\big)$收敛到全局最小值，其中$t$为迭代次数，$\eta$为学习率，$N$为客户端数量，$K$为本地更新次数。最后，在两个基准数据集上进行了实验评估，以实证验证我们理论发现的正确性。

相关内容

Neural Networks

关注 1654

神经网络（Neural Networks）是世界上三个最古老的神经建模学会的档案期刊:国际神经网络学会(INNS)、欧洲神经网络学会(ENNS)和日本神经网络学会(JNNS)。神经网络提供了一个论坛，以发展和培育一个国际社会的学者和实践者感兴趣的所有方面的神经网络和相关方法的计算智能。神经网络欢迎高质量论文的提交，有助于全面的神经网络研究，从行为和大脑建模，学习算法，通过数学和计算分析，系统的工程和技术应用，大量使用神经网络的概念和技术。这一独特而广泛的范围促进了生物和技术研究之间的思想交流，并有助于促进对生物启发的计算智能感兴趣的跨学科社区的发展。因此，神经网络编委会代表的专家领域包括心理学，神经生物学，计算机科学，工程，数学，物理。该杂志发表文章、信件和评论以及给编辑的信件、社论、时事、软件调查和专利信息。文章发表在五个部分之一:认知科学，神经科学，学习系统，数学和计算分析、工程和应用。官网地址：http://dblp.uni-trier.de/db/journals/nn/

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日