In decentralized learning over distributed datasets, the data distributions can differ significantly across agents. Current state-of-the-art decentralized algorithms mostly assume that the data are independent and identically distributed (IID). This paper focuses on improving decentralized learning over non-IID data. We propose \textit{Neighborhood Gradient Clustering (NGC)}, a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. Cross-gradients for a pair of neighboring agents are the derivatives of the model parameters of one agent with respect to the dataset of the other agent. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, the model-variant cross-gradients (derivatives of the neighbors' parameters with respect to the local dataset), and the data-variant cross-gradients (derivatives of the local model with respect to its neighbors' datasets). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints. Further, we present \textit{CompNGC}, a compressed version of \textit{NGC} that reduces the communication overhead by $32 \times$. We theoretically analyze the convergence rate of the proposed algorithm and demonstrate its efficiency on non-IID data sampled from various vision and language datasets. Our experiments demonstrate that \textit{NGC} and \textit{CompNGC} outperform the existing SoTA decentralized learning algorithm over non-IID data by $0$--$6\%$, with significantly lower compute and memory requirements. Further, our experiments show that the model-variant cross-gradient information available locally at each agent can improve the performance over non-IID data by $1$--$35\%$ without additional communication cost.
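The gradient-mixing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy one-parameter regression model, the function names `grad` and `ngc_gradient`, and the mixing weight `alpha` are all assumptions made for illustration, since the abstract states only that a weighted mean is used and does not specify the weights.

```python
# Hypothetical sketch of one NGC-style gradient-mixing step for a single agent.
# Toy model: y = w * x with loss 0.5 * (w*x - y)^2. The weighting scheme
# (alpha) is an assumption; the abstract does not specify the weights.

def grad(w, data):
    """Mean gradient of 0.5 * (w*x - y)^2 over a list of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in data) / len(data)


def ngc_gradient(w_local, data_local, neighbor_models, neighbor_data, alpha=0.5):
    """Weighted mean of self-, model-variant and data-variant cross-gradients."""
    # Self-gradient: local parameters on the local dataset.
    self_grad = grad(w_local, data_local)

    # Model-variant cross-gradients: neighbors' parameters on the local
    # dataset. These need no communication beyond the received models.
    model_variant = [grad(w_j, data_local) for w_j in neighbor_models]

    # Data-variant cross-gradients: local parameters on each neighbor's
    # dataset. In a real system the neighbor would compute these and send
    # them back in an extra communication round, so raw data never leaves
    # the neighbor. They are computed directly here only for illustration.
    data_variant = [grad(w_local, d_j) for d_j in neighbor_data]

    cross = model_variant + data_variant
    if not cross:
        return self_grad  # isolated agent: fall back to the plain local gradient
    return (1 - alpha) * self_grad + alpha * sum(cross) / len(cross)


if __name__ == "__main__":
    g = ngc_gradient(
        w_local=1.0,
        data_local=[(1.0, 2.0)],
        neighbor_models=[3.0],
        neighbor_data=[[(2.0, 2.0)]],
    )
    print(g)
```

The returned value would replace the agent's local gradient in its optimizer update. Setting `alpha=0` in this sketch recovers the plain local gradient, and passing an empty `neighbor_data` list keeps only the self- and model-variant terms, which corresponds to the setting with no additional communication.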