DPDL: Towards Differential Privacy Preservation in Decentralized Stochastic Learning on Non-IID Data

In decentralized learning, a group of agents collaborates to train a global model on distributed datasets without a central server. Although many state-of-the-art studies have verified the power of such collaboration, it requires extensive exchange of gradient information among the agents and thus exposes individual agents to a high risk of privacy leakage. Moreover, in real-world applications, the training data are usually not independent and identically distributed (non-IID) across the agents, which makes privacy-preserving decentralized learning more challenging. To address these issues, we propose DPDL, a privacy-preserving decentralized learning algorithm for non-IID data, which incorporates Differential Privacy (DP) into cross-gradient aggregation through a similarity-based calibration technique. Specifically, in each round, each agent perturbs its cross-gradients (i.e., the derivatives of its neighbors' local models on its private local data) with the Gaussian noise mechanism before sharing them with its neighbors; it then adopts cosine similarity to calibrate the perturbed cross-gradients it receives, so that the aggregation of the calibrated cross-gradients can be used to update the local model effectively in a momentum-like manner. Our rigorous theoretical analysis not only reveals the minimum noise level required to achieve a specific level of privacy preservation, but also shows that our algorithm still achieves a linear speedup in training with non-IID data. Finally, we conduct extensive experiments on real-world datasets to validate the effectiveness of our algorithm in defending against privacy attacks and in training accurate models.
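The per-round procedure described in the abstract (perturb, calibrate, aggregate, update) can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the L2 clipping step, the noise scale `sigma * clip_norm`, the rule of weighting each received cross-gradient by its non-negative cosine similarity to the local gradient, and the heavy-ball form of the momentum update. The function names (`clip_l2`, `perturb`, `calibrated_aggregate`, `momentum_step`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def clip_l2(g, clip_norm):
    """Scale g so that its L2 norm is at most clip_norm (assumed sensitivity bound)."""
    norm = np.linalg.norm(g)
    return g * min(1.0, clip_norm / (norm + 1e-12))

def perturb(g, clip_norm, sigma, rng):
    """Gaussian mechanism sketch: clip, then add zero-mean noise with std sigma * clip_norm."""
    noise = rng.normal(0.0, sigma * clip_norm, size=g.shape)
    return clip_l2(g, clip_norm) + noise

def cosine(a, b):
    """Cosine similarity between two flat vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def calibrated_aggregate(local_grad, received):
    """Assumed calibration rule: weight each received (noisy) cross-gradient by
    max(0, cosine similarity to the local gradient), then take a weighted average
    in which the local gradient has weight 1."""
    agg = np.asarray(local_grad, dtype=float).copy()
    total = 1.0
    for g in received:
        w = max(0.0, cosine(local_grad, g))
        agg += w * g
        total += w
    return agg / total

def momentum_step(x, v, agg, lr, beta):
    """Assumed heavy-ball update driven by the aggregated gradient."""
    v_new = beta * v + agg
    return x - lr * v_new, v_new

# Toy round for one agent with two neighbors (all values are made up).
rng = np.random.default_rng(0)
local_grad = np.array([1.0, 0.5, -0.2])
cross_grads = [np.array([0.9, 0.4, -0.1]), np.array([-1.0, 0.3, 0.8])]
noisy = [perturb(g, clip_norm=1.0, sigma=0.1, rng=rng) for g in cross_grads]
agg = calibrated_aggregate(local_grad, noisy)
x, v = momentum_step(np.zeros(3), np.zeros(3), agg, lr=0.1, beta=0.9)
```

In this sketch, a cross-gradient that points away from the local gradient receives zero weight, which is one simple way a similarity-based calibration could limit the influence of heavily perturbed or strongly heterogeneous (non-IID) contributions.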
