Decentralized learning offers privacy and communication efficiency when data are naturally distributed among agents that communicate over an underlying graph. Motivated by overparameterized learning settings, in which models are trained to zero training loss, we study algorithmic and generalization properties of decentralized learning with gradient descent on separable data. Specifically, for decentralized gradient descent (DGD) and a variety of loss functions that asymptote to zero at infinity (including exponential and logistic losses), we derive novel finite-time generalization bounds. This complements a long line of recent work on the generalization performance and implicit bias of gradient descent over separable data, which has thus far been limited to centralized learning scenarios. Notably, our generalization bounds approximately match, in order, their centralized counterparts. Key to this result, and of independent interest, are novel bounds on the training loss and the rate of consensus of DGD for a class of self-bounded losses. Finally, on the algorithmic front, we design improved gradient-based routines for decentralized learning with separable data and empirically demonstrate orders-of-magnitude speed-ups in both training and generalization performance.
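To make the setting concrete, the following is a minimal sketch of DGD with logistic loss on linearly separable data. It is an illustration under stated assumptions, not the routine analyzed or proposed in this work: the ring topology, the uniform 1/3 mixing weights, the step size, the synthetic data generator, and all function names are hypothetical choices introduced here. Each agent averages its iterate with those of its neighbors through a doubly stochastic mixing matrix and then takes a gradient step on its local loss.

```python
import numpy as np


def ring_mixing_matrix(n_agents):
    """Doubly stochastic mixing matrix for a ring graph (assumed topology).

    Each agent puts weight 1/3 on itself and on each of its two ring neighbors.
    """
    A = np.zeros((n_agents, n_agents))
    for i in range(n_agents):
        A[i, i] += 1.0 / 3.0
        A[i, (i - 1) % n_agents] += 1.0 / 3.0
        A[i, (i + 1) % n_agents] += 1.0 / 3.0
    return A


def make_separable_data(n_agents, n_per_agent, seed=0):
    """Synthetic 2-D data that is linearly separable with margin at least 0.5."""
    rng = np.random.default_rng(seed)
    w_star = np.array([1.0, -1.0]) / np.sqrt(2.0)  # hypothetical separator
    X = rng.normal(size=(n_agents * n_per_agent, 2))
    y = np.sign(X @ w_star)
    y[y == 0] = 1.0
    # Push every point away from the decision boundary to enforce the margin.
    X = X + 0.5 * y[:, None] * w_star
    # Split the samples evenly across agents.
    return np.split(X, n_agents), np.split(y, n_agents)


def logistic_loss(w, X, y):
    """Average logistic loss log(1 + exp(-y * <x, w>))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))


def logistic_grad(w, X, y):
    """Gradient of the average logistic loss with respect to w."""
    margins = y * (X @ w)
    # 1 / (1 + exp(m)) written via tanh for numerical stability.
    weights = 0.5 * (1.0 - np.tanh(margins / 2.0))
    return -((weights * y)[:, None] * X).mean(axis=0)


def dgd(X_parts, y_parts, A, step_size, n_iters):
    """Decentralized gradient descent: mix with neighbors, then step locally."""
    n_agents = len(X_parts)
    dim = X_parts[0].shape[1]
    W = np.zeros((n_agents, dim))  # row i holds agent i's iterate
    for _ in range(n_iters):
        grads = np.stack(
            [logistic_grad(W[i], X_parts[i], y_parts[i]) for i in range(n_agents)]
        )
        W = A @ W - step_size * grads
    return W


if __name__ == "__main__":
    A = ring_mixing_matrix(4)
    X_parts, y_parts = make_separable_data(n_agents=4, n_per_agent=50)
    W = dgd(X_parts, y_parts, A, step_size=0.5, n_iters=300)
    w_bar = W.mean(axis=0)
    X_all, y_all = np.concatenate(X_parts), np.concatenate(y_parts)
    print("training loss of averaged iterate:", logistic_loss(w_bar, X_all, y_all))
    print("consensus error:", np.max(np.linalg.norm(W - w_bar, axis=1)))
```

The two printed quantities correspond to the two ingredients the abstract highlights: the training loss, which tends to zero on separable data, and the consensus error, which measures how far the agents' iterates are from their average.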