Decentralized SGD can run with low communication costs, but its sparse communication deteriorates the convergence rate, especially when the number of nodes is large. In decentralized learning, communication is typically assumed to occur only over a given topology, whereas in many practical cases the topology merely represents a preferred communication pattern, and connecting to arbitrary nodes is still possible. Previous studies have tried to alleviate the convergence rate degradation in such cases by designing topologies with large spectral gaps. However, the degradation remains significant when the number of nodes is large. In this work, we propose TELEPORTATION. TELEPORTATION activates only a subset of nodes, and each active node fetches parameters from a node that was active in the previous round. The active nodes then update their parameters by SGD and perform gossip averaging on a relatively small topology comprising only the active nodes. We show that, by activating an appropriate number of nodes, TELEPORTATION can completely alleviate the convergence rate degradation. Furthermore, we propose an efficient hyperparameter-tuning method for searching for the appropriate number of nodes to activate. Experimentally, we show that TELEPORTATION trains neural networks more stably and achieves higher accuracy than Decentralized SGD.
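To make the update described above concrete, the following is a minimal toy sketch (not the paper's implementation) of one possible TELEPORTATION-style loop: each round, k of the n nodes are activated, each active node inherits parameters from the previous round's active nodes, takes one local SGD step, and then gossips over a small ring among the active nodes. The ring mixing weights, the quadratic local losses in the usage note, and all function names here are illustrative assumptions.

```python
import numpy as np

def ring_gossip_matrix(k):
    # Doubly stochastic mixing matrix for a ring of k active nodes
    # (self weight 1/3, each neighbor 1/3). An assumed choice for this sketch.
    W = np.zeros((k, k))
    for i in range(k):
        W[i, i] += 1 / 3
        W[i, (i - 1) % k] += 1 / 3
        W[i, (i + 1) % k] += 1 / 3
    return W

def teleportation_sketch(local_grads, n, k, d, steps=100, lr=0.1, seed=0):
    """Toy sketch of the update described in the abstract.

    local_grads[i](x) returns a stochastic gradient of node i's local loss at x.
    Each round, k of the n nodes are activated; the j-th active node fetches the
    parameters held by the j-th active node of the previous round, takes one SGD
    step on its own local loss, then performs gossip averaging over a small ring
    topology restricted to the k active nodes.
    """
    rng = np.random.default_rng(seed)
    W = ring_gossip_matrix(k)
    params = np.zeros((k, d))  # parameters carried by the k active "slots"
    for _ in range(steps):
        active = rng.choice(n, size=k, replace=False)  # activate k of n nodes
        x = params.copy()                              # fetch from previous active nodes
        for slot, i in enumerate(active):
            x[slot] -= lr * local_grads[i](x[slot])    # local SGD step
        params = W @ x                                 # gossip on the small topology
    return params.mean(axis=0)

# Usage with hypothetical quadratic local losses f_i(x) = 0.5 * ||x - a_i||^2:
if __name__ == "__main__":
    n, k, d = 20, 5, 3
    rng = np.random.default_rng(1)
    targets = rng.normal(size=(n, d))
    grads = [lambda x, a=a: x - a for a in targets]
    print(teleportation_sketch(grads, n, k, d))
```

Here k (the number of activated nodes) is exactly the hyperparameter the abstract proposes to tune; the sketch only illustrates where it enters the update.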