The rapid development of machine learning and deep learning has introduced increasingly complex optimization challenges. Training modern models is now hardly feasible without multiple computing nodes working in a distributed environment. Distributed optimization is also fundamental to emerging fields such as federated learning. In these settings, the training process must be organized so as to minimize the time lost to communication. A widely used and extensively studied technique for mitigating the communication bottleneck is to perform several local training steps between communication rounds, and this approach is the focus of our paper. At the same time, adaptive methods that incorporate scaling, most prominently Adam, have gained significant popularity in recent years. This paper therefore combines local training with adaptive scaling to develop efficient distributed learning methods. We consider the classical Local SGD method and augment it with scaling. Crucially, the scaling is described generically, which allows us to analyze various approaches, including Adam, RMSProp, and OASIS, in a unified manner. In addition to the theoretical analysis, we validate the performance of our methods in practice by training a neural network.
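To make the combination of local training and generic scaling concrete, the following is a minimal illustrative sketch, not the algorithm as specified in the paper. It assumes a diagonal scaling that is refreshed once per communication round from the averaged gradient, is shared by all workers, and is clipped from below so that steps stay bounded. The function names (`rmsprop_scaling`, `local_sgd_with_scaling`, `make_grad`), the parameter `d_min`, and the toy quadratic objective are all hypothetical choices made for illustration.

```python
import numpy as np


def rmsprop_scaling(v, grad, beta=0.99, d_min=1.0):
    """RMSProp-style rule: returns the new second-moment estimate and a diagonal scaling.

    Assumption: the scaling is clipped from below by d_min to keep steps bounded.
    """
    v_new = beta * v + (1.0 - beta) * grad ** 2
    return v_new, np.maximum(np.sqrt(v_new), d_min)


def local_sgd_with_scaling(grad_fns, x0, rounds=20, local_steps=10, lr=0.1,
                           scaling_fn=rmsprop_scaling):
    """Local SGD where every local step is divided elementwise by a scaling d.

    grad_fns holds one gradient oracle per worker (hypothetical interface).
    scaling_fn is pluggable, mirroring the idea of a generic scaling rule.
    """
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(rounds):
        # Assumption: refresh the scaling once per round from the averaged gradient.
        avg_grad = np.mean([g(x) for g in grad_fns], axis=0)
        v, d = scaling_fn(v, avg_grad)
        local_iterates = []
        for g in grad_fns:
            x_m = x.copy()
            for _ in range(local_steps):
                x_m -= lr * g(x_m) / d  # scaled local step, no communication
            local_iterates.append(x_m)
        # Communication round: average the local iterates.
        x = np.mean(local_iterates, axis=0)
    return x


def make_grad(center, curvature):
    """Gradient of the toy loss 0.5 * sum(curvature * (x - center)**2)."""
    return lambda x: curvature * (x - center)


# Toy problem: two workers with ill-conditioned quadratics and different minimizers.
curvature = np.array([1.0, 10.0])
centers = np.array([[1.0, -3.0], [3.0, -1.0]])
grad_fns = [make_grad(c, curvature) for c in centers]
x_final = local_sgd_with_scaling(grad_fns, x0=np.zeros(2))
```

Because both toy workers share the same curvature, the minimizer of the averaged objective is simply the mean of `centers`, which makes the result easy to check by hand. Swapping `scaling_fn` for a different rule (for example, one with Adam-style bias correction) changes the method without touching the local-step and averaging logic.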