Distributed optimization methods such as DiLoCo have been shown to be effective in training very large models across multiple distributed workers, such as datacenters. These methods split updates into two parts: an inner optimization phase, where the workers independently execute multiple optimization steps on their own local data, and an outer optimization step, where the inner updates are synchronized. While such approaches require orders of magnitude less communication than standard data-parallel training, in settings where the workers are datacenters, even the limited communication requirements of these approaches can still cause significant slowdowns due to the blocking necessary at each outer optimization step. In this paper, we investigate techniques to mitigate this issue by overlapping communication with computation in a manner that allows the outer optimization step to fully overlap with the inner optimization phase. We show that a particular variant, dubbed eager updates, provides competitive performance with standard DiLoCo in settings with low bandwidth between workers.
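The two-phase structure described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `diloco_round` and `worker_targets` are hypothetical names, the per-worker loss is a contrived quadratic, and the inner/outer optimizers are plain SGD rather than the AdamW/Nesterov pairing typically used with DiLoCo. The averaging of worker deltas stands in for the blocking all-reduce that the paper's eager-update variant overlaps with the next inner phase.

```python
def diloco_round(params, worker_targets, inner_steps, lr_inner, lr_outer):
    """One DiLoCo-style round on a scalar toy model.

    Each worker minimizes its own loss 0.5 * (params - target)**2 for
    inner_steps local SGD steps, then reports the delta between the
    shared starting point and its local result (the "pseudo-gradient").
    The outer step averages these deltas across workers; in a real
    deployment this average is a blocking all-reduce between datacenters,
    which is the synchronization cost the eager-updates variant hides.
    """
    deltas = []
    for target in worker_targets:
        local = params
        for _ in range(inner_steps):          # inner optimization phase
            grad = local - target             # gradient of the toy loss
            local -= lr_inner * grad          # local SGD step
        deltas.append(params - local)         # worker's pseudo-gradient
    outer_grad = sum(deltas) / len(deltas)    # stands in for the all-reduce
    return params - lr_outer * outer_grad     # outer optimization step

# Two workers whose local optima are 1.0 and 3.0; the synchronized
# model should converge to the consensus value 2.0.
p = 0.0
for _ in range(3):
    p = diloco_round(p, [1.0, 3.0], inner_steps=10,
                     lr_inner=0.5, lr_outer=1.0)
```

In the eager-updates variant studied in the paper, a worker would apply its own delta immediately and fold in the averaged delta from the other workers one round late, so the all-reduce can run concurrently with the next inner phase instead of blocking it.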