Distributed machine learning has recently become a critical paradigm for training large models on vast datasets. We examine the stochastic optimization problem for deep learning within synchronous parallel computing environments under communication constraints. While averaging distributed gradients is the most widely used method for gradient estimation, whether this is the optimal strategy remains an open question. In this work, we analyze the distributed gradient aggregation process through the lens of subspace optimization. By formulating the aggregation as an objective-aware subspace optimization problem, we derive an efficient weighting scheme for gradients, guided by subspace coefficients. We further introduce subspace momentum to accelerate convergence while maintaining statistical unbiasedness in the aggregation. Our method demonstrates improved performance over the ubiquitous gradient averaging on multiple MLPerf tasks while remaining extremely efficient in both communication and computation.
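To make the contrast with plain averaging concrete, the following is a minimal sketch (not the paper's actual algorithm) of objective-aware subspace aggregation on a toy quadratic: the worker gradients span a small subspace, and the combination coefficients are chosen to minimize the objective within that subspace instead of being fixed at 1/n. All names (A, b, G, alpha) and the quadratic model are illustrative assumptions.

```python
import numpy as np

# Toy quadratic objective f(w) = 0.5 * w^T A w - b^T w, so grad f(w) = A w - b.
rng = np.random.default_rng(0)
d, n_workers = 20, 4
A = rng.standard_normal((d, d))
A = A @ A.T + np.eye(d)           # symmetric positive definite
b = rng.standard_normal(d)
w = rng.standard_normal(d)

# Each worker computes a noisy stochastic gradient at the current iterate.
grads = [(A @ w - b) + 0.1 * rng.standard_normal(d) for _ in range(n_workers)]
G = np.stack(grads, axis=1)       # d x n_workers matrix whose columns span the subspace

# Plain averaging corresponds to uniform coefficients alpha_i = 1/n.
avg_dir = G.mean(axis=1)

# Objective-aware coefficients: minimize the quadratic model of f over the
# subspace span{g_1, ..., g_n}, i.e. solve argmin_alpha f(w - G alpha).
# For this quadratic the optimality condition is (G^T A G) alpha = G^T (A w - b).
alpha = np.linalg.solve(G.T @ A @ G, G.T @ (A @ w - b))
subspace_dir = G @ alpha

f = lambda x: 0.5 * x @ A @ x - b @ x
print("f before step:        ", f(w))
print("f after averaged step:", f(w - 0.1 * avg_dir))
print("f after subspace step:", f(w - subspace_dir))
```

The coefficient solve involves only an n_workers-by-n_workers system, which hints at why an objective-aware weighting can stay cheap in both communication and computation relative to the per-worker gradient exchange itself.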