Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks

Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters is widely used to accelerate the training of deep models. However, SGD uses only first-order gradient information in its parameter updates, and training may still take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up training, and Kronecker-Factored Approximate Curvature (KFAC) has emerged as one of the most efficient approximation algorithms for deep models. Yet when GPU clusters are used to train models with distributed KFAC (D-KFAC), each iteration incurs extensive computation and introduces extra communication. In this work, we propose SPD-KFAC, a D-KFAC scheme with smart parallelism of computing and communication tasks, to reduce the iteration time. Specifically, 1) we characterize the performance bottlenecks of D-KFAC, 2) we design and implement a pipelining mechanism for Kronecker factor computation and communication with dynamic tensor fusion, and 3) we develop a load-balancing placement for inverting multiple matrices on GPU clusters. We conduct real-world experiments on a 64-GPU cluster with a 100Gb/s InfiniBand interconnect. Experimental results show that the proposed SPD-KFAC training scheme achieves a 10%-35% improvement over state-of-the-art algorithms.
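To make the third contribution concrete, the following is a minimal sketch of one way to balance matrix-inversion work across GPUs. It is not the placement algorithm of SPD-KFAC, which the abstract does not specify. It is a generic greedy longest-processing-time heuristic, and it assumes that inverting a d x d Kronecker factor costs roughly d^3 operations. The function name `place_inversions` and the example dimensions are hypothetical.

```python
import heapq


def place_inversions(dims, num_gpus):
    """Greedy placement of matrix inversions onto GPUs (illustrative only).

    dims: list of matrix dimensions d (each matrix is d x d).
    Returns (placement, loads), where placement maps a matrix index to a GPU
    id and loads[g] is the estimated total cost assigned to GPU g.
    """
    # Assumed cost model: inverting a d x d matrix costs about d**3.
    cost = [d ** 3 for d in dims]

    # Min-heap of (current load, gpu id): the least loaded GPU is popped first.
    heap = [(0, g) for g in range(num_gpus)]
    heapq.heapify(heap)

    placement = {}
    # Handle the most expensive matrices first so that small ones fill gaps.
    for i in sorted(range(len(dims)), key=lambda i: cost[i], reverse=True):
        load, gpu = heapq.heappop(heap)
        placement[i] = gpu
        heapq.heappush(heap, (load + cost[i], gpu))

    loads = [0] * num_gpus
    for i, gpu in placement.items():
        loads[gpu] += cost[i]
    return placement, loads


if __name__ == "__main__":
    # Hypothetical factor dimensions, two GPUs.
    placement, loads = place_inversions([4, 3, 2, 2, 1], num_gpus=2)
    print(placement, loads)
```

Under the assumed cubic cost, assigning the same five matrices round-robin by index would give estimated loads of 73 and 35, whereas the greedy placement gives 64 and 44. The iteration waits on the most heavily loaded GPU, so a lower maximum load shortens the inversion step.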
