Distributed machine learning has become increasingly important as generative models have grown in scale. Both model parameters and data are distributed across many compute devices, which requires frequent collective communication to synchronize activations and parameter updates. This collective communication has become a major bottleneck. Although the performance of a collective algorithm depends on the physical network topology, the baseline algorithms in collective communication libraries are largely topology-agnostic. Collective algorithm synthesizers address this inefficiency by automatically generating topology-aware collective algorithms. However, prior work has largely overlooked that collective communication typically occurs only among a subset of devices, known as a process group. Additionally, most existing synthesizers are limited in the range of collective patterns they can generate. We propose PCCL, a scalable and generic framework for synthesizing topology-aware collective algorithms. PCCL is process group-aware and can generate near-optimal collective algorithms even when only a subset of devices participates in a collective operation. PCCL synthesizes arbitrary collective patterns and scales to large systems, synthesizing a 512-NPU All-to-All algorithm in 11.68 minutes.
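To make the process-group issue concrete, the following is a minimal illustrative sketch, not PCCL's synthesis method. It assumes a hypothetical 8-NPU bidirectional physical ring and a 4-member process group, and compares the physical hop count of a logical ring built in rank order (topology-agnostic) against one built in physical order (topology-aware). The helper names `ring_topology`, `hops`, and `logical_ring_hops` are invented for this example.

```python
from collections import deque


def ring_topology(n):
    """Hypothetical physical topology: bidirectional ring of n devices."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def hops(topology, src, dst):
    """Shortest-path hop count between two devices, found by BFS."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in topology[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError(f"{dst} is unreachable from {src}")


def logical_ring_hops(topology, group):
    """Physical hops for each neighbor-to-neighbor send in a logical ring
    that visits the process group members in the given order."""
    size = len(group)
    return [hops(topology, group[i], group[(i + 1) % size]) for i in range(size)]


topology = ring_topology(8)

# Assumed rank order of the process group, unrelated to physical placement.
rank_order = [0, 4, 2, 6]
# The same members, ordered by their position on the physical ring.
physical_order = sorted(rank_order)

agnostic = logical_ring_hops(topology, rank_order)       # [4, 2, 4, 2]
aware = logical_ring_hops(topology, physical_order)      # [2, 2, 2, 2]

print("topology-agnostic hops per step:", agnostic, "total:", sum(agnostic))
print("topology-aware hops per step:   ", aware, "total:", sum(aware))
```

In this toy setting the rank-ordered ring crosses 12 physical hops per ring step while the physically ordered ring crosses 8, and every send also passes through devices outside the group. A real synthesizer would additionally have to account for link bandwidth, contention, and chunk scheduling, which this sketch ignores.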