Pipette: Automatic Fine-grained Large Language Model Training Configurator for Real-World Clusters

Training large language models (LLMs) is challenging because of their huge computational and memory capacity requirements. To address these issues, it is common to use a cluster of GPUs with 3D parallelism, which splits a model along the data batch, pipeline stage, and intra-layer tensor dimensions. However, 3D parallelism introduces the additional challenge of finding the optimal number of ways to split along each dimension and of mapping the split model onto the GPUs. Several previous studies have attempted to find the optimal configuration automatically, but many of them overlook important aspects. For instance, the heterogeneous nature of interconnect speeds is often ignored. While the peak bandwidths of the interconnects are usually provisioned to be equal, the bandwidth actually attained varies per link in real-world clusters. Combined with critical path modeling that does not properly account for communication, these approaches easily fall into sub-optimal configurations. In addition, they often fail to consider the memory requirement per GPU, and so frequently recommend configurations that cannot be executed. To address these challenges, we propose Pipette, an automatic fine-grained LLM training configurator for real-world clusters. By devising better performance models along with a memory estimator and fine-grained individual GPU assignment, Pipette finds faster configurations that satisfy the memory constraints. We evaluated Pipette on large clusters to show that it provides a significant speedup over the prior art. The implementation of Pipette is available at https://github.com/yimjinkyu1/date2024_pipette.
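
To make the configuration problem concrete, the following is a minimal sketch of the search space described above: it enumerates every (data, pipeline, tensor) split of a fixed GPU count and discards the splits whose estimated per-GPU memory exceeds the device capacity. It is not Pipette's algorithm. The function names, the requirement that the layer count be divisible by the pipeline degree, and the toy memory estimate (16 bytes of model state per parameter for mixed-precision Adam, with activations ignored) are all assumptions made for illustration. The sketch also leaves out per-link bandwidth and the assignment of individual GPUs.

```python
def toy_mem_gib(params_billion):
    """Return a hypothetical per-GPU memory estimator (in GiB).

    Assumption: 16 bytes of model state per parameter (mixed-precision
    Adam), sharded evenly across pipeline and tensor ranks. Activations
    and communication buffers are ignored.
    """
    total_bytes = params_billion * 1e9 * 16

    def estimate(dp, pp, tp):
        # Data parallelism replicates the model, so dp does not reduce memory here.
        return total_bytes / (pp * tp) / 2**30

    return estimate


def enumerate_3d_configs(num_gpus, num_layers, gpu_mem_gib, est_mem_gib):
    """List every (dp, pp, tp) with dp * pp * tp == num_gpus that fits in memory."""
    configs = []
    for pp in range(1, num_gpus + 1):
        # Assumption: pipeline stages hold an equal number of layers.
        if num_gpus % pp or num_layers % pp:
            continue
        rest = num_gpus // pp
        for tp in range(1, rest + 1):
            if rest % tp:
                continue
            dp = rest // tp
            # Drop configurations that could not be executed on one GPU.
            if est_mem_gib(dp, pp, tp) <= gpu_mem_gib:
                configs.append((dp, pp, tp))
    return configs


if __name__ == "__main__":
    # Hypothetical setup: 8 GPUs with 40 GiB each, a 24-layer, 10B-parameter model.
    for dp, pp, tp in enumerate_3d_configs(8, 24, 40.0, toy_mem_gib(10)):
        print(f"dp={dp} pp={pp} tp={tp}")
```

A real configurator would rank the surviving candidates with a performance model. This sketch only shows why a memory check is needed before ranking: a pure data-parallel split such as (8, 1, 1) is a valid factorization but does not fit under the toy estimate.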
