The scale of LLM training jobs requires parallelization planning over large GPU clusters. Because GPU types and interconnects are added over time, these clusters are increasingly heterogeneous. Automatic LLM parallelizers can search for parallelization plans, but with heterogeneous GPUs the search space explodes. To keep search tractable in heterogeneous GPU clusters, parallelizers often omit types of parallelism (e.g., expert parallelism) or memory-saving techniques (e.g., ZeRO), which results in worse plans. We describe Tangram, a system that enables the use of existing heterogeneity-unaware LLM parallelizers in heterogeneous GPU clusters by decoupling parallelization planning from GPU heterogeneity. To do so, Tangram exploits two insights: (1) since bulk purchases result in sets of GPUs with similar compute, memory, and connectivity, Tangram can expose such homogeneous GPU islands to existing parallelizers; and (2) since parallelizers commonly first partition models and then parallelize the partitions, Tangram can compose such model slices, each assigned to a GPU island, into work-balanced pipelines for high throughput. Tangram integrates with existing parallelizers through a narrow API, which relies on the enumeration of model-slice/island pairs. Tangram achieves up to 2.3x higher training throughput than current heterogeneous parallelizers (Metis and Sailor) and scales to large GPU clusters by pruning enumerated plans.
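To make the idea of enumerating model-slice/island pairs and composing them into a work-balanced pipeline concrete, the following is a minimal sketch under stated assumptions. It is not Tangram's algorithm or API. The function name `compose_pipeline`, the `slice_time` callback (standing in for a query to an existing heterogeneity-unaware parallelizer), the island names, and the speed numbers are all hypothetical. The sketch assumes that every island receives one contiguous, non-empty slice of layers and that pipeline throughput is limited by the slowest stage.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

# A pipeline stage: (first layer, end layer (exclusive), island name).
Stage = Tuple[int, int, str]


def compose_pipeline(
    num_layers: int,
    islands: Sequence[str],
    slice_time: Callable[[int, int, str], float],
) -> Tuple[float, List[Stage]]:
    """Enumerate slice/island pairs and return the plan whose slowest
    stage is fastest. `slice_time(start, end, island)` is a hypothetical
    callback that would wrap an existing parallelizer's cost estimate."""
    best_time = float("inf")
    best_plan: List[Stage] = []

    def extend(start: int, remaining: Sequence[str],
               bottleneck: float, plan: List[Stage]) -> None:
        nonlocal best_time, best_plan
        if not remaining:
            # All islands placed and all layers covered: record the plan.
            best_time, best_plan = bottleneck, list(plan)
            return
        island = remaining[0]
        if len(remaining) == 1:
            ends = [num_layers]  # the last island takes all remaining layers
        else:
            # Leave at least one layer for every later island.
            ends = range(start + 1, num_layers - len(remaining) + 2)
        for end in ends:
            t = max(bottleneck, slice_time(start, end, island))
            if t >= best_time:
                continue  # prune: cannot beat the best plan found so far
            plan.append((start, end, island))
            extend(end, remaining[1:], t, plan)
            plan.pop()

    # Try every ordering of islands along the pipeline (fine for few islands).
    for order in permutations(islands):
        extend(0, order, 0.0, [])
    return best_time, best_plan


# Illustrative, made-up speeds in layers per second for two islands.
speeds = {"A100": 2.0, "V100": 1.0}
time, plan = compose_pipeline(
    6, list(speeds), lambda s, e, isl: (e - s) / speeds[isl]
)
print(time, plan)
```

In this toy setting the faster island receives more layers, so both stages take the same time. A real planner would obtain `slice_time` from the underlying parallelizer rather than from a fixed speed table, and would need stronger pruning than the simple bound used here, since the number of island orderings grows factorially.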