Efficiently training large-scale models (LMs) in GPU clusters involves two separate avenues: inter-job dynamic scheduling and intra-job adaptive parallelism (AP). However, existing dynamic schedulers struggle with large-model scheduling due to the mismatch between static parallelism (SP)-aware scheduling and AP-based execution, leading to cluster inefficiencies such as degraded throughput and prolonged job queuing. This paper presents Arena, a large-model training system that co-designs dynamic scheduling and adaptive parallelism to achieve high cluster efficiency. To reduce scheduling costs while improving decision quality, Arena designs low-cost, disaggregated profiling and AP-tailored, load-aware performance estimation, and unifies them by sharding the joint scheduling-parallelism optimization space via a grid abstraction. Building on this, Arena dynamically schedules profiled jobs along the elasticity and heterogeneity dimensions, and executes them with efficient AP over a pruned search space. Evaluated on heterogeneous testbeds and production workloads, Arena reduces job completion time by up to $49.3\%$ and improves cluster throughput by up to $1.60\times$.