The rapid evolution of large language models (LLMs) has made geographically distributed training necessary due to GPU scarcity within a single cloud region. In such cross-region settings, Pipeline Parallelism (PP) is communication-efficient, yet scheduling PP remains challenging under heterogeneous inter-region bandwidth and regional electricity prices. Existing schedulers are either delay-first, incurring high electricity cost, or cost-first, relying on rigid resource allocation that prolongs Job Completion Time (JCT). They are also ineffective at optimizing execution order in multi-tenant environments, where long-running and bandwidth-intensive jobs can cause head-of-line (HoL) blocking and degrade overall performance. To this end, we propose BACE-Pipe, a bandwidth-aware and cost-efficient pipeline scheduling framework for LLM training across geo-distributed clusters. BACE-Pipe first introduces a dynamic job prioritization mechanism that optimizes execution order by jointly considering job characteristics (e.g., computation time) and real-time network utilization. It then employs a bandwidth-aware pathfinder to identify feasible cross-region pipeline paths that satisfy communication constraints, thereby preventing communication from stalling the pipeline. Among all feasible paths, a cost-minimizing allocator determines the optimal GPU placement strategy by preferentially assigning resources to regions with lower electricity prices. Consequently, BACE-Pipe mitigates HoL blocking, improves resource utilization, and simultaneously reduces both JCT and total electricity cost. Extensive simulations show that BACE-Pipe reduces average JCT by 27.9%–64.7% and total electricity cost by 12.6%–30.6% compared with state-of-the-art baselines.
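The two-step placement idea (first keep only cross-region paths whose links meet a bandwidth requirement, then fill the cheapest regions first) can be illustrated with a minimal Python sketch. This is not the BACE-Pipe implementation: the function names, the brute-force path enumeration, the `max_hops` limit, the rule that every region on a path hosts at least one GPU, and the use of a per-GPU price as the cost proxy are all assumptions made for illustration.

```python
from itertools import permutations


def feasible_paths(regions, bandwidth, min_bw, max_hops=3):
    """Yield ordered region tuples whose consecutive links all meet min_bw.

    Hypothetical helper: brute-force enumeration, only sensible for a
    handful of regions.
    """
    for k in range(1, max_hops + 1):
        for path in permutations(regions, k):
            links = zip(path, path[1:])
            if all(bandwidth.get(link, 0) >= min_bw for link in links):
                yield path


def cheapest_allocation(path, free_gpus, price, gpus_needed):
    """Place one GPU in every region on the path, then fill cheapest first.

    Returns (allocation, cost), or None if the path cannot host the job.
    """
    if gpus_needed < len(path) or any(free_gpus[r] < 1 for r in path):
        return None
    alloc = {r: 1 for r in path}
    remaining = gpus_needed - len(path)
    for r in sorted(path, key=lambda name: price[name]):
        take = min(remaining, free_gpus[r] - 1)
        alloc[r] += take
        remaining -= take
    if remaining > 0:
        return None
    cost = sum(alloc[r] * price[r] for r in path)
    return alloc, cost


def place_job(regions, bandwidth, free_gpus, price, gpus_needed, min_bw):
    """Return the (path, allocation, cost) with the lowest cost, or None."""
    best = None
    for path in feasible_paths(regions, bandwidth, min_bw):
        result = cheapest_allocation(path, free_gpus, price, gpus_needed)
        if result is None:
            continue
        alloc, cost = result
        if best is None or cost < best[2]:
            best = (path, alloc, cost)
    return best
```

In this sketch the bandwidth filter runs before any cost comparison, so a cheap region behind a slow link is never chosen; the greedy fill is cost-optimal only under the stated simplifications (fixed per-GPU price, no per-stage memory or compute constraints).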