BandPilot: Towards Performance- and Contention-Aware GPU Dispatching in AI Clusters

Modern multi-tenant AI clusters are increasingly communication-bound, driven by high-volume, multi-round GPU-to-GPU collective communication. Consequently, the GPU dispatcher's choice of a physical GPU subset for each tenant largely determines the job's effective collective bandwidth and thus its performance ceiling. Existing dispatchers predominantly rely on static, topology-aware heuristics that prioritize GPU resource compactness, assuming that minimizing physical distance maximizes communication bandwidth. However, we reveal that this assumption often fails due to complex system-level bottlenecks, such as non-linear NIC saturation and inter-node link heterogeneity. This paper presents BandPilot, a performance- and contention-aware GPU dispatching primitive that optimizes effective collective bandwidth for multi-tenant AI clusters. Specifically, BandPilot learns a data-efficient bandwidth model from sparse NCCL measurements via a hierarchical design. Guided by the model, a fast hybrid search combines an equilibrium-driven constructor with a pruned elimination search to navigate the combinatorial allocation space in real time. To account for multi-tenant interference, BandPilot virtually merges a candidate allocation with co-located cross-host jobs to conservatively estimate shared bottleneck capacity and predict contention-degraded bandwidth. Across a 32-GPU H100 cluster and heterogeneous simulations, BandPilot achieves 92-97% bandwidth efficiency relative to the best-found reference, improving average efficiency by 20-40% over topology-compactness heuristics.
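
The virtual-merge idea for contention can be illustrated with a minimal Python sketch. The abstract does not specify data structures or the exact estimation rule, so everything below is an assumption made for illustration: the `(host, index)` GPU encoding, the function names, the rule of taking the minimum of the isolated and merged predictions, and the toy `toy_predict` model (which stands in for the learned bandwidth model and uses made-up capacities).

```python
from collections import Counter
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Gpu = Tuple[str, int]  # hypothetical encoding: (host name, local GPU index)


def hosts_of(gpus: Iterable[Gpu]) -> Set[str]:
    """Return the set of hosts touched by a GPU set."""
    return {host for host, _ in gpus}


def is_cross_host(gpus: Iterable[Gpu]) -> bool:
    """A job is cross-host if its GPUs span more than one host."""
    return len(hosts_of(gpus)) > 1


def contention_aware_bandwidth(
    candidate: Set[Gpu],
    running_jobs: Iterable[Set[Gpu]],
    predict: Callable[[FrozenSet[Gpu]], float],
) -> float:
    """Estimate a candidate's bandwidth under contention (illustrative only).

    The candidate is virtually merged with every running cross-host job that
    shares a host with it, and the model is queried on the merged set.
    Taking the minimum with the isolated prediction keeps the estimate
    conservative; that rule is an assumption of this sketch.
    """
    merged = set(candidate)
    if is_cross_host(candidate):
        candidate_hosts = hosts_of(candidate)
        for job in running_jobs:
            if is_cross_host(job) and hosts_of(job) & candidate_hosts:
                merged |= set(job)
    isolated = predict(frozenset(candidate))
    return min(isolated, predict(frozenset(merged)))


def toy_predict(gpus: FrozenSet[Gpu]) -> float:
    """Toy stand-in for a bandwidth model, with made-up capacities (GB/s).

    Single-host sets get a fixed intra-node figure; cross-host sets are
    limited by the host whose NIC is shared by the most GPUs.
    """
    per_host = Counter(host for host, _ in gpus)
    if len(per_host) == 1:
        return 400.0
    return min(100.0 / count for count in per_host.values())


candidate = {("a", 0), ("b", 0)}
running = [{("a", 1), ("a", 2), ("b", 1)}]
estimate = contention_aware_bandwidth(candidate, running, toy_predict)
```

In this toy setup the candidate alone would be predicted at 100 GB/s, but the merged set places three GPUs behind host `a`'s NIC, so the conservative estimate falls to roughly 33 GB/s. A dispatcher could use such an estimate to rank candidate allocations instead of ranking them by compactness alone.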
