We present MonkeyTree, the first system to mitigate network congestion in multi-tenant GPU clusters through job-migration-based defragmentation rather than network-layer techniques. As cloud operators co-locate ML training jobs on shared, oversubscribed networks, congestion degrades training throughput for over a third of jobs. Prior approaches either rely on routing and flow scheduling--which we show have fundamental limits when traffic exceeds capacity--or require costly full-bisection-bandwidth topologies with packet spraying. MonkeyTree exploits characteristics of ML training traffic: ring-based collectives generate exactly one cross-rack flow per rack a job spans, making congestion-free placements achievable. The sparse constraint structure admits abundant valid configurations, making them reachable with few migrations. Once reached, low fragmentation is self-reinforcing, as new arrivals disturb only a few racks. MonkeyTree formulates defragmentation as an integer linear program that minimizes worker movements subject to per-rack fragmentation bounds. We prove a tight bound showing that any placement can be defragmented to at most two cross-rack fragments per ToR, and extend the formulation to hybrid parallelism with multiple rings per server. Migration is implemented via in-memory checkpoint-and-restore over RDMA, incurring only 9.02 seconds of end-to-end system overhead per worker. We evaluate MonkeyTree using a custom simulator modeling clusters of up to 2,048 H200 GPUs and prototype it on a five-node A100 testbed. MonkeyTree improves average job completion time by 14 percent over the next best baseline on a cluster of 1,024 GPUs with a 4:1 oversubscription ratio. With a high 16:1 oversubscription ratio and 2,048 GPUs, MonkeyTree keeps p99 job completion time within 5 percent of ideal.
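The defragmentation objective above can be illustrated with a toy brute-force sketch. The paper solves this as an integer linear program; here, purely for intuition, we enumerate placements on a tiny hypothetical cluster. The rack count, capacities, job sizes, and the fragment definition (a job spanning more than one rack contributes one fragment to each rack it touches) are illustrative assumptions, not the paper's exact formulation.

```python
from itertools import product

# Hypothetical toy cluster: 3 racks, 4 worker slots per rack.
RACK_CAP = 4
N_RACKS = 3

def compositions(n, k):
    """All ways to split n workers across k racks (non-negative counts)."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def fragments_per_rack(placement):
    """A job spanning >1 rack contributes one fragment to each rack it touches."""
    frags = [0] * N_RACKS
    for counts in placement:
        racks = [r for r, c in enumerate(counts) if c > 0]
        if len(racks) > 1:
            for r in racks:
                frags[r] += 1
    return frags

def migrations(old, new):
    """Workers that must move: per job, sum of worker-count decreases per rack."""
    return sum(
        sum(max(0, o - n) for o, n in zip(oc, nc))
        for oc, nc in zip(old, new)
    )

def defragment(current, job_sizes, max_frags=2):
    """Brute-force analogue of the ILP: minimize migrations subject to
    rack capacity and a per-rack fragmentation bound."""
    options = [list(compositions(s, N_RACKS)) for s in job_sizes]
    best, best_cost = None, None
    for placement in product(*options):
        if any(sum(p[r] for p in placement) > RACK_CAP for r in range(N_RACKS)):
            continue
        if any(f > max_frags for f in fragments_per_rack(placement)):
            continue
        cost = migrations(current, placement)
        if best_cost is None or cost < best_cost:
            best, best_cost = placement, cost
    return best, best_cost

# Four 2-worker jobs; rack 0 currently hosts fragments of three jobs,
# violating the two-fragments-per-ToR bound.
current = [(1, 1, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)]
best, cost = defragment(current, [2, 2, 2, 2], max_frags=2)
print(cost)                       # a single worker move suffices here
print(fragments_per_rack(best))   # every rack now has at most 2 fragments
```

Consolidating one straddling job into a single rack removes a fragment from two racks at once, which is why very few moves are needed; this mirrors the abstract's claim that valid configurations are abundant and cheap to reach.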