TENT: A Declarative Slice Spraying Engine for Performant and Resilient Data Movement in Disaggregated LLM Serving

Feng Ren, Ruoyu Qin, Teng Ma, Shangming Cai, Zheng Liu, Chao Lei, Dejiang Zhu, Ke Yang, Zheming Li, Jialei Cui, Weixiao Huang, Yikai Zhao, Yineng Zhang, Hao Wu, Xiang Gao, Yuhao Fu, Jinlei Jiang, Yongwei Wu, Mingxing Zhang

Modern GPU clusters are built upon a complex hierarchy of heterogeneous interconnects, ranging from multi-rail RDMA to proprietary fabrics such as Multi-Node NVLink and Ascend UB. Orchestrating these diverse links effectively remains a critical challenge in disaggregated LLM serving. Operating Mooncake TE on thousands of GPUs exposed a critical limitation shared by existing frameworks: imperative, statically bound path selection. This rigidity forces engines to rely on state-blind striping that ignores congestion signals, creating communication silos, wasting multi-rail bandwidth due to head-of-line blocking, and leading to operational fragility where routine faults require manual intervention. We present TENT, a data-movement engine that decouples transfer intent from physical execution. Instead of locking workloads to fixed backends, TENT unifies heterogeneous interconnects into a single dynamic resource pool. Applications simply declare transfer intents, while TENT dynamically decomposes elephant flows into fine-grained slices and "sprays" them across links based on instantaneous link quality. This telemetry-driven orchestration eliminates head-of-line blocking and enables transparent, sub-50 ms self-healing by rerouting slices around failures without application logic. TENT serves as the production data plane for LLM inference and RL pipelines at multiple industrial sites. Our evaluation on H800 HGX clusters shows that TENT outperforms state-of-the-art baselines, including Mooncake TE, NIXL, and UCCL. In LLM inference with SGLang HiCache, TENT achieves up to 1.36x higher throughput and 26% lower P90 TTFT than Mooncake TE. In RL pipelines, TENT accelerates parameter updates in Moonshot Checkpoint Engine by 20-26%.
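To make the contrast between state-blind striping and telemetry-driven slice spraying concrete, the following is a minimal Python sketch. It is not TENT's scheduler: the `Link` model, the `finish_time` estimate, and the greedy "earliest estimated finish" rule are all assumptions chosen for illustration, and the bandwidth and queue figures stand in for whatever link-quality telemetry a real engine would collect.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical link model; fields stand in for real telemetry."""
    name: str
    bandwidth: float       # bytes per second
    queued: int = 0        # bytes already assigned to this link
    healthy: bool = True   # False once a fault has been detected

    def finish_time(self, extra: int = 0) -> float:
        # Estimated time to drain the queue plus `extra` more bytes.
        return (self.queued + extra) / self.bandwidth


def split(total: int, slice_size: int) -> list[int]:
    """Decompose one large transfer into fixed-size slices (last may be short)."""
    full, rest = divmod(total, slice_size)
    return [slice_size] * full + ([rest] if rest else [])


def stripe(total: int, slice_size: int, links: list[Link]) -> dict[str, int]:
    """State-blind baseline: round-robin over links, ignoring load and health."""
    plan = {link.name: 0 for link in links}
    for i, size in enumerate(split(total, slice_size)):
        link = links[i % len(links)]
        link.queued += size
        plan[link.name] += size
    return plan


def spray(total: int, slice_size: int, links: list[Link]) -> dict[str, int]:
    """Assumed spraying policy: send each slice to the healthy link that
    would finish it earliest according to current (simulated) telemetry."""
    plan = {link.name: 0 for link in links}
    for size in split(total, slice_size):
        candidates = [link for link in links if link.healthy]
        if not candidates:
            raise RuntimeError("no healthy link available")
        best = min(candidates, key=lambda link: link.finish_time(size))
        best.queued += size
        plan[best.name] += size
    return plan


def makespan(links: list[Link]) -> float:
    """Time at which the slowest link finishes its assigned bytes."""
    return max(link.finish_time() for link in links)
```

In this toy model, striping a transfer evenly over one fast and one slow link leaves the fast link idle while the slow one drains, whereas `spray` shifts most slices to the fast link and skips any link marked unhealthy. Real engines must also handle in-flight slices on a failing link, completion ordering, and stale telemetry, none of which this sketch attempts.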
