Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs behind a uniform interface. TransferEngine exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions about the network transport, and transparently manages multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase TransferEngine through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates completing in 1.3 seconds for trillion-parameter models, and (3) an MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in.
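The following is a minimal, hypothetical sketch (in Rust) of the semantics the abstract describes: a one-sided WriteImm that deposits data into a peer's buffer tagged with a 32-bit immediate, and an ImmCounter the receiver polls for completion without assuming any ordering between individual writes. All names, types, and the in-process loopback "engine" are assumptions for illustration, not the paper's actual API.

```rust
// Hypothetical TransferEngine-style interface sketch; names are illustrative only.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Counter keyed by immediate value; the receiver waits until enough
/// WriteImm completions carrying that immediate have arrived.
#[derive(Default)]
struct ImmCounter {
    counts: Mutex<HashMap<u32, u64>>,
}

impl ImmCounter {
    fn bump(&self, imm: u32) {
        *self.counts.lock().unwrap().entry(imm).or_insert(0) += 1;
    }
    fn reached(&self, imm: u32, expected: u64) -> bool {
        self.counts.lock().unwrap().get(&imm).copied().unwrap_or(0) >= expected
    }
}

/// In-process stand-in for a remote peer: a registered buffer plus its ImmCounter.
struct Peer {
    buffer: Mutex<Vec<u8>>,
    imm_counter: Arc<ImmCounter>,
}

/// Loopback "engine": copies bytes into the peer's buffer and bumps the counter,
/// mimicking a one-sided RDMA write-with-immediate. A real engine would spread
/// transfers across multiple NICs per GPU, and writes may complete in any order.
struct LoopbackEngine;

impl LoopbackEngine {
    fn write_imm(&self, peer: &Peer, offset: usize, data: &[u8], imm: u32) {
        let mut buf = peer.buffer.lock().unwrap();
        buf[offset..offset + data.len()].copy_from_slice(data);
        peer.imm_counter.bump(imm); // completion notification; no ordering implied
    }
}

fn main() {
    let peer = Peer {
        buffer: Mutex::new(vec![0u8; 64]),
        imm_counter: Arc::new(ImmCounter::default()),
    };
    let engine = LoopbackEngine;

    // Sender: two independent one-sided writes tagged with the same immediate.
    engine.write_imm(&peer, 0, b"kv-page-0", 7);
    engine.write_imm(&peer, 32, b"kv-page-1", 7);

    // Receiver: waits only on the count for immediate 7, not on write order.
    assert!(peer.imm_counter.reached(7, 2));
    println!("both writes for imm=7 have landed");
}
```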