Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), which hinders integration into inference engines and portability across hardware providers. We present fabric-lib, which bridges the functionality of common NICs to expose a uniform interface. fabric-lib exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, makes no ordering assumptions about the network transport, and transparently manages multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase fabric-lib through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) reinforcement learning (RL) weight updates that complete in 1.3 seconds for trillion-parameter models, and (3) an MoE dispatch/combine implementation that outperforms DeepEP in decode latency on ConnectX-7 and achieves the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in. fabric-lib is open-sourced at https://github.com/perplexityai/pplx-garden/
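To illustrate why a counter-based completion primitive does not need ordered delivery, the following is a minimal Python sketch of the idea, not fabric-lib's actual API. It assumes that each incoming WriteImm carries an immediate value, that the receiver counts arrivals per immediate, and that completion fires once an expected count is reached. The method names `expect` and `on_write_imm` and the helper `simulate_transfer` are hypothetical and chosen only for this illustration.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class ImmCounter:
    """Toy model (assumption, not the real fabric-lib type): counts
    WriteImm arrivals per immediate value and fires a callback once
    the expected number of writes has landed."""

    def __init__(self) -> None:
        self._counts: Dict[int, int] = defaultdict(int)
        self._waiters: Dict[int, Tuple[int, Callable[[], None]]] = {}

    def on_write_imm(self, imm: int) -> None:
        # Called once per received write; arrival order is irrelevant.
        self._counts[imm] += 1
        self._maybe_fire(imm)

    def expect(self, imm: int, expected: int,
               on_complete: Callable[[], None]) -> None:
        # Registration may happen before or after the writes arrive.
        self._waiters[imm] = (expected, on_complete)
        self._maybe_fire(imm)

    def _maybe_fire(self, imm: int) -> None:
        waiter = self._waiters.get(imm)
        if waiter is not None and self._counts[imm] >= waiter[0]:
            del self._waiters[imm]
            del self._counts[imm]
            waiter[1]()


def simulate_transfer(num_chunks: int, seed: int) -> List[str]:
    """Deliver num_chunks writes in a shuffled order, as might happen
    when chunks are spread over several NICs or an unordered transport."""
    events: List[str] = []
    counter = ImmCounter()
    imm = 7  # arbitrary immediate value identifying this transfer
    counter.expect(imm, num_chunks, lambda: events.append("complete"))

    arrivals = list(range(num_chunks))
    random.Random(seed).shuffle(arrivals)
    for chunk in arrivals:
        events.append(f"chunk{chunk}")
        counter.on_write_imm(imm)
    return events


if __name__ == "__main__":
    print(simulate_transfer(4, seed=0))
```

Because completion depends only on the count, the receiver never has to assume that the last write posted is the last write delivered. That is the property needed when the transport or a multi-NIC split may reorder chunks.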