Reducing collective communication latency is a critical goal for large model training and inference in both academia and industry. Many-to-many communications, such as AllGather and AlltoAll (dispatch), are core components of modern parallelization strategies. State-of-the-art implementations of these communications rely on unicast-based writes and therefore transmit duplicate copies of the same data across physical links when there are multiple receivers. This redundant transmission congests bottleneck links and increases end-to-end latency. We present MultiWrite, a novel many-to-many transmission semantic that eliminates redundant packets to directly reduce operator latency. MultiWrite adopts multicast principles while addressing critical limitations of traditional multicast for AI workloads, including heavy management plane overhead and ecosystem compatibility issues. We implement MultiWrite on Ascend NPUs. Long-term stress tests demonstrate that our MultiWrite-based operators reduce latency by up to 33% on commercially deployed devices.
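To make the redundancy argument concrete, the following is a minimal link-counting sketch, not the MultiWrite implementation. It assumes a hypothetical two-leaf, one-spine topology (the names `s0`, `leafA`, `spine`, `leafB`, `r1`..`r3` are illustrative only) and counts how many copies of one data chunk each physical link carries when the sender issues one unicast write per receiver, versus an idealized multicast-style delivery in which each link on the delivery tree carries the chunk once.

```python
from collections import Counter

def link_loads_unicast(paths):
    """Copies of one chunk per link when each receiver gets its own unicast write."""
    loads = Counter()
    for path in paths.values():
        for link in path:
            loads[link] += 1  # every receiver's copy crosses every link on its path
    return loads

def link_loads_multicast(paths):
    """Copies of one chunk per link under idealized multicast-style delivery."""
    loads = Counter()
    for path in paths.values():
        for link in path:
            loads[link] = 1  # a shared link carries the chunk exactly once
    return loads

# Hypothetical topology (an assumption for illustration): sender s0 sits under
# leafA, and three receivers sit under leafB, reached through a single spine.
shared = [("s0", "leafA"), ("leafA", "spine"), ("spine", "leafB")]
paths = {r: shared + [("leafB", r)] for r in ("r1", "r2", "r3")}

unicast = link_loads_unicast(paths)
multicast = link_loads_multicast(paths)

print("busiest link, unicast:  ", max(unicast.values()))
print("busiest link, multicast:", max(multicast.values()))
print("total link transmissions:", sum(unicast.values()), "vs", sum(multicast.values()))
```

In this toy model the shared links carry three copies under unicast and one under multicast-style delivery, which is the kind of redundancy on bottleneck links that the abstract describes. The model ignores packetization, congestion control, and reliability, so it says nothing about the measured latency gains.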