AI applications increasingly run on fast-evolving, heterogeneous hardware to maximize performance, but general-purpose libraries lag in supporting these features. Performance-minded programmers often build custom communication stacks that are fast but error-prone and non-portable. This paper introduces MSCCL++, a design methodology for developing high-performance, portable communication kernels. It provides (1) a low-level, performance-preserving primitive interface that exposes minimal hardware abstractions while hiding the complexities of synchronization and consistency, (2) a higher-level DSL for application developers to implement workload-specific communication algorithms, and (3) a library of efficient algorithms implementing the standard collective API, enabling adoption by users with minimal expertise. Compared to state-of-the-art baselines, MSCCL++ achieves geomean speedups of $1.7\times$ (up to $5.4\times$) for collective communication and $1.2\times$ (up to $1.38\times$) for AI inference workloads. MSCCL++ is used in production by multiple AI services provided by Microsoft Azure, and has also been adopted by RCCL, the GPU collective communication library maintained by AMD. MSCCL++ is open source and available at https://github.com/microsoft/mscclpp. Our two years of experience with MSCCL++ suggest that its abstractions are robust, enabling support for new hardware features, such as multimem, within weeks of development.