AI applications increasingly run on fast-evolving, heterogeneous hardware to maximize performance, but general-purpose libraries lag in supporting these features. Performance-minded programmers often build custom communication stacks that are fast but error-prone and non-portable. This paper introduces MSCCL++, a design methodology for developing high-performance, portable communication kernels. It provides (1) a low-level, performance-preserving primitive interface that exposes minimal hardware abstractions while hiding the complexities of synchronization and consistency, (2) a higher-level DSL for application developers to implement workload-specific communication algorithms, and (3) a library of efficient algorithms implementing the standard collective API, enabling adoption by users with minimal expertise. Compared to state-of-the-art baselines, MSCCL++ achieves geomean speedups of $1.7\times$ (up to $5.4\times$) for collective communication and $1.2\times$ (up to $1.38\times$) for AI inference workloads. MSCCL++ is used in production by multiple AI services provided by Microsoft Azure, and has also been adopted by RCCL, the GPU collective communication library maintained by AMD. MSCCL++ is open source and available at https://github.com/microsoft/mscclpp. Our two years of experience with MSCCL++ suggest that its abstractions are robust, enabling support for new hardware features, such as multimem, within weeks of development.