Collective communication incurs significant overhead in LLM workloads. Although overlapping communication with computation at the application level is a common strategy, it often requires substantial code modifications and is impractical for many workloads (e.g., tensor and expert parallelism). We present CCCL, a collective communication library with built-in compression that supports operations such as allreduce, alltoall, and send/recv without requiring any user-side changes, enabling seamless adoption in existing applications. CCCL tightly fuses its compression kernels to minimize memory accesses and integrates with NCCL to eliminate the data coalescing stage, making it fast enough (up to 3x NVLink bandwidth) to keep pace with communication. Our evaluation shows that CCCL improves end-to-end throughput in vLLM prefill-decode (PD) disaggregation workloads by up to 10.1% and microbenchmark throughput by up to 30%.
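To illustrate why compression speed matters relative to link bandwidth, the following is a minimal back-of-the-envelope sketch, not CCCL's implementation. It assumes compression is perfectly pipelined with the transfer, so the slower of the two stages bounds delivered throughput. The function names, the compression ratio of 1.25, and the normalized bandwidths are hypothetical values chosen for illustration and are not taken from the evaluation.

```python
def effective_throughput(link_bw: float, compress_bw: float, ratio: float) -> float:
    """Uncompressed bytes/s delivered when compression overlaps with transfer.

    link_bw     -- raw link bandwidth (bytes/s)
    compress_bw -- compressor throughput, measured on uncompressed input (bytes/s)
    ratio       -- compression ratio, original size / compressed size
    """
    if link_bw <= 0 or compress_bw <= 0 or ratio <= 0:
        raise ValueError("all parameters must be positive")
    # The link moves `ratio` uncompressed bytes per compressed byte sent,
    # but the pipeline cannot run faster than the compressor feeds it.
    return min(compress_bw, ratio * link_bw)


def speedup(link_bw: float, compress_bw: float, ratio: float) -> float:
    """Speedup over sending the data uncompressed at link_bw."""
    return effective_throughput(link_bw, compress_bw, ratio) / link_bw


if __name__ == "__main__":
    # Hypothetical: compressor at 3x link bandwidth, ratio 1.25 (assumed).
    print(speedup(link_bw=1.0, compress_bw=3.0, ratio=1.25))
    # Hypothetical: compressor slower than the link, so compression hurts.
    print(speedup(link_bw=1.0, compress_bw=0.5, ratio=2.0))
```

Under this simplified model, a compressor slower than the link reduces throughput regardless of the compression ratio, which is why a compression path running well above link bandwidth is a precondition for any gain.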