Tensor parallelism (TP) in large-scale LLM inference and training introduces frequent collective operations that dominate inter-GPU communication. While in-switch computing, exemplified by NVLink SHARP (NVLS), accelerates collective operations by reducing redundant data transfers, its communication-centric design introduces a mismatch between its communication mode and the memory-semantic requirements of LLM compute kernels. This mismatch isolates the compute and communication phases, resulting in underutilized resources and limited overlap in multi-GPU systems. To address this limitation, we propose CAIS, the first Compute-Aware In-Switch computing framework, which aligns communication modes with the memory-semantic requirements of computation. CAIS consists of three integral techniques: (1) a compute-aware ISA and microarchitecture extension to enable compute-aware in-switch computing; (2) merging-aware thread block (TB) coordination to improve temporal alignment for efficient request merging; and (3) a graph-level dataflow optimizer to achieve tight cross-kernel overlap. Evaluations on LLM workloads show that CAIS achieves a 1.38$\times$ average end-to-end training speedup over the SOTA NVLS-enabled solution and 1.61$\times$ over T3, the SOTA compute-communication overlap solution that does not leverage NVLS, demonstrating its effectiveness in accelerating TP on multi-GPU systems.