In-network computing techniques, exemplified by NVLink SHARP (NVLS), offer a promising approach to addressing the communication bottlenecks in LLM inference by offloading collective operations, such as All-Reduce, to switches. However, the accelerator-centric architecture of NVLS suffers from two fundamental limitations: 1) it relies on GPU load instructions to trigger reduction operations, so data reduced in the switch must be transferred back to the initiating GPU rather than being broadcast directly, which introduces unnecessary communication overhead; 2) due to its architectural constraints, NVLS cannot offload operators that are not decomposable into memory-semantic instructions, such as the in-network quantization (INQ) proposed in this work. As a result, All-Reduce in NVLS must operate at FP16/BF16 precision, leading to substantial bandwidth waste. To address these limitations, we propose SCIN, the first switch-centric in-network architecture for shared-memory networks of AI accelerators, enabling both low-latency and high-bandwidth All-Reduce. Specifically, we introduce an in-switch accelerator (ISA) capable of initiating memory-semantic operations for in-network processing, together with a co-designed communication fabric that incurs negligible protocol overhead. By eliminating redundant data movement, SCIN delivers lower All-Reduce latency than NVLS. Moreover, by integrating a quantization module into the ISA, SCIN enables INQ for All-Reduce, reducing its precision to 8 bits and nearly doubling bandwidth with negligible accuracy loss. We also present a prototype of SCIN on a multi-FPGA system to demonstrate its feasibility and effectiveness. Experimental results show that our design accelerates All-Reduce by up to 8.7x for small messages and 3.8x for large messages, yielding up to 1.74x faster TTFT and 1.34x faster TPOT on LLaMA-2 models.
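To make the "nearly doubling bandwidth" claim concrete, the following is a minimal sketch of an All-Reduce whose transfers pass through 8-bit quantization, together with the per-link payload arithmetic. The quantization scheme is an assumption made for illustration only: symmetric int8 codes with one FP16 scale per block of 128 elements, applied on both the uplink and the downlink. The abstract does not specify SCIN's actual format or where each quantization step runs, and the names `BLOCK`, `SCALE_BYTES`, `quantize_block`, `quantized_all_reduce`, and `bandwidth_gain` are hypothetical.

```python
import random

BLOCK = 128       # hypothetical: number of elements sharing one scale
SCALE_BYTES = 2   # hypothetical: one FP16 scale transmitted per block


def quantize_block(values):
    """Symmetric int8 quantization of one block; returns (scale, codes)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Map int8 codes back to floating-point values."""
    return [scale * c for c in codes]


def quantized_all_reduce(per_device):
    """Sum equal-length vectors, sending every block as int8 plus a scale."""
    n = len(per_device[0])
    result = []
    for start in range(0, n, BLOCK):
        width = len(per_device[0][start:start + BLOCK])
        acc = [0.0] * width
        # Uplink: each device's block reaches the switch as int8 + scale.
        for vec in per_device:
            scale, codes = quantize_block(vec[start:start + BLOCK])
            for i, x in enumerate(dequantize_block(scale, codes)):
                acc[i] += x  # the switch accumulates at higher precision
        # Downlink: the reduced block is requantized before being broadcast.
        scale, codes = quantize_block(acc)
        result.extend(dequantize_block(scale, codes))
    return result


def bytes_per_element(quantized):
    """Per-link payload bytes per element in one direction."""
    return 1 + SCALE_BYTES / BLOCK if quantized else 2


def bandwidth_gain():
    """Effective bandwidth ratio of int8 transfers over FP16/BF16 transfers."""
    return bytes_per_element(False) / bytes_per_element(True)


if __name__ == "__main__":
    random.seed(0)
    devices = [[random.uniform(-1, 1) for _ in range(256)] for _ in range(4)]
    exact = [sum(col) for col in zip(*devices)]
    approx = quantized_all_reduce(devices)
    print("max abs error:", max(abs(a - e) for a, e in zip(approx, exact)))
    print("bandwidth gain:", bandwidth_gain())
```

Under these assumptions each element costs 1 + 2/128 bytes on the wire instead of 2, so the gain is about 1.97x rather than exactly 2x. The per-block scale is the overhead that keeps the figure just short of doubling.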