In-network computing techniques, exemplified by NVLink SHARP (NVLS), offer a promising approach to addressing the communication bottlenecks in LLM inference by offloading collective operations such as All-Reduce to switches. However, the accelerator-centric architecture of NVLS suffers from two fundamental limitations: 1) it relies on GPU load instructions to trigger in-switch reduction, which means that the data reduced in the switch must be transferred back to the initiating GPU rather than being broadcast directly, thereby introducing unnecessary communication overhead; 2) due to its architectural constraints, NVLS cannot offload operators that are not decomposable into memory-semantic instructions, such as the in-network quantization (INQ) proposed in this work. As a result, All-Reduce in NVLS during inference still operates at 16-bit precision, leading to substantial bandwidth waste. To address these limitations, we propose SCIN, the first switch-centric in-network architecture for multi-accelerator shared-memory networks, enabling both low-latency and high-bandwidth All-Reduce. Specifically, we introduce an in-switch accelerator (ISA) capable of directly accessing the memory regions in attached accelerators for in-network processing, together with a co-designed communication fabric that enables such access with negligible protocol overhead. SCIN delivers lower All-Reduce latency than NVLS by eliminating redundant data movement. Moreover, SCIN enables INQ for All-Reduce, reducing its precision to 8 bits and nearly doubling bandwidth with negligible accuracy loss. We also present a multi-FPGA prototype of SCIN to validate its feasibility and effectiveness. Simulation results for an 8-GPU system show that our design accelerates All-Reduce by up to 8.7x for small messages and 3.8x for large messages, yielding up to 1.74x TTFT speedup and 1.34x TPOT speedup on LLaMA-2 models.
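To make the bandwidth claim concrete, the following is a minimal sketch of an 8-bit All-Reduce with switch-side reduction. It is not the INQ scheme of this work, whose details the abstract does not give. It assumes a generic symmetric per-block quantizer, a hypothetical block size of 128 values, and one 16-bit scale per block, and it models the switch as an ordinary function. All names (`quantize`, `dequantize`, `switch_all_reduce`, `payload_ratio`, `BLOCK`) are illustrative.

```python
BLOCK = 128  # hypothetical group size: one 16-bit scale per 128 values


def quantize(values, block=BLOCK):
    """Symmetric per-block 8-bit quantization: returns (codes, scales)."""
    codes, scales = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        # Map the largest magnitude in the block to 127; fall back to 1.0
        # for an all-zero block so the division below is well defined.
        scale = max(abs(v) for v in chunk) / 127.0 or 1.0
        scales.append(scale)
        codes.extend(max(-127, min(127, round(v / scale))) for v in chunk)
    return codes, scales


def dequantize(codes, scales, block=BLOCK):
    """Recover approximate values from 8-bit codes and per-block scales."""
    return [c * scales[i // block] for i, c in enumerate(codes)]


def switch_all_reduce(shards):
    """Toy switch-side reduction: dequantize each accelerator's payload,
    sum at full precision, requantize once, and return the broadcast result."""
    received = [dequantize(*quantize(s)) for s in shards]
    total = [sum(col) for col in zip(*received)]
    return dequantize(*quantize(total))


def payload_ratio(n, block=BLOCK):
    """16-bit payload bytes divided by 8-bit payload bytes plus 16-bit scales.

    The sketch keeps scales as Python floats; they are only counted as
    2 bytes each here, which is an assumption about the wire format.
    """
    num_blocks = -(-n // block)  # ceiling division
    return (2 * n) / (n + 2 * num_blocks)
```

Under these assumptions the per-block scales are the only overhead on top of the 8-bit codes, so the payload shrinks by slightly less than 2x, which is one way to read "nearly doubling bandwidth". The ratio depends on the assumed block size: smaller blocks carry more scale overhead.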