NEURON-Fabric: CXL-Side Low-Bit Gradient Aggregation for Distributed Training

In large-model distributed training, especially large language model workloads, gradient All-Reduce increasingly stresses the memory and communication path. This paper asks whether a Compute Express Link (CXL) memory controller can aggregate low-bit gradient signals as gradient cache lines pass through it, while preserving a 32-bit floating-point (FP32) path for workloads, layers, or phases that should not use low-bit approximation. We present NEURON-Fabric, a CXL-side controller architecture that performs packed gradient-binary (G-Binary) sign-count aggregation and gradient-ternary (G-Ternary) gated aggregation near CXL memory, with a control interface for selecting low-bit or FP32 paths. Cycle-level timing experiments show that the measured five-cycle low-bit aggregation datapath adds at most 1.67 percent exposed runtime overhead in the full last-level-cache miss regime; under bandwidth pressure, the same compute stage is hidden by CXL service time. Functional tests confirm byte-exact identity read-back, G-Binary sign-count aggregation, and G-Ternary gating. Training checks quantify the communication and accuracy tradeoff: low-bit aggregation remains close to FP32 on CIFAR-10/ResNet-18 and SST-2/DistilBERT, while full-path low-bit aggregation fails on CIFAR-100/ResNet-18. Layer-aware admission identifies the classifier head as sensitive; keeping the head on FP32 while applying low-bit aggregation to the backbone recovers most accuracy and reduces gradient traffic to 3.6-5.4 percent of the FP32 baseline. Hardware synthesis and FPGA place-and-route estimates suggest that the 512-bit aggregation datapath is small enough to be treated as a near-memory datapath extension rather than a separate accelerator-scale block.
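The two low-bit aggregation modes named above can be illustrated with a minimal software sketch. The abstract does not specify the bit encodings, the gating rule, or the tie-breaking policy, so everything here is an assumption made for illustration: the helper names (pack_signs, pack_gates, g_binary_aggregate, majority_sign, g_ternary_aggregate), the convention that a set bit means a non-negative gradient, the magnitude-threshold gate, and ties resolving to +1. It is not the controller's actual logic, only one plausible reading of "sign-count" and "gated" aggregation over packed words.

```python
# Illustrative sketch only: encodings, gating rule, and tie-breaking are
# assumptions, not the NEURON-Fabric specification.
LINE_BITS = 512  # datapath width named in the abstract


def pack_signs(grads):
    """Pack one sign bit per gradient into an int (1 = non-negative)."""
    assert len(grads) <= LINE_BITS
    word = 0
    for i, g in enumerate(grads):
        if g >= 0:
            word |= 1 << i
    return word


def pack_gates(grads, threshold):
    """Pack one gate bit per gradient (1 = magnitude reaches threshold)."""
    assert len(grads) <= LINE_BITS
    word = 0
    for i, g in enumerate(grads):
        if abs(g) >= threshold:
            word |= 1 << i
    return word


def g_binary_aggregate(sign_words, n_elems):
    """Per position, count how many workers have the sign bit set."""
    return [sum((w >> i) & 1 for w in sign_words) for i in range(n_elems)]


def majority_sign(counts, n_workers):
    """Reduce sign counts to +1/-1 by majority; ties go to +1 (assumed)."""
    return [1 if 2 * c >= n_workers else -1 for c in counts]


def g_ternary_aggregate(sign_words, gate_words, n_elems):
    """Per position, sum +1/-1 signs over workers whose gate bit is set."""
    out = []
    for i in range(n_elems):
        total = 0
        for s, g in zip(sign_words, gate_words):
            if (g >> i) & 1:  # gated-off elements contribute 0
                total += 1 if (s >> i) & 1 else -1
        out.append(total)
    return out
```

In this sketch each worker contributes one packed sign word (plus one gate word in the ternary case) per group of gradients, and the aggregation step only counts or sums bits, which is the kind of narrow integer operation that fits a short fixed-latency datapath.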
