The exponential increase in Machine Learning (ML) model size and complexity has driven unprecedented demand for high-performance acceleration systems. As technology scaling enables the integration of thousands of computing elements onto a single die, the boundary between distributed and on-chip systems has blurred, making efficient on-chip collective communication increasingly critical. In this work, we present a lightweight, collective-capable Network on Chip (NoC) that supports efficient barrier synchronization alongside scalable, high-bandwidth multicast and reduction operations, co-designed for the next generation of ML accelerators. We introduce Direct Compute Access (DCA), a novel paradigm that grants the interconnect fabric direct access to the cores' computational resources, enabling high-throughput in-network reductions at a router area overhead of only 16.5%. Through in-network hardware acceleration, we achieve geomean speedups of 2.9x on multicast and 2.5x on reduction operations involving between 1 and 32 KiB of data. Furthermore, by keeping communication off the critical path in GEMM workloads, these features allow our architecture to scale efficiently to large meshes: compared to a baseline unicast NoC architecture, multicast and reduction support yield estimated performance gains of up to 3.8x and 2.4x, respectively, and estimated energy savings of up to 1.17x.