Routing, switching, and the interconnect fabric are essential components of large-scale neuromorphic computing. Although this fabric plays only a supporting role in the computation itself, for large AI workloads it ultimately determines energy consumption and speed. In this paper, we address this bottleneck by asking: (a) What computing paradigms are inherent in existing routing, switching, and interconnect systems, and how can they be used to implement a processing-in-interconnect ($\pi^2$) computing paradigm? and (b) Given current and future interconnect trends, how will the performance of a $\pi^2$ system scale compared to other neuromorphic architectures? For (a), we show that the operations required for typical AI workloads can be mapped onto delays, causality, time-outs, packet drops, and broadcast operations, all of which are primitives already implemented in packet-switching and packet-routing hardware. We also show that existing embedded buffering and traffic-shaping algorithms can be leveraged to implement neuron models and synaptic operations. Additionally, a knowledge-distillation framework can train and cross-map well-established neural network topologies onto $\pi^2$ without degrading generalization performance. For (b), analytical modeling shows that, unlike other neuromorphic platforms, the energy scaling of $\pi^2$ improves with interconnect bandwidth and energy efficiency. We predict that, by leveraging trends in interconnect technology, a $\pi^2$ architecture can be more easily scaled to execute brain-scale AI inference workloads with power consumption in the range of hundreds of watts.
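To make the mapping in (a) more concrete, the following is a minimal sketch of how a neuron-like operation might be built from delays, causality, time-outs, packet drops, and broadcast. It is an illustration under assumptions, not the construction used in the paper: the abstract does not specify the mapping, and the names `delay_neuron`, `broadcast`, `arrival_times`, `delays`, and `timeout` are hypothetical. Here each synapse is modeled as a per-link delay, the neuron forwards the earliest delayed packet (causality), and an output that would occur after the time-out is dropped.

```python
from typing import List, Optional, Sequence


def delay_neuron(
    arrival_times: Sequence[Optional[float]],
    delays: Sequence[float],
    timeout: float,
) -> Optional[float]:
    """Hypothetical packet-timing neuron (an assumption, not the paper's model).

    Returns the time of the output packet, or None if the packet is dropped.
    """
    # Synaptic operation: each link adds its own delay to the packet's
    # arrival time. None means no packet arrived on that link.
    delayed = [t + d for t, d in zip(arrival_times, delays) if t is not None]

    # No input packets at all: nothing to forward.
    if not delayed:
        return None

    # Causality: the output is triggered by the first delayed packet.
    first = min(delayed)

    # Time-out and packet drop: outputs past the deadline are discarded.
    return first if first <= timeout else None


def broadcast(out_time: Optional[float], fanout: int) -> List[Optional[float]]:
    """Copy the output packet (or its absence) to every downstream link."""
    return [out_time] * fanout


if __name__ == "__main__":
    out = delay_neuron([0.0, 1.0, None], [3.0, 1.5, 0.5], timeout=5.0)
    print(out)
    print(broadcast(out, 3))
```

In this sketch the "weights" are the entries of `delays`, so a shorter delay gives an input more influence over when the output packet is emitted. Whether the paper's neuron models take this form is not stated in the abstract.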