Routing, switching, and the interconnect fabric are essential components in implementing large-scale neuromorphic computing architectures. Although this fabric plays only a supporting role in the computation itself, for large AI workloads it ultimately determines overall system performance, including energy consumption and speed. In this paper, we offer a potential solution to this bottleneck by addressing two fundamental questions: (a) What computing paradigms are inherent in existing routing, switching, and interconnect systems, and how can they be used to implement a Processing-in-Interconnect ($π^2$) computing paradigm? and (b) How can a $π^2$ network be trained on standard AI benchmarks? To address the first question, we demonstrate that all operations required for typical AI workloads can be mapped onto delays, causality, time-outs, packet drops, and broadcast operations, all of which are already implemented in current packet-switching and packet-routing hardware. We then show that existing buffering and traffic-shaping embedded algorithms can be minimally modified to implement $π^2$ neuron models and synaptic operations. To address the second question, we show how a knowledge distillation framework can be used to train and cross-map well-established neural network topologies onto $π^2$ architectures without any degradation in generalization performance. Our analysis shows that the effective energy utilization of a $π^2$ network is significantly higher than that of other neuromorphic computing platforms; as a result, we believe the $π^2$ paradigm offers a more scalable architectural path toward brain-scale AI inference.
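To make the claimed mapping more concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation) of how a neuron-like operation could be emulated with the packet primitives named in the abstract: inputs arrive as timestamped packets, synaptic weights become added delays, and a time-out acts as both a packet-drop rule and a rectifying nonlinearity. All names here (Packet, delay_neuron, TIMEOUT) are hypothetical, introduced only for illustration.

```python
# Hypothetical sketch: a "delay neuron" built from packet delays and time-outs.
# Assumes a time-encoded representation where earlier arrival = larger value.
from dataclasses import dataclass

TIMEOUT = 10.0  # hypothetical cutoff: packets arriving later than this are dropped

@dataclass
class Packet:
    t: float  # arrival time encoding the input value

def delay_neuron(inputs: list[Packet], delays: list[float]) -> Packet | None:
    """Fire at the earliest delayed arrival; produce nothing past the time-out."""
    arrivals = [p.t + d for p, d in zip(inputs, delays)]  # weight-as-delay
    t_fire = min(arrivals)            # first-arrival (min) semantics
    if t_fire > TIMEOUT:              # time-out == packet drop == no output spike
        return None
    return Packet(t=t_fire)

# Example: two inputs, with weights encoded as delays of 1.0 and 3.5 time units.
out = delay_neuron([Packet(2.0), Packet(0.5)], [1.0, 3.5])
print(out)  # Packet(t=3.0): the first delayed arrival determines the firing time
```

The min-over-delayed-arrivals rule used here is one standard temporal-computing choice (min-plus, race-logic-style semantics) and is shown only as one plausible instance of computing with delays, time-outs, and packet drops; the paper's own $π^2$ neuron and synapse models may differ.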