Core interface optimization for multi-core neuromorphic processors

Hardware implementations of Spiking Neural Networks (SNNs) are a promising approach to edge computing for applications that require low power and low latency and that cannot rely on external cloud-based computing services. However, most solutions proposed so far either support only relatively small networks or require significant hardware resources to implement large ones. Realizing large-scale, scalable SNNs requires an efficient asynchronous communication and routing fabric that enables multi-core architectures. In particular, the core interface that manages inter-core spike communication is a crucial component, because it is the Power-Performance-Area (PPA) bottleneck, especially in its arbitration architecture and routing memory. In this paper we present an arbitration mechanism based on hierarchical arbiter trees, together with the corresponding asynchronous encoding pipeline circuits. Compared to state-of-the-art arbitration architectures, the proposed scheme reduces latency by more than 70% in sparse-event mode at a lower area cost. The routing memory uses an asynchronous Content Addressable Memory (CAM) with Current Sensing Completion Detection (CSCD), which saves approximately 46% energy and achieves a 40% increase in throughput relative to a conventional asynchronous CAM that uses configurable delay lines, at the cost of only a slight increase in area. Beyond radically reducing the core interface resources in multi-core neuromorphic processors, the proposed arbitration and CAM architectures can also be applied to a wide range of general asynchronous circuits and systems.
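
To make the idea of a hierarchical arbiter tree concrete, the following is a minimal behavioral sketch in Python. It is not the circuit proposed in the paper. The function name `arbitrate`, the `fan_in` parameter, and the fixed lowest-index priority are assumptions made for illustration; real asynchronous arbiters resolve contention with mutual-exclusion elements rather than a fixed priority. The sketch only shows how a grant propagates down a tree and how the per-level choices together encode the address of the granted neuron.

```python
from typing import List, Optional, Tuple


def arbitrate(requests: List[bool], fan_in: int = 2) -> Optional[Tuple[int, int]]:
    """Behavioral model of a tree arbiter (illustrative assumption, not the paper's design).

    Returns (granted_index, tree_depth), or None when no request is active.
    Contention is resolved with fixed lowest-index priority, a simplification.
    """
    if not any(requests):
        return None

    # Propagate requests upward: a node requests if any of its children does.
    levels = [list(requests)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([any(prev[i:i + fan_in]) for i in range(0, len(prev), fan_in)])

    # Propagate the grant downward: at each level pick the first requesting child.
    # The sequence of choices forms the encoded address of the granted input.
    index = 0
    for level in reversed(levels[:-1]):
        children = level[index * fan_in:(index + 1) * fan_in]
        index = index * fan_in + children.index(True)

    return index, len(levels) - 1


# Sparse-event case: a single spike among 8 inputs.
spikes = [False] * 8
spikes[5] = True
print(arbitrate(spikes))
```

In this model the number of levels a request traverses grows logarithmically with the number of inputs, which is the structural property that makes tree arbiters scale; the latency and area figures quoted in the abstract depend on the circuit-level design and are not reproduced by this sketch.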
