Simulation code for conventional supercomputers serves as a reference for neuromorphic computing systems. The present bottleneck of distributed large-scale spiking neuronal network simulations is the communication between compute nodes. Communication speed seems limited by the interconnect between the nodes and the software library orchestrating the data transfer. Profiling reveals, however, that the time the compute nodes require between communication calls is highly variable. The bottleneck is in fact the waiting time for the slowest node. A statistical model explains total simulation time on the basis of the distribution of computation times between communication calls. A fundamental cure is to avoid communication calls: fewer calls require fewer synchronizations and reduce the variability of computation times across compute nodes. The organization of the mammalian brain into areas lends itself to such an optimization strategy. Connections between neurons within an area have short delays, but the delays of the long-range connections across areas are an order of magnitude longer. This suggests a structure-aware mapping of areas to compute nodes, allowing for a partition into more frequent communication between nodes simulating a particular area and less frequent global communication. We demonstrate a substantial performance gain on a real-world example. This work proposes a local-global hybrid communication architecture for large-scale neuronal network simulations as a first step in mapping the structure of the brain to the structure of a supercomputer. It challenges the long-standing belief that the bottleneck of simulation is the synchronization inherent in the collective calls of standard communication libraries. We provide guidelines for the energy-efficient simulation of neuronal networks on conventional computing systems and raise the bar for neuromorphic systems.
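The core of the statistical argument can be illustrated numerically: if each of the nodes must synchronize after every step, the wall time of a step is the maximum of the per-node computation times, not their mean; synchronizing only every k steps lets per-node fluctuations average out before the maximum is taken. The following is a minimal sketch under assumed parameters (node count, step count, and a gamma distribution of per-step computation times are illustrative choices, not values from the study):

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_steps = 1024, 1000

# Hypothetical per-node, per-step computation times (mean 1.0, variable).
times = rng.gamma(shape=4.0, scale=0.25, size=(n_steps, n_nodes))

# Synchronize after every step: each round waits for the slowest node.
t_every_step = times.max(axis=1).sum()

# Synchronize every k steps: per-node times are summed within a block
# before the maximum across nodes is taken, so variability averages out.
k = 10
blocks = times.reshape(n_steps // k, k, n_nodes).sum(axis=1)
t_every_k = blocks.max(axis=1).sum()

# Pure computation time, as if there were no waiting at all.
t_ideal = times.mean(axis=1).sum()

print(f"sync every step: {t_every_step:.0f}")
print(f"sync every {k} steps: {t_every_k:.0f}")
print(f"no waiting:      {t_ideal:.0f}")
```

The inequality `t_every_k <= t_every_step` holds for any sample, since the maximum of sums never exceeds the sum of maxima; the gap between either figure and `t_ideal` is the waiting time for the slowest node that the hybrid local-global scheme targets.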