Spatial accelerators, composed of arrays of compute-memory integrated units, offer an attractive platform for deploying inference workloads with low latency and low energy consumption. However, fully exploiting their architectural advantages typically requires careful, expert-driven mapping of computational graphs to distributed processing elements. In this work, we automate this process by framing the mapping challenge as a black-box optimization problem. We introduce the first evolutionary, hardware-in-the-loop mapping framework for neuromorphic accelerators, enabling users without deep hardware knowledge to deploy workloads more efficiently. We evaluate our approach on Intel Loihi 2, a representative spatial accelerator featuring 152 cores per chip arranged in a 2D mesh. Our method achieves up to a 35% reduction in total latency compared to default heuristics on two sparse multi-layer perceptron networks. Furthermore, we demonstrate the scalability of our approach to multi-chip systems and observe up to a 40% improvement in energy efficiency, without explicitly optimizing for it.
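The evolutionary, hardware-in-the-loop scheme described above can be sketched as a simple loop: encode a candidate layer-to-core mapping, measure its fitness on the device, and evolve the population by selection and mutation. The sketch below is purely illustrative; the mapping encoding, the mesh layout (a hypothetical 19-wide grid), the mutation operator, and the synthetic `measure_latency` cost are all assumptions standing in for the framework's actual components, which would deploy each candidate to Loihi 2 and time real inferences.

```python
import random

NUM_CORES = 152  # cores per Loihi 2 chip, per the abstract


def measure_latency(mapping):
    """Stand-in for a hardware-in-the-loop measurement.

    In the real framework this would deploy `mapping` to the accelerator
    and time an inference batch; here we use a toy proxy that penalizes
    Manhattan distance between consecutive layers' cores on the 2D mesh.
    """
    def xy(core):
        return core % 19, core // 19  # hypothetical 19-wide mesh layout
    cost = 0.0
    for a, b in zip(mapping, mapping[1:]):
        (ax, ay), (bx, by) = xy(a), xy(b)
        cost += abs(ax - bx) + abs(ay - by)
    return cost


def mutate(mapping, rng):
    # Swap two layer-to-core assignments (preserves uniqueness).
    child = list(mapping)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def evolve(num_layers, generations=200, pop_size=16, seed=0):
    rng = random.Random(seed)
    # Each individual assigns one distinct core to each layer.
    pop = [rng.sample(range(NUM_CORES), num_layers) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=measure_latency)          # evaluate and rank
        survivors = pop[: pop_size // 2]       # truncation selection
        pop = survivors + [mutate(rng.choice(survivors), rng)
                           for _ in survivors]  # refill via mutation
    return min(pop, key=measure_latency)


best = evolve(num_layers=8)
print(measure_latency(best))
```

Because the black box is only queried for a fitness value, the same loop applies unchanged whether the objective is measured latency, energy, or any other on-device metric.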