Neural network (NN) accelerators with multi-chip-module (MCM) architectures enable the integration of massive computation capability; however, they suffer from computing-resource underutilization and off-chip communication overheads. Traditional parallelization schemes for NN inference on MCM architectures, such as intra-layer parallelism and inter-layer pipelining, fail to overcome both challenges simultaneously, limiting the scalability of MCM architectures. We observe that existing works typically deploy layers separately rather than considering them jointly. This underexploited dimension forces compromises between system computation and communication, hindering optimal utilization, especially as hardware and software scale. To address this limitation, we propose Scope, a merged pipeline framework that incorporates this overlooked multi-layer dimension, achieving improved throughput and scalability by relaxing the tradeoffs among computation, communication, and memory costs. This new dimension, however, increases the complexity of design space exploration (DSE). To tackle this, we develop a series of search algorithms that reduce the search complexity from exponential to linear while identifying solutions ranking in the top 0.05% of performance. Experiments show that Scope achieves up to 1.73x throughput improvement with similar energy consumption for ResNet-152 inference compared with state-of-the-art approaches.