As the landscape of deep neural networks evolves, heterogeneous dataflow accelerators, in the form of multi-core architectures or chiplet-based designs, promise more flexibility and higher inference performance through scalability. So far, these systems exploit the increased parallelism either by coarsely mapping a single layer at a time across cores, which incurs frequent costly off-chip memory accesses, or by pipelining batches of inputs, which falls short of the demands of latency-critical applications. To alleviate these bottlenecks, this work explores a new fine-grained mapping paradigm, referred to as layer fusion, on heterogeneous dataflow accelerators through a novel design space exploration framework called Stream. Stream captures a wide variety of heterogeneous dataflow architectures and mapping granularities, and implements a memory- and communication-aware latency and energy analysis validated against three distinct state-of-the-art hardware implementations. As such, it facilitates a holistic exploration of architecture and mapping by strategically allocating the workload through constraint optimization. The findings demonstrate that integrating layer fusion with heterogeneous dataflow accelerators yields up to 2.2x lower energy-delay product for inference, addressing both energy consumption and latency concerns. The framework is available open-source at: https://github.com/kuleuven-micas/stream.
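To make the contrast between the two mapping paradigms concrete, the following minimal sketch compares the off-chip traffic of a coarse layer-by-layer schedule with a depth-first, layer-fused schedule. It is a conceptual toy example, not Stream's actual API; the two-layer workload, buffer size, and tile count are hypothetical values chosen for illustration.

```python
# Illustrative sketch (not Stream's API): off-chip traffic of layer-by-layer
# mapping vs. depth-first layer fusion for a hypothetical two-layer network.

# Feature-map sizes in bytes: input -> layer 1 -> act1 -> layer 2 -> act2
FMAP_BYTES = {"input": 1_048_576, "act1": 2_097_152, "act2": 524_288}
ONCHIP_BUFFER_BYTES = 262_144  # assumed per-core on-chip buffer capacity
NUM_TILES = 16                 # fused schedule splits feature maps into tiles


def layer_by_layer_offchip_traffic() -> int:
    """Each layer reads its input from and writes its output to off-chip DRAM,
    because the full intermediate feature map exceeds the on-chip buffer."""
    traffic = 0
    traffic += FMAP_BYTES["input"] + FMAP_BYTES["act1"]  # layer 1: read input, write act1
    traffic += FMAP_BYTES["act1"] + FMAP_BYTES["act2"]   # layer 2: read act1, write act2
    return traffic


def layer_fused_offchip_traffic() -> int:
    """With layer fusion, each tile is pushed through both layers back-to-back,
    so the intermediate act1 tile stays on-chip and never touches DRAM."""
    traffic = 0
    for _ in range(NUM_TILES):
        in_tile = FMAP_BYTES["input"] // NUM_TILES
        mid_tile = FMAP_BYTES["act1"] // NUM_TILES
        out_tile = FMAP_BYTES["act2"] // NUM_TILES
        assert mid_tile <= ONCHIP_BUFFER_BYTES, "intermediate tile must fit on-chip"
        traffic += in_tile + out_tile  # only the network edges go off-chip
    return traffic


if __name__ == "__main__":
    lbl = layer_by_layer_offchip_traffic()
    fused = layer_fused_offchip_traffic()
    print(f"layer-by-layer off-chip traffic: {lbl / 1e6:.2f} MB")
    print(f"layer-fused    off-chip traffic: {fused / 1e6:.2f} MB")
    print(f"reduction: {lbl / fused:.1f}x")
```

Under these toy assumptions the fused schedule cuts off-chip traffic by roughly 3-4x, which is the mechanism by which layer fusion reduces both energy and latency on multi-core dataflow accelerators.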