The systolic accelerator is one of the premier architectural choices for DNN acceleration. However, the conventional systolic architecture suffers from low PE utilization due to the mismatch between the fixed array and diverse DNN workloads. Recent studies have proposed flexible systolic array architectures to adapt to DNN models. However, these designs support only coarse-grained reshaping or significantly increase hardware overhead. In this study, we propose ReDas, a flexible and lightweight systolic array that supports dynamic fine-grained reshaping and multiple dataflows. First, ReDas integrates lightweight and reconfigurable roundabout data paths, which achieve fine-grained reshaping using only short connections between adjacent PEs. Second, we redesign the PE microarchitecture and integrate a set of multi-mode data buffers around the array. The PE structure enables additional data bypassing and flexible data switching. Simultaneously, the multi-mode buffers facilitate fine-grained reallocation of on-chip memory resources, adapting to various dataflow requirements. ReDas can dynamically reconfigure to up to 129 different logical shapes and 3 dataflows for a 128x128 array. Finally, we propose an efficient mapper to generate appropriate configurations for each layer of DNN workloads. Compared to the conventional systolic array, ReDas can achieve about 4.6x speedup and 8.3x energy-delay product (EDP) reduction.
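The utilization mismatch described in the abstract can be illustrated with a rough sketch. It is not ReDas's mapper or its actual shape set: the padding-based utilization model, the function names (`utilization`, `candidate_shapes`, `best_shape`), and the restriction to power-of-two logical shapes are all assumptions made for illustration.

```python
from math import ceil


def utilization(m: int, n: int, rows: int, cols: int) -> float:
    """Fraction of PE slots doing useful work when an m x n block of work
    is tiled onto a rows x cols array and every pass is padded to full size.
    (Simplified model, assumed here; it ignores pipeline fill and drain.)"""
    passes = ceil(m / rows) * ceil(n / cols)
    return (m * n) / (passes * rows * cols)


def candidate_shapes(total_pes: int):
    """Hypothetical shape set: all rows x cols factorisations of total_pes
    whose row count is a power of two. ReDas's real shape set differs."""
    rows = 1
    while rows <= total_pes:
        if total_pes % rows == 0:
            yield rows, total_pes // rows
        rows *= 2


def best_shape(m: int, n: int, total_pes: int) -> tuple[int, int]:
    """Pick the logical shape with the highest modelled utilization."""
    return max(candidate_shapes(total_pes),
               key=lambda shape: utilization(m, n, *shape))


if __name__ == "__main__":
    m, n = 32, 512  # an illustrative skewed layer, not taken from the paper
    print(utilization(m, n, 128, 128))   # fixed 128x128 array
    print(best_shape(m, n, 128 * 128))   # best reshaped alternative
```

Under this model, a 32x512 block of work fills only a quarter of a fixed 128x128 array per pass, whereas a 32x512 logical shape built from the same 16384 PEs is fully occupied. This is the kind of gap that fine-grained reshaping aims to close.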