Tensor algebra has recently seen significant application across a range of domains. Each tensor operator has its own computational workload and precision requirements. However, current general-purpose accelerators, such as VPUs, GPGPUs, and CGRAs, support tensor operators with poor energy and area efficiency. This paper presents an in-depth exploration of general-purpose accelerators for tensor processing. First, we identify a similarity between matrix multiplication and precision multiplication, and we develop a method for classifying tensor operators. Building on these two findings, we introduce the systolic architecture into a general-purpose accelerator and propose a new General Tensor Accelerator (GTA) with better area efficiency and data reuse. Furthermore, we construct a large hardware scheduling space spanning dataflow, precision, and array resizing. Our evaluation results show that GTA achieves 7.76X, 5.35X, and 8.76X higher memory efficiency and 6.45X, 3.39X, and 25.83X speedup over VPU, GPGPU, and CGRA, respectively.