Tensor computations, with matrix multiplication as the primary operation, are fundamental to data analysis, physics, machine learning, and deep learning. As the scale and complexity of data grow rapidly, the demand for tensor computation has increased significantly. To meet this demand, several research institutions have begun developing dedicated hardware for tensor computations. To further improve the computational performance of tensor processing units, we reexamine computation reuse, an issue that existing architectures have overlooked. Based on this analysis, we propose EN-TensorCore, a novel architecture that significantly reduces chip area and power consumption. Furthermore, our method is compatible with existing tensor processing architectures. We evaluated our method on prevalent microarchitectures; the results show average area-efficiency improvements of 8.7\%, 12.2\%, and 11.0\% for tensor computing units at computational scales of 256 GOPS, 1 TOPS, and 4 TOPS, respectively, with corresponding energy-efficiency improvements of 13.0\%, 17.5\%, and 15.5\%.