Tensor computations, with matrix multiplication as the primary operation, serve as the fundamental basis for data analysis, physics, machine learning, and deep learning. As the scale and complexity of data continue to grow rapidly, the demand for tensor computation has increased accordingly. To meet this demand, several research institutions have begun developing dedicated hardware for tensor computation. To further improve the computational performance of tensor processing units, we revisit the issue of computation reuse, which has been overlooked in existing architectures. We propose a novel EN-T architecture that reduces chip area and power consumption; moreover, our method is compatible with existing tensor processing units. We evaluated our method on prevalent microarchitectures; the results demonstrate average improvements in area efficiency of 8.7\%, 12.2\%, and 11.0\% for tensor computing units at computational scales of 256 GOPS, 1 TOPS, and 4 TOPS, respectively, along with energy efficiency improvements of 13.0\%, 17.5\%, and 15.5\%.