Tensor processing units (TPUs) are among the most well-known machine learning (ML) accelerators, utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several advantages over conventional ML accelerators such as graphics processing units (GPUs), as they are designed specifically to perform the multiply-accumulate (MAC) operations required by the matrix-matrix and matrix-vector multiplies found extensively throughout the execution of deep neural networks (DNNs). These advantages include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow, consisting of either an input-, output-, or weight-stationary architecture. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, this work develops a reconfigurable-dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer at run-time. Our experiments thoroughly test the viability of the Flex-TPU, comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that the Flex-TPU design achieves a significant performance increase of up to 2.75x over a conventional TPU, with only minor area and power overheads.
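To illustrate the weight-stationary dataflow mentioned above, the following is a minimal sketch (not from the paper, all names hypothetical) of how a systolic array computes a matrix multiply one MAC at a time: each processing element (PE) holds a single weight in place while activations stream past it and partial sums accumulate.

```python
# Illustrative sketch of a weight-stationary systolic-array matrix
# multiply. Each (k, n) PE keeps weights[k][n] resident and performs
# one multiply-accumulate (MAC) per activation that streams through.
def weight_stationary_matmul(activations, weights):
    """Compute activations @ weights via per-PE MAC operations.

    activations: M x K list of lists (streamed through the array)
    weights:     K x N list of lists (held stationary, one per PE)
    returns:     M x N result, accumulated one MAC at a time
    """
    M, K, N = len(activations), len(weights), len(weights[0])
    out = [[0] * N for _ in range(M)]
    for k in range(K):              # PE rows
        for n in range(N):          # PE columns
            w = weights[k][n]       # weight stays resident (reused M times)
            for m in range(M):      # activations stream past the PE
                out[m][n] += activations[m][k] * w  # MAC
    return out
```

Note how each weight is loaded once and reused across all M activation rows; an input- or output-stationary dataflow would instead pin the activations or the partial sums, which is the per-layer choice the Flex-TPU makes reconfigurable.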