In recent decades, High Performance Computing (HPC) hardware platforms have advanced significantly, delivering more processing power while keeping power consumption within reasonable limits. The Intelligence Processing Unit (IPU) is a new class of massively parallel processor, designed to accelerate parallel computations through a large number of processing cores and on-chip memory components interconnected via high-speed fabrics. Although IPUs are primarily tailored for machine learning applications and ship with several libraries for implementing neural networks, they can also execute traditional parallel programs such as matrix multiplication, subject to certain considerations and limitations. This paper presents an extensive analysis of matrix multiplication (MM) on an IPU, focusing on execution efficiency and memory usage, and compares the IPU against a GPU. Our findings indicate that IPUs can outperform modern GPUs, especially on skewed matrix multiplications, which are consistently challenging. To characterize this behavior, we examine matrices of various aspect ratios on an IPU and a Turing-class GPU (RTX 2080 Ti), and find that the IPU consistently delivers more robust performance on skewed matrices than the GPU.