Matrix multiplication is a core operation in many scientific fields, including machine learning and computer graphics. The standard algorithm has a complexity of $\mathcal{O}(n^3)$ for $n\times n$ matrices. Strassen's algorithm reduces this to $\mathcal{O}(n^{\log_2 7}) \approx \mathcal{O}(n^{2.807})$, but the many extra additions it introduces limit its practicality for small and medium matrix sizes. This paper presents a novel FPGA-based implementation of Strassen's algorithm that outperforms an optimized General Matrix Multiply (GeMM) implementation for matrices as small as $n=256$. Our design, tested extensively on two high-performance FPGA accelerators (Alveo U50 and U280) across various data types, matches or surpasses the performance of a highly optimized baseline across a range of matrix sizes.
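To make the trade-off between multiplications and additions concrete, the following is a minimal software sketch of the textbook Strassen recursion. It is not the FPGA design described above. Each recursion level replaces the 8 block multiplications of the standard algorithm with 7, at the cost of 18 block additions and subtractions. The function names and the `cutoff` parameter (the size below which the sketch falls back to the standard algorithm) are assumptions introduced for illustration only.

```python
# Illustrative sketch: textbook Strassen recursion in pure Python.
# All names and the cutoff value are hypothetical, not taken from the paper.

def mat_add(A, B):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def naive_matmul(A, B):
    """Standard O(n^3) product, used as the base case."""
    Bt = list(zip(*B))  # columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def split(M):
    """Split a square matrix of even size into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen(A, B, cutoff=2):
    """Multiply square matrices A and B using Strassen's seven products."""
    n = len(A)
    # Fall back to the standard algorithm for small or odd sizes.
    if n <= cutoff or n % 2:
        return naive_matmul(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive block products (instead of eight).
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), cutoff)
    M2 = strassen(mat_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, mat_sub(B12, B22), cutoff)
    M4 = strassen(A22, mat_sub(B21, B11), cutoff)
    M5 = strassen(mat_add(A11, A12), B22, cutoff)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12), cutoff)
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22), cutoff)
    # Recombine the products into the four quadrants of the result.
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Calling `strassen(A, B)` on two square lists of lists returns the same product as `naive_matmul(A, B)`. In software, the extra additions and temporary blocks are what usually push the break-even point toward large $n$, which is why a cutoff of this kind is common.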