General Matrix Multiplication (GEMM) is a fundamental operation in scientific computing, and its performance and accuracy significantly affect the applications that depend on it. One such application is semidefinite programming (SDP), which often requires binary128 or higher-precision arithmetic to be solved stably. However, only some processors support binary128 arithmetic, which makes SDP solvers generally slow. In this study, we focus on accelerating binary128 GEMM on field-programmable gate arrays (FPGAs), which allow accelerators to be designed flexibly for the desired computations. On a recent high-performance FPGA, our binary128 GEMM designs achieved approximately 90 GFlops for large matrices, 147x faster than the computation executed on a recent CPU with 20 threads. Using our binary128 GEMM design on the FPGA, we successfully accelerated two numerical applications, LU decomposition and SDP problems, for the first time.