Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple

General Matrix Multiplication (GEMM) is the cornerstone of Deep Learning and HPC workloads; accordingly, academia and industry have heavily optimized this kernel. Modern platforms with matrix multiplication accelerators exhibit a high FLOP/Byte machine balance, which makes implementing optimal matrix multiplication challenging. On modern CPU platforms with matrix engines, state-of-the-art vendor libraries tune input tensor layouts, parallelization schemes, and cache blocking to minimize data movement across the memory hierarchy and maximize throughput. However, the best settings for these parameters depend strongly on the target platform (number of cores, memory hierarchy, cache sizes) and on the shapes of the matrices, making exhaustive tuning infeasible; in practice this leads to performance "glass jaws". In this work we revisit space filling curves (SFC) to alleviate this cumbersome tuning problem. SFC convert multi-dimensional coordinates (e.g. 2D) into a single dimension (1D) while keeping points that are close in the high-dimensional space close in the 1D order. We partition the matrix multiplication computation space using recent advancements in generalized SFC (Generalized Hilbert Curves), and we obtain platform-oblivious and shape-oblivious matrix multiplication schemes that exhibit an inherently high degree of data locality. Furthermore, we extend the SFC-based work partitioning to implement Communication-Avoiding (CA) algorithms that replicate the input tensors and provably minimize communication/data movement on the critical path. The integration of CA algorithms is seamless and yields compact code (~30 LOC), yet it achieves state-of-the-art results on multiple CPU platforms, outperforming vendor libraries by up to 2x (geometric-mean speedup) for a range of GEMM shapes.
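To make the locality argument concrete, the following minimal sketch orders the tiles of the output matrix C along a space filling curve and hands each worker one contiguous chunk of that 1D order. It is an illustration under stated assumptions, not the implementation from this work: it uses the classic Hilbert curve, which requires a square grid whose side is a power of two, instead of the Generalized Hilbert Curves mentioned above, and the names `hilbert_d2xy`, `partition`, and `panel_footprint` are hypothetical. The footprint metric (distinct A row-panels plus distinct B column-panels a worker touches) is likewise a simplified proxy for data movement.

```python
def hilbert_d2xy(n, d):
    """Map 1D index d in [0, n*n) to (x, y) on an n x n grid.

    Classic Hilbert curve; n is assumed to be a power of two.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the sub-square so consecutive pieces join end to end.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def partition(order, num_workers):
    """Split a 1D tile order into contiguous, near-equal chunks, one per worker."""
    total = len(order)
    return [
        order[w * total // num_workers:(w + 1) * total // num_workers]
        for w in range(num_workers)
    ]


def panel_footprint(tiles):
    """Distinct A row-panels plus distinct B column-panels needed for these C tiles."""
    rows = {i for i, _ in tiles}
    cols = {j for _, j in tiles}
    return len(rows) + len(cols)


n = 8          # C is split into an 8 x 8 grid of tiles (assumed for illustration)
workers = 4    # number of workers sharing the tiles (assumed for illustration)

hilbert_order = [hilbert_d2xy(n, d) for d in range(n * n)]
row_major_order = [(i, j) for i in range(n) for j in range(n)]

hilbert_footprints = [panel_footprint(c) for c in partition(hilbert_order, workers)]
row_major_footprints = [panel_footprint(c) for c in partition(row_major_order, workers)]

print("Hilbert   :", hilbert_footprints)
print("Row-major :", row_major_footprints)
```

On this 8 x 8 tile grid with 4 workers, each Hilbert chunk covers one 4 x 4 quadrant and therefore needs 4 row-panels of A and 4 column-panels of B (footprint 8), whereas each row-major chunk covers two full tile rows and needs 2 row-panels of A and all 8 column-panels of B (footprint 10). The partitioning code contains no platform-specific or shape-specific parameter, which is the property the SFC-based scheme relies on.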
