This paper advocates an intertwined design of the dense linear algebra software stack that breaks down the strict barrier between the high-level, blocked algorithms in LAPACK (Linear Algebra PACKage) and the low-level, architecture-dependent kernels in BLAS (Basic Linear Algebra Subprograms). Specifically, we propose customizing the GEMM (general matrix multiplication) kernel, which is invoked from the blocked algorithms for relevant matrix factorizations in LAPACK, to improve performance on modern multicore processors with hierarchical cache memories. To this end, we use an analytical model to dynamically adapt the cache configuration parameters of GEMM to the shape of the matrix operands. In addition, we support the flexible development of architecture-specific micro-kernels, which allows us to further improve the utilization of the cache hierarchy. Our experiments on two platforms, equipped with ARM (NVIDIA Carmel, Neon) and x86 (AMD EPYC, AVX2) multicore processors, demonstrate the benefits of this approach in terms of better cache utilization and, in general, higher performance. However, they also reveal a delicate balance between optimizing for multi-threaded parallelism and optimizing for cache usage.
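
To make the idea of shape-dependent cache configuration parameters concrete, the following is a minimal sketch, not the analytical model used in this work. It assumes the usual three-level GEMM blocking with parameters mc, nc and kc, and a simple capacity-only rule in which a fixed fraction of each cache level is reserved for one operand. The function name `select_blocking`, the `fill` fraction, and the default micro-kernel dimensions `mr` and `nr` are all hypothetical choices made for illustration; a real model would also account for cache associativity and replacement policy.

```python
def round_down(x, multiple):
    """Largest multiple of `multiple` that is <= x, but never below `multiple`."""
    return max(multiple, (x // multiple) * multiple)


def select_blocking(m, n, k, l1_bytes, l2_bytes, l3_bytes,
                    mr=8, nr=12, elem_bytes=8, fill=0.5):
    """Pick (mc, nc, kc) for C (m x n) += A (m x k) * B (k x n).

    Hypothetical capacity-only heuristic (an assumption, not the paper's model):
      * an mr x kc micro-panel of A and a kc x nr micro-panel of B share L1,
      * an mc x kc block of A occupies part of L2,
      * a  kc x nc panel of B occupies part of L3.
    Each limit is clamped to the actual operand shape.
    """
    # kc is bounded by the L1 capacity and by the operand dimension k.
    kc_cap = int(fill * l1_bytes) // ((mr + nr) * elem_bytes)
    kc = max(1, min(kc_cap, k))

    # With kc fixed, size mc so that the mc x kc block of A fits in L2.
    # A small k therefore yields a larger mc instead of leaving L2 idle.
    mc_cap = int(fill * l2_bytes) // (kc * elem_bytes)
    mc = round_down(min(mc_cap, m), mr)

    # Likewise, size nc so that the kc x nc panel of B fits in L3.
    nc_cap = int(fill * l3_bytes) // (kc * elem_bytes)
    nc = round_down(min(nc_cap, n), nr)

    return mc, nc, kc


if __name__ == "__main__":
    caches = dict(l1_bytes=32 * 1024, l2_bytes=1024 * 1024, l3_bytes=4 * 1024 * 1024)
    # Square GEMM versus a GEMM with small k, as in the trailing update
    # of a blocked factorization with algorithmic block size 64.
    print("square :", select_blocking(2000, 2000, 2000, **caches))
    print("small k:", select_blocking(2000, 2000, 64, **caches))
```

In this sketch, a GEMM with small k (the case that arises in the trailing update of a blocked factorization, where k equals the algorithmic block size) ends up with a larger mc than a square GEMM, because the cache space freed by the short kc is reassigned to the other dimension. A statically configured GEMM would keep mc fixed and leave part of the L2 cache unused.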