Managing the high computational cost of iterative solvers for sparse linear systems is a well-known challenge in scientific computing. Scientific applications are also often limited by memory bandwidth, which makes it critical to optimize data locality and the efficiency of data movement. We extend the lattice QCD solver DD-$\alpha$AMG to handle multiple right-hand sides (rhs) in both the Wilson-Dirac operator evaluation and the GMRES solver, with and without odd-even preconditioning. To improve auto-vectorization, we introduce a flexible interface that supports various data layouts and implement a new data layout for better SIMD utilization. We evaluate our optimizations on both x86 and Arm clusters and observe similar speedups on each, demonstrating performance portability. A key contribution of this work is the performance analysis of our optimizations, which reveals the complexity introduced by architectural constraints and compiler behavior. Additionally, we explore different implementations that use SME, a new matrix instruction set for Arm, and provide an early assessment of its potential benefits.
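To illustrate why a multiple-rhs data layout can help auto-vectorization, the following is a minimal sketch and not the DD-$\alpha$AMG implementation. It assumes a toy periodic 1D nearest-neighbour stencil as a stand-in for the Wilson-Dirac operator, and the names `idx` and `stencil_multi_rhs` are hypothetical. The rhs index is stored fastest, so the innermost loop runs over contiguous memory with unit stride, and the neighbour indices are computed once per site and reused across all rhs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout helper: the rhs index runs fastest, so the values of
 * all right-hand sides at one lattice site are contiguous in memory. */
static inline size_t idx(size_t site, size_t r, size_t nrhs) {
    return site * nrhs + r;
}

/* Toy stand-in for an operator application (an assumption, not the
 * Wilson-Dirac operator): for each of nrhs vectors, compute
 *   y_i = diag * x_i - hop * (x_{i-1} + x_{i+1})
 * on a periodic 1D lattice with nsites sites. */
void stencil_multi_rhs(double *restrict y, const double *restrict x,
                       size_t nsites, size_t nrhs,
                       double diag, double hop) {
    assert(nsites > 0 && nrhs > 0);
    for (size_t i = 0; i < nsites; ++i) {
        /* Neighbour lookup is done once per site, shared by all rhs. */
        size_t im = (i + nsites - 1) % nsites;
        size_t ip = (i + 1) % nsites;
        /* Unit-stride loop over rhs: a natural target for SIMD. */
        for (size_t r = 0; r < nrhs; ++r) {
            y[idx(i, r, nrhs)] = diag * x[idx(i, r, nrhs)]
                               - hop * (x[idx(im, r, nrhs)]
                                        + x[idx(ip, r, nrhs)]);
        }
    }
}
```

Whether a compiler actually vectorizes the inner loop depends on the target architecture, the compiler flags, and how `nrhs` relates to the SIMD width.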