Modern processor architectures increasingly include matrix computation units designed to accelerate AI and high-performance computing applications. These units expose two fundamental instruction types: matrix multiplication and vector outer product. The outer product is the lighter of the two because its inputs are vectors, which both permits flexible algorithms beyond dense linear algebra and leaves more room for implementation optimization. Stencil computations, which are common in scientific and engineering applications, are typically expressed as nested loops over a grid. This paper introduces a novel stencil algorithm built on vector outer products. Unlike previous approaches, the algorithm is derived from the scatter-mode definition of a stencil and is formulated from the outset in terms of vector outer products. The implementation integrates a series of optimizations that improve memory reference patterns, execution pipeline efficiency, and data reuse, taking into account the available algorithmic options and the data shared among input vectors. Evaluation on a simulator shows that the proposed design achieves significant speedup over vectorized stencil algorithms.
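To make the scatter-mode idea concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a 2D box stencil of radius `r` with a `(2r+1) x (2r+1)` coefficient matrix `C`, and it uses `np.outer` as a software stand-in for a hardware outer-product instruction. The function names `stencil_gather` and `stencil_scatter_outer` are hypothetical. In gather mode each output point reads its neighbors; in scatter mode each input row is combined with one coefficient column through an outer product, and the resulting tile is accumulated into the output rows and columns it influences.

```python
import numpy as np


def stencil_gather(A, C):
    """Reference gather-mode stencil: each interior output reads its neighbors."""
    r = C.shape[0] // 2
    n, m = A.shape
    B = np.zeros_like(A)
    for i in range(r, n - r):
        for j in range(r, m - r):
            # B[i, j] = sum over (di, dj) of C[di + r, dj + r] * A[i + di, j + dj]
            B[i, j] = np.sum(C * A[i - r:i + r + 1, j - r:j + r + 1])
    return B


def stencil_scatter_outer(A, C):
    """Scatter-mode stencil written as accumulated vector outer products.

    Illustrative assumption: np.outer models one outer-product instruction.
    """
    r = C.shape[0] // 2
    n, m = A.shape
    # Padded accumulator: acc[i + r, j + r] holds output point (i, j),
    # so contributions that fall outside the grid need no bounds checks.
    acc = np.zeros((n + 2 * r, m + 2 * r), dtype=A.dtype)
    for p in range(n):                  # one input row at a time
        for dj in range(-r, r + 1):     # one coefficient column at a time
            # tile[k, q] = C[k, dj + r] * A[p, q], with di = k - r
            tile = np.outer(C[:, dj + r], A[p, :])
            # Input (p, q) with offset (di, dj) updates output (p - di, q - dj).
            # Reversing the tile rows maps di = r..-r onto padded rows p..p + 2r.
            acc[p:p + 2 * r + 1, r - dj:r - dj + m] += tile[::-1, :]
    B = np.zeros_like(A)
    # Keep only interior outputs, whose full neighborhoods lie inside the grid.
    B[r:n - r, r:m - r] = acc[2 * r:n, 2 * r:m]
    return B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 10))
    C = rng.standard_normal((3, 3))
    print(np.allclose(stencil_gather(A, C), stencil_scatter_outer(A, C)))
```

The sketch only demonstrates that the two formulations agree on interior points. It does not model the memory-access, pipeline, or data-reuse optimizations that the abstract describes.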