Current architectures are equipped with matrix computation units to accelerate AI and high-performance computing applications. Matrix multiplication and the vector outer product are two basic instruction types. The latter is lighter because its inputs are vectors, so it offers more opportunities to develop flexible algorithms for problems beyond dense linear algebra and more room to optimize the implementation. Stencil computations are a common class of nested loops in scientific and engineering applications. This paper proposes a novel stencil algorithm based on vector outer products. Unlike previous work, the new algorithm is derived from the scatter-mode definition of a stencil and is expressed from the outset as formulas of vector outer products. The implementation incorporates a set of optimizations that improve the memory reference pattern, execution pipeline, and data reuse by considering various algorithmic options and the data sharing between input vectors. Evaluation on a simulator shows that our design achieves a substantial speedup over a vectorized stencil algorithm.
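The scatter-mode formulation mentioned above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the zero-padding boundary treatment, and the plain-Python `outer` helper (standing in for a hardware outer-product instruction) are all assumptions made for illustration. In scatter mode, each input point adds a weighted copy of itself to its neighbours in the output; for one input row and one column offset, those contributions form the outer product of a column of stencil weights and the input row.

```python
def outer(u, v):
    """Outer product of two vectors (stand-in for a hardware instruction)."""
    return [[a * b for b in v] for a in u]


def stencil_gather(grid, weights):
    """Reference: each output point gathers its weighted neighbours.

    Points outside the grid are treated as zero (an assumed boundary rule).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    r = len(weights) // 2
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    si, sj = i - di, j - dj
                    if 0 <= si < n_rows and 0 <= sj < n_cols:
                        out[i][j] += weights[di + r][dj + r] * grid[si][sj]
    return out


def stencil_scatter_outer(grid, weights):
    """Scatter mode: out[i+di][j+dj] += weights[di+r][dj+r] * grid[i][j].

    For a fixed input row i and column offset dj, the updates to output rows
    i-r..i+r are the outer product of weights[:, dj+r] and grid[i].
    """
    n_rows, n_cols = len(grid), len(grid[0])
    r = len(weights) // 2
    # Pad the output by r on every side so tiles accumulate without bounds checks.
    padded = [[0.0] * (n_cols + 2 * r) for _ in range(n_rows + 2 * r)]
    for i in range(n_rows):
        for dj in range(-r, r + 1):
            weight_col = [weights[a][dj + r] for a in range(2 * r + 1)]
            tile = outer(weight_col, grid[i])  # shape (2r+1) x n_cols
            for a in range(2 * r + 1):
                for j in range(n_cols):
                    padded[i + a][j + dj + r] += tile[a][j]
    # Crop the padding to recover the output on the original grid.
    return [row[r:r + n_cols] for row in padded[r:r + n_rows]]
```

On integer-valued inputs the two functions return identical grids, which gives a simple check that the outer-product accumulation reproduces the usual gather-mode stencil. A real implementation would keep the accumulation tile in the matrix unit's registers rather than materialising it.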