We present a new formulation of parallel matrix multiplication (MM) designed to outperform the standard row-column code design. The algorithm is formulated in the MoA formalism (A Mathematics of Arrays) and combines an array view of hardware (dimension lifting), which extends indexing to physical memory and processing units, with a contiguous data layout derived from static transformations. This hardware-software model is thus a bridging model in the sense of Valiant's BSP. OpenACC code was derived from the normal form of the MoA expressions, with optimal block sizes computed from the static information of types and shapes. Experiments run on Nvidia V100 GPUs reveal energy consumption that is quadratic in N, i.e. linear in the size of the matrix (its N^2 elements). More generally, this approach may be an ideal way of formulating, optimizing, and mapping array algorithms to embedded hardware. This work builds upon recently published results of NREL scientists.
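To make the contrast with the row-column design concrete, the following is a minimal sketch in plain Python, not the OpenACC code derived in this work. It assumes square N x N matrices stored as flat row-major lists. The function names and the parameter `p` (a hypothetical number of processing units) are illustrative assumptions, and the block size here is simply `n // p` rather than an optimal value derived from types and shapes. The second function lifts the row axis from `n` to `(p, n // p)` and orders the loops so that the innermost accesses are contiguous.

```python
def matmul_row_column(a, b, n):
    """Standard row-column product on flat row-major lists of length n * n."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                # b is walked down a column, i.e. with stride n in memory
                s += a[i * n + k] * b[k * n + j]
            c[i * n + j] = s
    return c


def matmul_lifted(a, b, n, p):
    """Row-contiguous product with the row axis lifted from n to (p, n // p).

    p is a hypothetical count of processing units (an assumption of this
    sketch); it must divide n evenly.
    """
    if n % p != 0:
        raise ValueError("p must divide n")
    rows_per_unit = n // p
    c = [0.0] * (n * n)
    for unit in range(p):               # lifted axis: one iteration per unit
        for r in range(rows_per_unit):  # rows assigned to this unit
            i = unit * rows_per_unit + r
            for k in range(n):
                a_ik = a[i * n + k]
                for j in range(n):
                    # rows of b and c are both walked with stride 1
                    c[i * n + j] += a_ik * b[k * n + j]
    return c
```

Both functions compute the same product. In the second, the outer `unit` loop is the one that would be distributed across hardware, and each unit touches only contiguous rows of `a`, `b`, and `c`.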