Modern computing workloads commonly involve matrix-matrix multiplication (mmul) as a core computing pattern. Coarse-Grained Reconfigurable Arrays (CGRAs) can support it flexibly and efficiently, since they combine operation-level reconfigurability with high energy efficiency. However, mapping computational kernels that include mmul with state-of-the-art compilation strategies often leads to suboptimal results, since the kernel's multi-dimensional structure hampers the discovery of its inherent parallelism and, ultimately, degrades runtime performance. Here, we take a different position: we introduce a specialized mmul CGRA kernel schedule, parametrizable across different CGRA sizes. We then describe a novel compilation methodology that adapts program representations to leverage it effectively, employing polyhedral transformations to analyze complex computational patterns and expose hidden mmul operations through loop reordering and splitting. The identified patterns are substituted with optimized assembly, while the remaining program sections are compiled independently. CGRA configurations are then generated, encompassing both the pre-compiled and the compiled parts. Our strategy maximizes resource utilization and, ultimately, runtime performance, even when mmul is not directly apparent in the source code. The experimental results show speedups of up to 9.1x across different benchmarks that contain hidden mmuls and across CGRA instances of various sizes.
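To illustrate the kind of source-level restructuring the abstract alludes to, the following C sketch shows a hypothetical fused kernel in which an mmul is interleaved with an unrelated reduction, and a version after loop splitting (fission) and reordering in which the mmul nest stands alone and could be swapped for a pre-scheduled CGRA kernel. The kernel names and the fused example are illustrative assumptions, not taken from the paper.

```c
#include <assert.h>
#include <string.h>

#define N 4

/* Hypothetical input: mmul buried inside a fused loop that also
   accumulates a per-row checksum (illustrative, not from the paper). */
void fused_kernel(const double A[N][N], const double B[N][N],
                  double C[N][N], double rowsum[N]) {
    for (int i = 0; i < N; i++) {
        rowsum[i] = 0.0;
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];  /* hidden mmul */
            rowsum[i] += C[i][j];              /* unrelated reduction */
        }
    }
}

/* After loop fission and reordering: the mmul nest is isolated
   (here in i-k-j order) and is a candidate for replacement by a
   specialized, pre-compiled CGRA schedule; the reduction remains
   as an independently compiled program section. */
void split_kernel(const double A[N][N], const double B[N][N],
                  double C[N][N], double rowsum[N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int i = 0; i < N; i++)        /* isolated mmul nest */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
    for (int i = 0; i < N; i++) {      /* remaining section */
        rowsum[i] = 0.0;
        for (int j = 0; j < N; j++)
            rowsum[i] += C[i][j];
    }
}
```

Both versions compute identical results; the transformation only changes the loop structure so that the mmul pattern becomes syntactically recognizable.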