Both the execution efficiency and the energy consumption of general matrix multiplication (GEMM) on spatial accelerators are highly sensitive to mapping choices. However, the mapping space grows combinatorially, which makes it extremely challenging to obtain optimal mappings within an acceptable time budget. Existing approaches often lack global-optimality guarantees and become prohibitively slow as the mapping space grows. To address these limitations, we propose \textsc{GOMA}, a globally optimal GEMM mapping framework built on a geometric abstraction and analytical modeling, which solves efficiently while guaranteeing optimality. \textsc{GOMA} introduces, from first principles, a geometric abstraction for GEMM mapping, yielding an exact analytical energy objective that is highly accurate and takes $O(1)$ time to evaluate for any given mapping. \textsc{GOMA} then formulates mapping selection as an integer optimization problem under hardware and mapping constraints, using the analytical energy model as the objective to automate mapping search. \textsc{GOMA} can quickly compute a globally optimal mapping for any (GEMM workload, target hardware) pair, a first in mapping space exploration. Experiments across representative accelerators and large language model prefill workloads show that \textsc{GOMA} improves the energy--delay product (EDP) by $2.24$--$4.24\times$ over state-of-the-art mappers, while accelerating time-to-solution by $3.83$--$73.6\times$.
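To make the idea of "mapping selection as an integer optimization problem with a closed-form energy objective" concrete, the following is a minimal illustrative sketch. It is not \textsc{GOMA}'s model or solver: the one-level tiling, the output-stationary loop order, the energy coefficients, and the names `toy_energy` and `best_mapping` are all assumptions made for illustration, and the search is plain exhaustive enumeration over divisor tile sizes rather than an optimality-guaranteeing solver.

```python
from itertools import product


def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def toy_energy(M, N, K, tm, tn, tk, e_dram=200.0, e_buf=6.0, e_mac=1.0):
    """Hypothetical O(1) energy estimate for C[M,N] = A[M,K] @ B[K,N].

    Assumes one on-chip buffer level, tiles of size (tm, tn, tk) that divide
    (M, N, K), and an output-stationary order (k innermost), so each output
    tile is written to DRAM once while A and B tiles are re-fetched.
    The per-access energy coefficients are made-up placeholders.
    """
    a_traffic = M * K * (N // tn)   # A is re-read once per column tile of C
    b_traffic = K * N * (M // tm)   # B is re-read once per row tile of C
    c_traffic = M * N               # C is written back once
    macs = M * N * K
    dram = e_dram * (a_traffic + b_traffic + c_traffic)
    buf = e_buf * 3 * macs          # assume 3 buffer accesses per MAC
    return dram + buf + e_mac * macs


def footprint(tm, tn, tk):
    """Buffer words needed to hold one tile each of A, B, and C."""
    return tm * tk + tk * tn + tm * tn


def best_mapping(M, N, K, capacity):
    """Enumerate all divisor tilings that fit in the buffer; return the cheapest."""
    feasible = [
        (tm, tn, tk)
        for tm, tn, tk in product(divisors(M), divisors(N), divisors(K))
        if footprint(tm, tn, tk) <= capacity
    ]
    if not feasible:
        raise ValueError("no tiling fits in the given buffer capacity")
    # min() keeps the first minimum, so ties resolve in enumeration order.
    best = min(feasible, key=lambda t: toy_energy(M, N, K, *t))
    return toy_energy(M, N, K, *best), best


if __name__ == "__main__":
    energy, (tm, tn, tk) = best_mapping(8, 8, 8, capacity=32)
    print(f"tile=({tm}, {tn}, {tk}) energy={energy}")
```

Even in this toy setting the number of candidate tilings is the product of the divisor counts of the three dimensions, and it multiplies further with each additional memory level, loop order, and spatial dimension. This is the combinatorial growth that motivates replacing enumeration with an analytical objective and an integer-optimization solver.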