General matrix multiplication (GEMM) on spatial accelerators is highly sensitive to mapping choices: both execution efficiency and energy consumption depend strongly on the selected mapping. The mapping space, however, grows combinatorially, which makes it extremely challenging to find an optimal mapping within an acceptable time budget. Existing approaches typically lack global-optimality guarantees and become prohibitively slow as the mapping space grows. To address these limitations, we propose \textsc{GOMA}, a globally optimal GEMM mapping framework built on a geometric abstraction and analytical modeling, which solves efficiently while guaranteeing optimality. From first principles, \textsc{GOMA} introduces a geometric abstraction for GEMM mapping that yields an exact analytical energy objective, evaluable in $O(1)$ time for any given mapping. The resulting objective is highly accurate. \textsc{GOMA} then formulates mapping selection as an integer optimization problem under hardware and mapping constraints, with the analytical energy model as the objective, so that mapping search is fully automated. \textsc{GOMA} can quickly compute a globally optimal mapping for any (GEMM workload, target hardware) pair, a first in mapping space exploration. Experiments show that, across representative accelerators and large language model prefill workloads, \textsc{GOMA} improves the energy--delay product (EDP) by $2.24$--$4.24\times$ over state-of-the-art (SOTA) mappers while speeding up time-to-solution by $3.83$--$73.6\times$.
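To make the idea of "mapping selection as integer optimization over a closed-form energy objective" concrete, the following is a minimal toy sketch and not \textsc{GOMA}'s actual model or solver. It assumes a single on-chip buffer, tile sizes restricted to divisors of the GEMM dimensions, an output-stationary traffic model, and made-up energy constants; the names `energy`, `best_mapping`, `e_dram`, and `e_mac` are hypothetical. Each candidate is scored in $O(1)$ time, and exhaustive enumeration returns the global optimum of this toy model only.

```python
from itertools import product


def divisors(n):
    """All positive divisors of n (candidate tile sizes)."""
    return [d for d in range(1, n + 1) if n % d == 0]


def energy(M, N, K, tm, tn, tk, e_dram=200.0, e_mac=1.0):
    """Closed-form toy energy for one mapping (assumed model, O(1) to evaluate).

    Assumes output tiles stay on chip until fully accumulated, so:
      - A (M x K) is re-read once per column tile of the output,
      - B (K x N) is re-read once per row tile of the output,
      - C (M x N) is written exactly once.
    tk affects only the buffer footprint in this simplified model.
    """
    a_reads = M * K * (N // tn)
    b_reads = K * N * (M // tm)
    c_writes = M * N
    return e_dram * (a_reads + b_reads + c_writes) + e_mac * M * N * K


def best_mapping(M, N, K, capacity):
    """Enumerate integer tilings under a buffer-capacity constraint.

    Returns (cost, (tm, tn, tk)) for the minimum-energy feasible tiling,
    or None if no tiling fits. Ties keep the first candidate found.
    """
    best = None
    for tm, tn, tk in product(divisors(M), divisors(N), divisors(K)):
        # Hardware constraint: tiles of A, B, and C must fit on chip together.
        if tm * tk + tk * tn + tm * tn > capacity:
            continue
        cost = energy(M, N, K, tm, tn, tk)
        if best is None or cost < best[0]:
            best = (cost, (tm, tn, tk))
    return best


if __name__ == "__main__":
    print(best_mapping(8, 8, 8, capacity=32))
```

Brute-force enumeration is used here only because the toy space is tiny; a real mapper would need a solver or pruning strategy that scales to the combinatorial spaces described above.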