The currently dominant AI/ML workloads, such as Large Language Models (LLMs), rely on the efficient execution of General Matrix-Matrix Multiplication (GEMM) operations. Thus, most systems are equipped with dedicated matrix hardware accelerators based on square Systolic Arrays (SAs) of Processing Elements (PEs). While this organization was effective for traditional Deep Neural Networks (DNNs), LLMs introduce input-dependent and highly skewed matrices, leading to underutilized SA resources. To address this challenge, we propose SISA (Scale-In Systolic Array), a novel SA architecture that partitions the traditional square array into horizontal rectangular slabs. With minimal overhead, SISA exposes parallelism through independently scheduled slabs for efficient execution of small or skewed matrix shapes, while retaining full-array operation for large GEMMs. SISA achieves up to 8.52x speedup and 93% energy-delay-product (EDP) reduction for representative LLMs compared to a state-of-the-art monolithic SA with the same number of PEs.
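To make the underutilization argument concrete, the following is a minimal back-of-the-envelope sketch, not the paper's evaluation model. It assumes the skewed GEMM dimension is mapped onto the array rows, that rows are reserved at slab granularity, and that any slabs left free are kept busy with other independent work. The function name `row_utilization`, the 128-row array, and the slab heights are hypothetical choices for illustration; fill/drain latency and memory bandwidth are ignored.

```python
from math import ceil


def row_utilization(m: int, array_rows: int, slab_rows: int) -> float:
    """Fraction of reserved PE rows that do useful work for a tile with m rows.

    Toy model (an assumption, not taken from the paper): rows are reserved in
    multiples of slab_rows. Setting slab_rows == array_rows models a
    monolithic array that must be reserved as a whole.
    """
    if m <= 0 or array_rows <= 0 or slab_rows <= 0:
        raise ValueError("all sizes must be positive")
    if array_rows % slab_rows != 0:
        raise ValueError("slab_rows must evenly divide array_rows")
    # Rows reserved for this tile, rounded up to whole slabs.
    reserved = ceil(m / slab_rows) * slab_rows
    return m / reserved


if __name__ == "__main__":
    ARRAY_ROWS = 128  # hypothetical array height
    for m in (8, 130):  # a skewed tile and one slightly larger than the array
        mono = row_utilization(m, ARRAY_ROWS, ARRAY_ROWS)
        slab = row_utilization(m, ARRAY_ROWS, 16)
        print(f"m={m}: monolithic={mono:.3f}, 16-row slabs={slab:.3f}")
```

Under these assumptions, a tile with 8 active rows uses 6.25% of a monolithic 128-row array but 50% of the rows it reserves when the array is split into 16-row slabs, while a tile that exactly fills the array is fully utilized in both cases.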