In this paper, we propose FusionCIM, an operator-fusion-driven compute-in-memory (CIM) accelerator architecture for efficient and scalable LLM inference. It has three key innovations: (1) a hybrid CIM pipeline architecture that maps the QK^T computation onto inner-product-based CIM (IP-CIM) and the PV aggregation onto outer-product-based CIM (OP-CIM) for efficient fusion of the two matrix multiplications; (2) a QO-stationary dataflow that eliminates repeated KV loading in CIM and K-matrix access in the buffer under transpose fusion, significantly improving on-chip data reuse; and (3) a pattern-aware online-softmax mechanism that exploits distribution regularities of attention scores to reduce the exponential rescaling overhead of non-linear fusion. Experimental results on the LLaMA-3 model show that FusionCIM achieves up to 3.86x energy saving and 1.98x speedup compared with prior state-of-the-art (SOTA) CIM-based designs, with a system-level energy efficiency of 29.4 TOPS/W.
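To make the fused operators concrete, the sketch below shows the standard online-softmax recurrence for a single query row, with the QK^T scores, the softmax, and the PV aggregation computed in one pass over the keys and values. It is a plain software illustration of the baseline technique, not FusionCIM's pattern-aware mechanism or its hardware mapping; the function names and the `max_updates` counter are hypothetical and introduced only for this example. The counter shows that the exponential rescaling of the accumulated state is needed only when the running maximum changes, which is the overhead the abstract refers to.

```python
import math


def fused_attention_row(q, K, V):
    """Attention output for one query row, in a single pass over K and V.

    Standard online softmax: keeps a running maximum, a running denominator,
    and an unnormalized output accumulator. Illustrative only.
    """
    m = -math.inf              # running maximum of the scores seen so far
    denom = 0.0                # running softmax denominator
    acc = [0.0] * len(V[0])    # running unnormalized output (P @ V)
    max_updates = 0            # how often the accumulated state was rescaled

    for k, v in zip(K, V):
        s = sum(qi * ki for qi, ki in zip(q, k))   # one element of q K^T
        if s > m:
            # New maximum: rescale everything accumulated under the old one.
            scale = math.exp(m - s)                # exp(-inf) == 0.0 on the first step
            denom *= scale
            acc = [a * scale for a in acc]
            m = s
            max_updates += 1
        p = math.exp(s - m)                        # unnormalized probability
        denom += p
        acc = [a + p * vi for a, vi in zip(acc, v)]

    return [a / denom for a in acc], max_updates


def reference_attention_row(q, K, V):
    """Unfused reference: all scores first, then softmax, then P @ V."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    mx = max(scores)
    w = [math.exp(s - mx) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]


q = [1.0, 0.5]
K = [[0.2, 0.1], [1.0, -0.3], [0.4, 0.2], [-0.5, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

fused_out, max_updates = fused_attention_row(q, K, V)
ref_out = reference_attention_row(q, K, V)
```

In this toy input the running maximum changes on only two of the four keys, so the other two steps need no rescaling. A mechanism that can anticipate where the maximum settles would, under this simplified view, skip more of those rescaling steps.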