Mixture-of-Experts (MoE) layers activate only a subset of model weights, called experts, for each input to improve model performance. MoE is particularly promising for deployment on processing-in-memory (PIM) architectures, because PIM can naturally hold each expert separately and offers large energy-efficiency benefits. However, PIM chips often suffer from large area overhead, especially in the peripheral circuits. In this paper, we propose an area-efficient in-memory computing architecture for MoE transformers. First, to reduce area, we propose a crossbar-level multiplexing strategy that exploits MoE sparsity: experts are deployed on crossbars, and multiple crossbars share the same peripheral circuits. Second, we propose expert grouping and group-wise scheduling methods to alleviate the load imbalance and contention overhead caused by this sharing. In addition, because the expert-choice router requires access to all hidden states during generation, we propose a gate-output (GO) cache that stores the necessary results and avoids expensive recomputation. Experiments show that our approaches improve the area efficiency of the MoE part by up to 2.2x compared to a state-of-the-art (SOTA) architecture. During generation, the cache improves performance and energy efficiency by 4.2x and 10.1x, respectively, compared to the baseline when generating 8 tokens. The total performance density reaches 15.6 GOPS/W/mm². The code is open source at https://github.com/superstarghy/MoEwithPIM.
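To make the gate-output cache idea concrete, the following is a minimal sketch of one possible reading of it, not the authors' implementation. In expert-choice routing, each expert picks its top-scoring tokens from the whole sequence, so a naive generation step would re-run the gate on every earlier hidden state. The sketch keeps the gate scores of earlier tokens so that each new token costs a single gate evaluation. The class name `GateOutputCache`, the `capacity` parameter, and the use of raw dot-product scores (no softmax) are assumptions made for illustration.

```python
import heapq


class GateOutputCache:
    """Hypothetical sketch: cache per-token gate scores so that expert-choice
    top-k selection can be refreshed without re-running the gate on old tokens."""

    def __init__(self, gate_weights, capacity):
        # gate_weights[e] is the (assumed) gate vector of expert e.
        self.gate_weights = gate_weights
        # capacity is the number of tokens each expert may select.
        self.capacity = capacity
        # scores[t][e] is the cached gate score of token t for expert e.
        self.scores = []
        # Counts gate evaluations, to show that old tokens are never re-scored.
        self.gate_calls = 0

    def _gate(self, hidden):
        # One gate evaluation: dot product of the hidden state with each gate vector.
        self.gate_calls += 1
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.gate_weights]

    def append(self, hidden):
        # Score only the newly generated token and store the result.
        self.scores.append(self._gate(hidden))

    def selected_tokens(self, expert):
        # Expert-choice routing: the expert takes its top-`capacity` tokens,
        # ranked using cached scores only.
        token_ids = range(len(self.scores))
        top = heapq.nlargest(
            self.capacity, token_ids, key=lambda t: self.scores[t][expert]
        )
        return sorted(top)


# Usage: two experts, three tokens arriving one at a time.
cache = GateOutputCache(gate_weights=[[1.0, 0.0], [0.0, 1.0]], capacity=2)
for hidden_state in [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]:
    cache.append(hidden_state)

picked_by_expert0 = cache.selected_tokens(0)
picked_by_expert1 = cache.selected_tokens(1)
```

In this sketch the gate runs once per token (three calls for three tokens), whereas recomputing scores for the full sequence at every step would take 1 + 2 + 3 = 6 calls. How the actual architecture stores and reads these results in hardware is not specified in the abstract and is not modeled here.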