The proliferation of Transformer models is often constrained by the significant compute and memory-bandwidth demands of deployment. To address this, we present MXFormer, a novel hybrid weight-stationary Compute-in-Memory (CIM) accelerator that delivers high throughput and efficiency for fixed-model inference on large, short-sequence Transformers. The foundation of our architecture is the use of ultra-dense Charge-Trap Transistors (CTTs) in Microscaling MXFP4 CIM arrays, uniquely enabling on-chip storage of up to hundreds of millions of parameters in a Fully Weight Stationary (FWS) fashion. We introduce a statically partitioned design with 12 Transformer blocks connected by a deeply pipelined dataflow. Static-weight layers (MLPs and linear projections) execute on highly parallel analog CTT arrays using an MXFP4-native flow with per-block exponent alignment and a 10-bit SAR ADC. Dynamic computations are handled in fully accurate digital blocks that use MXFP-enabled systolic arrays for scaled dot-product attention and vector units for LayerNorm and FlashAttention-style Softmax. By eliminating all weight movement, the deeply pipelined MXFormer architecture achieves very high single-stream throughput and efficiency: 58,275 FPS on ViT-L/32 (dual-chip) and 41,269 FPS on ViT-B/16 (single-chip). MXFormer outperforms comparable state-of-the-art non-FWS digital, hybrid, and photonic Transformer accelerators by ~3.3x-60.5x in compute density and ~1.7x-2.5x in energy efficiency. Against FWS accelerators, MXFormer improves compute density by ~20.9x and resident weight storage density by ~2x, while preserving near-digital accuracy (<1% drop) without any model retraining.
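To make the MXFP4 per-block exponent alignment concrete, the following is a minimal numerical sketch of Microscaling FP4 quantization: each block of weights shares one power-of-two scale, and each element is rounded to the nearest FP4 (E2M1) value. The block size, E2M1 code set, and scale rule follow the general OCP Microscaling convention; the function name and details here are illustrative assumptions, not taken from the MXFormer implementation.

```python
import numpy as np

# All magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(block: np.ndarray) -> tuple[float, np.ndarray]:
    """Quantize one block of (typically 32) weights to MXFP4.

    Returns a shared power-of-two scale and the FP4-valued elements,
    so that scale * elements approximates the original block.
    """
    max_abs = np.max(np.abs(block))
    if max_abs == 0.0:
        return 1.0, np.zeros_like(block)
    # Shared scale: align the block's largest magnitude with E2M1's
    # largest exponent (max |E2M1| value is 6 = 1.5 * 2**2, so emax = 2).
    shared_exp = np.floor(np.log2(max_abs)) - 2
    scale = 2.0 ** shared_exp
    # Round each scaled element to the nearest representable E2M1 magnitude.
    scaled = block / scale
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1_LEVELS[None, :]), axis=1)
    return scale, np.sign(scaled) * E2M1_LEVELS[idx]

# Usage: quantize one hypothetical 32-element weight block and check the error.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=32)
scale, q = quantize_mxfp4_block(w)
print("max quantization error:", np.max(np.abs(w - scale * q)))
```

In a weight-stationary CIM array of the kind the abstract describes, the FP4 elements would reside in the analog CTT cells while the shared per-block exponent is applied digitally after the ADC, which is what makes the per-block alignment step cheap at inference time.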