Transformer neural networks, driven by self-attention mechanisms, are core components of foundational and Large Language Models. In generative transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks for long sequences. In this work, we propose a fast and energy-efficient hardware implementation of self-attention using analog in-memory computing based on gain cell memories. Volatile gain cell memories can be efficiently written to store new tokens during sequence generation, while performing analog signed-weight multiplications to compute the dot products required for self-attention. We implement Sliding Window Attention, which keeps memory of a finite set of past steps. A charge-to-pulse converter for array readout eliminates the need for analog-to-digital conversion between self-attention stages. Using a co-designed initialization algorithm to adapt pre-trained weights to gain cell non-idealities, we achieve NLP performance comparable to GPT-2 with minimal training iterations, despite hardware constraints. Our end-to-end hardware design includes digital controls and estimates of area, latency, and energy. The system reduces attention latency by up to two orders of magnitude and energy consumption by up to five orders of magnitude compared to GPUs, marking a significant step toward ultra-fast, low-power sequence generation in Large Language Models.
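The sliding-window attention over a rolling KV cache described above can be sketched in plain Python. This is a minimal functional sketch, not the hardware implementation: the window size, token dimensions, and helper names here are illustrative, and the analog gain-cell arrays effectively perform the dot products and eviction that this software model makes explicit.

```python
from collections import deque
import math

def sliding_window_attention(query, kv_cache):
    """Dot-product attention of one query against the cached (key, value) pairs.

    In the proposed hardware, these dot products are computed in analog
    inside the gain-cell array; this function models the same math digitally.
    """
    # Scaled dot-product scores against each cached key
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key, _ in kv_cache]
    # Numerically stable softmax over the window
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Weighted sum of cached values
    dim = len(kv_cache[0][1])
    return [sum(w * v[i] for w, (_, v) in zip(weights, kv_cache))
            for i in range(dim)]

WINDOW = 3  # illustrative window size, not the paper's
# deque(maxlen=...) evicts the oldest token automatically, mirroring how a
# volatile gain-cell window retains only a finite set of past steps.
cache = deque(maxlen=WINDOW)
for step in range(5):
    key = [float(step), 1.0]    # toy key projection for this token
    value = [float(step)]       # toy value projection for this token
    cache.append((key, value))  # each token's projections are written once
out = sliding_window_attention([1.0, 0.0], cache)
```

After five generation steps only the last three tokens remain in the cache, so the attention output is a softmax-weighted blend of their values; this is the recomputation-free pattern that the gain-cell arrays implement without moving the cache between GPU memory and SRAM.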