Transformer networks, driven by self-attention, are central to Large Language Models. In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks. We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable the parallel analog dot-product computations required for self-attention. The analog gain-cell circuits, however, introduce non-idealities and constraints that prevent the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm that achieves text-processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and five orders of magnitude, respectively, compared with GPUs, marking a significant step toward ultra-fast, low-power generative Transformers.
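The cached self-attention step described above can be sketched as follows. This is a minimal NumPy illustration of a single generation step with a KV cache, not the paper's analog implementation; the function and variable names are ours. The query-key dot products and the weighted sum over values are the operations the gain-cell arrays would compute in the analog domain.

```python
import numpy as np

def cached_attention_step(q, k_new, v_new, k_cache, v_cache):
    """One generation step of single-head self-attention with a KV cache.

    q:        (d,)   query projection of the new token
    k_new:    (d,)   key projection of the new token
    v_new:    (d,)   value projection of the new token
    k_cache:  (t, d) key projections of previous tokens
    v_cache:  (t, d) value projections of previous tokens
    """
    # Append the new token's projections; past tokens are never recomputed.
    k_cache = np.vstack([k_cache, k_new])
    v_cache = np.vstack([v_cache, v_new])

    # Dot products between the query and every cached key, then softmax.
    # In the proposed hardware these dot products run in parallel in-memory.
    scores = k_cache @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Attention output: weighted sum of cached values.
    out = weights @ v_cache
    return out, k_cache, v_cache
```

Each call grows the cache by one row, which mirrors why, on a GPU, the whole cache must be streamed from memory to SRAM at every step as the sequence lengthens.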