Self-attention in Transformers generates dynamic operands that force conventional Compute-in-Memory (CIM) accelerators into costly non-volatile memory (NVM) reprogramming cycles, degrading throughput and stressing device endurance. Existing solutions either reduce NVM writes without eliminating them, through matrix decomposition or sparsity, or move attention computation to digital CMOS at the expense of NVM density. We present TrilinearCIM, a Double-Gate FeFET (DG-FeFET)-based architecture that uses back-gate modulation to realize a three-operand multiply-accumulate primitive, enabling in-memory attention computation without dynamic ferroelectric reprogramming. Evaluated on BERT-base (GLUE) and ViT-base (ImageNet and CIFAR), TrilinearCIM outperforms conventional CIM on seven of nine GLUE tasks while achieving up to 46.6\% energy reduction and 20.4\% latency improvement over conventional FeFET CIM, at the cost of a 37.3\% area overhead. To our knowledge, this is the first architecture to perform complete Transformer attention computation exclusively in NVM cores without runtime reprogramming.
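To make the idea of a three-operand multiply-accumulate concrete, the following NumPy sketch shows one way an attention score can be written as a trilinear form over a static weight matrix, so that the dynamic key matrix never has to be written into the array. This is an illustrative assumption rather than the paper's actual dataflow: the mapping of operands to front-gate input, stored weight, and back-gate modulation, and the names `trilinear_mac`, `attention_scores_trilinear`, and `attention_scores_reference`, are hypothetical.

```python
import numpy as np


def trilinear_mac(row_in, stored, gate_in):
    """Three-operand MAC: sum over m, d of row_in[m] * stored[m, d] * gate_in[d].

    Assumed (hypothetical) mapping: `stored` is a static weight held in the
    array, `row_in` is applied as the usual input, and `gate_in` plays the
    role of the back-gate modulation.
    """
    return np.einsum("m,md,d->", row_in, stored, gate_in)


def attention_scores_trilinear(X, W_q, W_k):
    """Scaled attention scores computed without materializing K = X @ W_k."""
    Q = X @ W_q  # queries come from static weights, so no array writes are implied
    n = X.shape[0]
    scores = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # s_ij = q_i . (W_k^T x_j) = sum_{m,d} x_j[m] * W_k[m, d] * q_i[d]
            scores[i, j] = trilinear_mac(X[j], W_k, Q[i])
    return scores / np.sqrt(W_k.shape[1])


def attention_scores_reference(X, W_q, W_k):
    """Conventional two-step computation that forms K explicitly."""
    Q, K = X @ W_q, X @ W_k
    return (Q @ K.T) / np.sqrt(W_k.shape[1])
```

Under these assumptions, both functions return the same score matrix; the trilinear version only ever reads the static `W_k`, which is the property that would let an array avoid runtime reprogramming.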