X-Former: In-Memory Acceleration of Transformers

Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the attention mechanism, which assigns an importance score to every word relative to the other words in a sequence. However, these models are very large, often reaching hundreds of billions of parameters, and therefore require a large number of DRAM accesses. Hence, traditional deep neural network (DNN) accelerators such as GPUs and TPUs face limitations in processing Transformers efficiently. In-memory accelerators based on non-volatile memory (NVM) promise to be an effective solution to this challenge, since they provide high storage density while performing massively parallel matrix-vector multiplications (MVMs) within memory arrays. However, attention score computations, which are frequently used in Transformers (unlike CNNs and RNNs), require MVMs in which both operands change dynamically for each input. As a result, conventional NVM-based accelerators incur high write latency and write energy when used for Transformers, and further suffer from the low endurance of most NVM technologies. To address these challenges, we present X-Former, a hybrid in-memory hardware accelerator that consists of both NVM and CMOS processing elements to execute Transformer workloads efficiently. To improve the hardware utilization of X-Former, we also propose a sequence blocking dataflow, which overlaps the computations of the two processing elements and reduces execution time. Across several benchmarks, we show that X-Former achieves up to 85x and 7.5x improvements in latency and energy over an NVIDIA GeForce GTX 1060 GPU, and up to 10.7x and 4.6x improvements in latency and energy over a state-of-the-art in-memory NVM accelerator.
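
The distinction between static and dynamic operands can be illustrated with a minimal single-head self-attention sketch in plain Python. This is not the X-Former implementation; the function names, matrix sizes, and random values are assumptions made for illustration. The comments mark only which operand type each step has, since the abstract does not spell out how operations are assigned to the NVM and CMOS processing elements.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    result = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention (illustrative, hypothetical helper)."""
    # Static-operand products: w_q, w_k, w_v are fixed after training,
    # so they could be written into a memory array once and reused.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)

    # Dynamic-operand product: q and k both depend on the current input,
    # so neither side can be programmed into the array ahead of time.
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    probs = softmax_rows(scores)

    # Dynamic-operand product again: probs and v both change per input.
    return matmul(probs, v), probs


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of Gaussian values (illustrative data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model = 4, 8  # assumed toy sizes
w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))
x = random_matrix(seq_len, d_model, rng)
out, probs = self_attention(x, w_q, w_k, w_v)
```

In this sketch, an NVM-only design would have to rewrite array contents for every new input to compute the score and output products, which is the source of the write latency, write energy, and endurance concerns described above.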
