Text generation is a compelling sub-field of natural language processing that aims to generate human-readable text from input words. In particular, decoder-only generative models, such as the generative pre-trained transformer (GPT), are widely used for text generation and have two major computational stages: summarization and generation. Unlike the summarization stage, which can process the input tokens in parallel, the generation stage is difficult to accelerate because it produces output tokens sequentially, one per iteration. Moreover, each iteration requires reading the entire model with little opportunity for data reuse. The workload of transformer-based text generation is therefore severely memory-bound, making external memory bandwidth the system bottleneck. In this paper, we propose a subarray-level processing-in-memory architecture named SAL-PIM, an HBM-based PIM architecture for the end-to-end acceleration of transformer-based text generation. The SAL-PIM architecture includes three architectural features. First, it exploits higher internal bandwidth by integrating multiple subarray-level arithmetic logic units with optimized data mapping schemes. Second, it adopts lookup table (LUT)-based linear interpolation to perform complex non-linear functions in PIM. Third, it accelerates end-to-end text generation inference on PIM. Furthermore, to validate the SAL-PIM architecture, we built a cycle-accurate simulator and implemented SAL-PIM's logic units in 28-nm CMOS technology. As a result, for input sizes from 32 to 128 and output sizes from 1 to 256, SAL-PIM achieves a maximum speedup of 4.72 times and an average speedup of 1.83 times over a server-level GPU for text generation based on the GPT-2 medium model.
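The LUT-based linear interpolation mentioned in the abstract can be illustrated with a minimal software sketch. This is not the SAL-PIM hardware design: the choice of GELU as the non-linear function, the table range `[-8, 8]`, the 65-entry table size, and all function names below are assumptions made purely for illustration. The idea is to sample the function at evenly spaced points once, then approximate it at run time with one table lookup pair, one multiply, and one add.

```python
import math


def gelu(x: float) -> float:
    """Reference GELU (tanh approximation); used only as an example non-linear function."""
    k = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(k * (x + 0.044715 * x ** 3)))


def build_lut(fn, lo: float, hi: float, entries: int):
    """Sample fn at `entries` evenly spaced points in [lo, hi]; return (table, step)."""
    step = (hi - lo) / (entries - 1)
    table = [fn(lo + i * step) for i in range(entries)]
    return table, step


def lut_interpolate(x: float, table, lo: float, step: float) -> float:
    """Approximate fn(x) by linear interpolation between two adjacent table entries.

    Inputs outside the table range are clamped to its ends, which is a
    simplification of this sketch, not a claim about the paper's design.
    """
    hi = lo + step * (len(table) - 1)
    x = min(max(x, lo), hi)
    pos = (x - lo) / step                  # fractional index into the table
    idx = min(int(pos), len(table) - 2)    # left neighbour, kept in bounds
    frac = pos - idx                       # distance from the left neighbour
    return table[idx] + frac * (table[idx + 1] - table[idx])


# Hypothetical parameters chosen for illustration only.
LO, HI, ENTRIES = -8.0, 8.0, 65
TABLE, STEP = build_lut(gelu, LO, HI, ENTRIES)


def max_abs_error(samples: int = 2001) -> float:
    """Largest |gelu(x) - interpolated(x)| over evenly spaced points in [LO, HI]."""
    worst = 0.0
    for i in range(samples):
        x = LO + (HI - LO) * i / (samples - 1)
        worst = max(worst, abs(gelu(x) - lut_interpolate(x, TABLE, LO, STEP)))
    return worst


if __name__ == "__main__":
    print("max abs error:", max_abs_error())
```

In this sketch the accuracy depends on the table spacing: halving `STEP` roughly quarters the worst-case error of a piecewise-linear fit, at the cost of a table twice as large.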