Transformer models serve as the backbone of many state-of-the-art language models, and most use the scaled dot-product attention (SDPA) mechanism to capture relationships between tokens. However, the straightforward implementation of SDPA has quadratic compute and memory complexity with respect to the sequence length. On processor architectures such as GPUs and TPUs, there is a robust body of prior work addressing this cost. However, little work has been done on non-processor architectures. In this work, we show how the architecture and execution model of Streaming Dataflow Accelerators can help tackle this challenge. First, we define abstract hardware that adopts a streaming execution model, and we implement a cycle-accurate simulator of this abstract hardware using the Dataflow Abstract Machine simulation framework. Second, we implement the naive SDPA algorithm on the abstract hardware and show that it requires intermediate memory that grows linearly with the sequence length N, i.e. O(N). Third, we modify the naive algorithm, taking inspiration from prior processor-oriented work, by reordering the multiplication and division operations. Finally, we map the modified algorithm to the abstract hardware and confirm that the implementation computes SDPA at full throughput while using only a constant amount (O(1)) of intermediate memory.
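To make the effect of reordering the multiplication and division concrete, the following is a minimal Python sketch for a single query row. It is an illustration only, not the paper's hardware mapping or simulator code: the function names `sdpa_row_naive` and `sdpa_row_reordered` are hypothetical, intermediate memory is modeled as ordinary Python variables rather than on-chip buffers, and the running-maximum rescaling usually added for numerical stability is omitted for brevity.

```python
import math


def _dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sdpa_row_naive(q, keys, values):
    """One SDPA output row, dividing by the row sum before multiplying by V.

    All N exponentiated scores must be buffered until the row sum is known,
    so the intermediate storage grows as O(N).
    """
    scale = 1.0 / math.sqrt(len(q))
    exps = [math.exp(scale * _dot(q, k)) for k in keys]  # O(N) buffer
    total = sum(exps)
    out = [0.0] * len(values[0])
    for e, v in zip(exps, values):
        w = e / total  # division first
        for i, x in enumerate(v):
            out[i] += w * x  # multiplication second
    return out


def sdpa_row_reordered(q, keys, values):
    """One SDPA output row, multiplying by V first and dividing once at the end.

    Each key/value pair is consumed as it streams in. Only a running sum and
    one accumulator vector are kept, so storage does not grow with N
    (the accumulator still scales with the value dimension).
    """
    scale = 1.0 / math.sqrt(len(q))
    total = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        e = math.exp(scale * _dot(q, k))
        total += e  # running row sum
        for i, x in enumerate(v):
            acc[i] += e * x  # multiplication first
    return [a / total for a in acc]  # single division at the end
```

Both functions compute the same row of softmax(QK^T / sqrt(d))V up to floating-point rounding; the difference lies only in how much state must be held before the first output element can be produced.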