Attention for transformers is a critical workload that has recently received significant "attention" as a target for custom acceleration. Yet, while prior work succeeds in reducing attention's memory-bandwidth requirements, it creates load imbalance between the operators that comprise the attention computation (resulting in severe compute under-utilization) and requires on-chip memory that scales with sequence length (which is expected to grow over time). This paper ameliorates these issues, enabling attention with nearly 100% compute utilization, no off-chip memory traffic bottlenecks, and on-chip buffer size requirements that are independent of sequence length. The main conceptual contribution is to use a recently proposed abstraction -- the cascade of Einsums -- to describe, formalize, and taxonomize the space of attention algorithms that appear in the literature. In particular, we show how Einsum cascades can be used to infer non-trivial lower bounds on the number of passes a kernel must take through its input data, which has implications for either the required on-chip buffer capacity or the memory traffic. We show how this notion can be used to meaningfully divide the space of attention algorithms into several categories and use these categories to inform our design process. Based on the above characterization, we propose FuseMax -- a novel mapping and binding of attention onto a spatial array-style architecture. On attention, in an iso-area comparison, FuseMax achieves an average 6.7x speedup over the prior state-of-the-art, FLAT, while using 79% of the energy. Similarly, on full end-to-end transformer inference, FuseMax achieves an average 5.3x speedup over FLAT while using 83% of the energy.
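To make the cascade-of-Einsums framing concrete, the sketch below expresses numerically stable softmax attention as a sequence of Einsums and per-row reductions in NumPy. It is a minimal illustration of the pass structure the abstract alludes to, not the paper's FuseMax dataflow: the row-wise max and the row-wise sum of exponentials each constitute a separate pass over the score matrix, which is exactly the kind of multi-pass structure whose lower bounds the Einsum-cascade analysis formalizes. The function name and tensor shapes here are illustrative assumptions.

```python
import numpy as np

def attention_einsum_cascade(Q, K, V):
    """Softmax attention written as a cascade of Einsums and reductions.

    Q: (M, D) queries, K: (N, D) keys, V: (N, D) values.
    """
    d = Q.shape[-1]
    # Einsum 1: score matrix S[m, n] = sum_d Q[m, d] * K[n, d], scaled.
    S = np.einsum("md,nd->mn", Q, K) / np.sqrt(d)
    # Pass 1 over S: row-wise max, needed for numerically stable softmax.
    M = S.max(axis=-1, keepdims=True)
    # Pass 2 over S: exponentiate shifted scores, then row-wise sum.
    E = np.exp(S - M)
    D = E.sum(axis=-1, keepdims=True)
    # Einsum 2: output O[m, d] = sum_n (E[m, n] / D[m]) * V[n, d].
    return np.einsum("mn,nd->md", E / D, V)
```

Fused-attention kernels (and, per the abstract, FuseMax) restructure these passes, e.g. by maintaining running max and sum statistics, so that the score matrix never needs to be materialized at full sequence length on chip.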