Full Stack Optimization of Transformer Inference: a Survey

Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami

Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design, all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study by applying the surveyed optimizations on Gemmini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements, compared to previous benchmark results on Gemmini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x speedup with a minimal performance degradation for Transformer inference.
