Modern transformer-based deep neural networks present unique technical challenges for effective acceleration in real-world applications. Beyond the vast number of linear operations required by their scale, modern transformer models are increasingly reliant on precise non-linear computations, which render traditional low-bitwidth quantization methods and fixed-dataflow matrix accelerators ineffective for end-to-end acceleration. To address the need to accelerate both linear and non-linear operations in a unified and programmable framework, this paper introduces TATAA. TATAA employs 8-bit integer (int8) arithmetic for quantized linear layer operations through post-training quantization, while relying on bfloat16 floating-point arithmetic to approximate the non-linear layers of a transformer model. The TATAA hardware features a transformable arithmetic architecture that supports both formats at runtime with minimal overhead, enabling it to switch between a systolic array mode for int8 matrix multiplications and a SIMD mode for vectorized bfloat16 operations. An end-to-end compiler is presented to enable flexible mapping from emerging transformer models to the proposed hardware. Experimental results indicate that our mixed-precision design incurs only a 0.14% to 1.16% accuracy drop compared with the pre-trained single-precision transformer models across a range of vision, language, and generative text applications. Our prototype implementation on the Alveo U280 FPGA achieves 2935.2 GOPS throughput on linear layers and a maximum of 189.5 GFLOPS for non-linear operations, outperforming related works by up to 1.45x in end-to-end throughput and 2.29x in DSP efficiency, while achieving 2.19x higher power efficiency than a modern NVIDIA RTX 4090 GPU.
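The mixed-precision scheme described above can be sketched numerically. The following is a minimal NumPy illustration, not the TATAA implementation: linear layers run as symmetric per-tensor int8 post-training quantization with int32 accumulation (the "systolic array" mode), and a non-linear layer (softmax here, chosen as an example) runs in bfloat16, which is emulated by truncating float32 values to their top 16 bits. All function names and the quantization recipe are illustrative assumptions, not details from the paper.

```python
import numpy as np

def to_bfloat16(x):
    # Emulate bfloat16: keep the sign, 8 exponent bits, and top 7 mantissa
    # bits of a float32 by zeroing the low 16 bits of its bit pattern.
    bits = np.ascontiguousarray(x.astype(np.float32)).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def quantize_int8(x):
    # Symmetric per-tensor post-training quantization (illustrative recipe).
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations
w = rng.standard_normal((64, 64)).astype(np.float32)  # weights

# Linear layer in "systolic array" mode: int8 operands, int32 accumulation,
# then dequantize back to floating point with the product of the two scales.
qx, sx = quantize_int8(x)
qw, sw = quantize_int8(w)
acc = qx.astype(np.int32) @ qw.astype(np.int32)
y = acc.astype(np.float32) * (sx * sw)

# Non-linear layer in "SIMD" mode: softmax evaluated in emulated bfloat16.
yb = to_bfloat16(y)
e = to_bfloat16(np.exp(yb - yb.max(axis=-1, keepdims=True)))
probs = to_bfloat16(e / e.sum(axis=-1, keepdims=True))
```

The int8 path keeps the matrix multiply cheap while the dequantize/requantize boundary is where the hardware would flip between the two arithmetic modes; the bfloat16 path preserves the dynamic range that non-linear functions such as softmax need.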