Transformers are at the core of modern AI. They rely heavily on matrix multiplication and, given their substantial memory and computational requirements, need efficient acceleration. Quantization plays a vital role in reducing memory usage, and it can also be exploited for computation by designing reconfigurable architectures that dynamically adjust precision to speed up matrix multiplication. This paper proposes ADiP, a novel adaptive-precision systolic array architecture designed for efficient matrix multiplication acceleration. The proposed architecture consists of $N \times N$ reconfigurable processing elements (PEs), along with shared shifters and accumulators. ADiP supports multiple computation modes, including symmetric single-matrix multiplication as well as asymmetric multi-matrix multiplication with a shared input matrix, thereby improving data reuse and PE utilization. By adapting to different precisions, ADiP achieves up to 4$\times$ higher throughput and up to 4$\times$ higher memory efficiency. Analytical models are developed for the ADiP architecture, covering latency and throughput for different architecture configurations. A comprehensive hardware design space exploration is demonstrated using a commercial 22 nm technology. Furthermore, ADiP is evaluated on Transformer-based workloads from the GPT-2 medium, BERT large, and BitNet-1.58B models, delivering a total latency improvement of up to 53.6% and a total energy improvement of up to 24.4% for attention workloads in the BitNet-1.58B model. At a 64$\times$64 size with 4,096 reconfigurable PEs, ADiP achieves peak throughputs of 8.192 TOPS, 16.384 TOPS, and 32.768 TOPS for 8-bit$\times$8-bit, 8-bit$\times$4-bit, and 8-bit$\times$2-bit operations, respectively.
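The peak-throughput figures quoted above can be reproduced with a short back-of-the-envelope calculation. The sketch below is illustrative only and rests on assumptions the abstract does not state: a 1 GHz clock, one multiply-accumulate (MAC) counted as two operations, and each PE completing 8/w MACs per cycle when the second operand is w bits wide. The function name `peak_tops` and its parameters are hypothetical.

```python
def peak_tops(n: int, weight_bits: int, clock_hz: int = 1_000_000_000) -> float:
    """Estimate peak throughput in TOPS for an n x n array of PEs.

    Assumptions (inferred, not taken from the paper):
      - clock_hz defaults to 1 GHz, chosen because it matches the quoted numbers
      - one MAC counts as 2 operations (one multiply, one add)
      - a PE sized for 8-bit x 8-bit work completes 8 // weight_bits MACs
        per cycle when the second operand is narrower
    """
    if weight_bits not in (2, 4, 8):
        raise ValueError("this sketch only covers 8-, 4-, and 2-bit operands")
    num_pes = n * n                       # 64 x 64 -> 4,096 PEs
    macs_per_pe_per_cycle = 8 // weight_bits
    ops_per_second = 2 * num_pes * macs_per_pe_per_cycle * clock_hz
    return ops_per_second / 1e12          # convert to tera-operations per second


if __name__ == "__main__":
    for bits in (8, 4, 2):
        print(f"8-bit x {bits}-bit: {peak_tops(64, bits):.3f} TOPS")
```

Under these assumptions the 64$\times$64 configuration yields 8.192, 16.384, and 32.768 TOPS for the three precisions, which is consistent with the up to 4$\times$ throughput gain claimed when moving from 8-bit to 2-bit operands.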