Transformers are at the core of modern AI. They rely heavily on matrix multiplication and require efficient acceleration due to their substantial memory and computational demands. Quantization plays a vital role in reducing memory usage, and it can also be exploited on the compute side through reconfigurable architectures that accelerate matrix multiplication by dynamically adjusting precision. This paper proposes ADiP, a novel adaptive-precision systolic array architecture for efficient matrix multiplication acceleration. The proposed architecture consists of $N \times N$ reconfigurable processing elements (PEs), along with shared shifters and accumulators. ADiP supports multiple computation modes, including symmetric single-matrix multiplication as well as asymmetric multi-matrix multiplication with a shared input matrix, thereby improving data reuse and PE utilization. By adapting to different precisions, ADiP achieves up to 4$\times$ higher throughput and up to 4$\times$ higher memory efficiency. Analytical latency and throughput models are developed for different ADiP architecture configurations. A comprehensive hardware design space exploration is carried out in a commercial 22nm technology. Furthermore, ADiP is evaluated on Transformer-based workloads from the GPT-2 medium, BERT large, and BitNet-1.58B models, delivering a total latency improvement of up to 53.6% and a total energy improvement of up to 24.4% for attention workloads in the BitNet-1.58B model. At a 64$\times$64 array size with 4,096 reconfigurable PEs, ADiP achieves a peak throughput of 8.192 TOPS, 16.384 TOPS, and 32.768 TOPS for 8bit$\times$8bit, 8bit$\times$4bit, and 8bit$\times$2bit operations, respectively.
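To make the precision-throughput scaling concrete, the sketch below reproduces the abstract's peak-throughput figures. It is a minimal model, not the authors' implementation: the assumption that each PE splits into 2 or 4 parallel sub-multipliers for narrower operands, and the 1 GHz clock, are inferred from the reported numbers (8.192 TOPS = 4,096 PEs $\times$ 2 ops $\times$ 1 GHz) rather than stated in the abstract.

```python
# Minimal sketch of ADiP's peak-throughput arithmetic (not the authors' code).
# Assumptions: each PE performs one multiply-accumulate (2 ops) per cycle at
# 8bit x 8bit, and narrower second operands (4-bit, 2-bit) let each physical
# PE be reconfigured into 2 or 4 parallel sub-multipliers. The 1 GHz clock is
# inferred from the reported 8.192 TOPS at 8bit x 8bit.

N = 64            # systolic array dimension (N x N PEs)
CLOCK_HZ = 1e9    # assumed clock frequency (inferred, see above)
OPS_PER_MAC = 2   # one multiply + one accumulate

def peak_tops(weight_bits: int) -> float:
    """Peak throughput in TOPS for an 8bit x `weight_bits` operation."""
    subunits = 8 // weight_bits          # 1, 2, or 4 sub-multipliers per PE
    macs_per_cycle = N * N * subunits
    return macs_per_cycle * OPS_PER_MAC * CLOCK_HZ / 1e12

for bits in (8, 4, 2):
    print(f"8bit x {bits}bit: {peak_tops(bits):.3f} TOPS")
# -> 8.192, 16.384, 32.768 TOPS, matching the abstract's figures
```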