D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs

The performance gains obtained by large language models (LLMs) are closely linked to their substantial computational and memory requirements. Quantized LLMs, particularly extremely quantized models, offer significant advantages, motivating the development of specialized architectures to accelerate their workloads. This paper proposes D-Legion, a novel scalable many-core architecture built from many adaptive-precision systolic array cores, to accelerate matrix multiplication in quantized LLMs. The proposed architecture consists of a set of Legions, where each Legion contains a group of adaptive-precision systolic arrays. D-Legion supports multiple computation modes, including quantized sparse and dense matrix multiplications. Block-structured sparsity is exploited within fully sparse or partially sparse windows. In addition, memory accesses for partial sums (psums) are reduced spatially through parallel accumulators. Furthermore, data reuse is maximized through optimized scheduling techniques that multicast matrix tiles across the Legions. A comprehensive design space exploration is performed in terms of Legion/core granularity to determine the optimal Legion configuration. Moreover, D-Legion is evaluated on attention workloads from two BitNet models, delivering up to 8.2$\times$ lower latency, up to 3.8$\times$ higher memory savings, and up to 3$\times$ higher psum memory savings compared to state-of-the-art work. D-Legion, with eight Legions and 64 total cores, achieves a peak throughput of 135.68 TOPS at a frequency of 1 GHz. A scaled version of D-Legion, with 32 Legions, is compared to Google TPUv4i, achieving up to 2.5$\times$ lower total latency, up to 2.3$\times$ higher total throughput, and up to 2.7$\times$ higher total memory savings.
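To make the idea of block-structured sparsity with partial-sum accumulation concrete, the following is a minimal software sketch, not the D-Legion hardware dataflow. It assumes a tiled matrix multiplication in which any weight tile that is entirely zero is skipped, and each surviving tile contributes a psum that is accumulated into the output. The function name `block_sparse_matmul`, the `tile` parameter, and the use of weights from {-1, 0, 1} are illustrative assumptions, not details taken from the paper.

```python
def block_sparse_matmul(a, b, tile):
    """Multiply a (m x k) by b (k x n) tile by tile.

    Hypothetical sketch: tiles of b that are entirely zero are skipped,
    and each non-zero tile adds a partial sum (psum) to the output.
    Returns the output matrix and the number of skipped tiles.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    skipped = 0
    for k0 in range(0, k, tile):
        for n0 in range(0, n, tile):
            k1, n1 = min(k0 + tile, k), min(n0 + tile, n)
            # An all-zero weight tile contributes nothing, so skip it.
            if all(b[kk][nn] == 0
                   for kk in range(k0, k1) for nn in range(n0, n1)):
                skipped += 1
                continue
            for i in range(m):
                for nn in range(n0, n1):
                    # psum over this tile's slice of the reduction dimension
                    psum = 0
                    for kk in range(k0, k1):
                        psum += a[i][kk] * b[kk][nn]
                    out[i][nn] += psum
    return out, skipped


def dense_matmul(a, b):
    """Plain reference multiplication used to check the sketch."""
    return [[sum(a[i][kk] * b[kk][j] for kk in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Activations (2 x 4) and block-diagonal weights (4 x 4) with 2 x 2 tiles.
A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
B = [[1, -1, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, -1, 1]]

result, skipped_tiles = block_sparse_matmul(A, B, tile=2)
```

In this example two of the four weight tiles are all zero and are skipped, while the result still matches the dense reference. In hardware, skipping such tiles would save both the compute and the associated psum traffic; how D-Legion defines and schedules its sparse windows is specified in the paper itself, not by this sketch.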
