D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs

The performance gains of large language models (LLMs) are closely tied to their substantial computational and memory requirements. Quantized LLMs, particularly extremely quantized models, offer significant advantages, motivating the development of specialized architectures to accelerate their workloads. This paper proposes D-Legion, a novel scalable many-core architecture built from many adaptive-precision systolic array cores, to accelerate matrix multiplication in quantized LLMs. The architecture consists of a set of Legions, each containing a group of adaptive-precision systolic arrays. D-Legion supports multiple computation modes, including quantized sparse and dense matrix multiplication. Block-structured sparsity is exploited within fully sparse or partially sparse windows. In addition, memory accesses for partial sums (psums) are reduced spatially through parallel accumulators. Furthermore, data reuse is maximized through optimized scheduling techniques that multicast matrix tiles across the Legions. A comprehensive design space exploration over Legion/core granularity is performed to determine the optimal Legion configuration. D-Legion is evaluated on attention workloads from two BitNet models, delivering up to 8.2$\times$ lower latency, up to 3.8$\times$ higher memory savings, and up to 3$\times$ higher psum memory savings compared to state-of-the-art work. D-Legion, with eight Legions and 64 total cores, achieves a peak throughput of 135.68 TOPS at a frequency of 1 GHz. A scaled version of D-Legion, with 32 Legions, is compared to Google TPUv4i, achieving up to 2.5$\times$ lower total latency, up to 2.3$\times$ higher total throughput, and up to 2.7$\times$ higher total memory savings.
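As a sanity check on the reported peak figure, the short sketch below converts 135.68 TOPS at 1 GHz into operations per clock cycle for the whole design, per Legion, and per core. It uses only the numbers stated above (eight Legions, 64 cores) and assumes the cores are distributed evenly across Legions, which the abstract implies but does not state. The function and variable names are illustrative, not taken from the paper. Under these assumptions the result is roughly 135,680 operations per cycle in total, or about 2,120 per core; how that maps to multiply-accumulate units depends on the precision mode and on how an "operation" is counted, which the abstract does not specify.

```python
# Back-of-the-envelope breakdown of the reported peak throughput.
# All inputs come from the abstract; the even split of cores across
# Legions is an assumption.

def ops_per_cycle(peak_tops: float, freq_ghz: float) -> float:
    """Convert a peak rate in TOPS at a given clock (GHz) to ops per cycle."""
    ops_per_second = peak_tops * 1e12
    cycles_per_second = freq_ghz * 1e9
    return ops_per_second / cycles_per_second

PEAK_TOPS = 135.68   # reported peak throughput
FREQ_GHZ = 1.0       # reported clock frequency
NUM_LEGIONS = 8      # reported Legion count
NUM_CORES = 64       # reported total core count

cores_per_legion = NUM_CORES // NUM_LEGIONS   # assumes an even split
total_ops = ops_per_cycle(PEAK_TOPS, FREQ_GHZ)
ops_per_legion = total_ops / NUM_LEGIONS
ops_per_core = total_ops / NUM_CORES

print(f"cores per Legion:     {cores_per_legion}")
print(f"ops/cycle (total):    {total_ops:.0f}")
print(f"ops/cycle per Legion: {ops_per_legion:.0f}")
print(f"ops/cycle per core:   {ops_per_core:.0f}")
```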
