EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture

Large Language Models (LLMs) have achieved impressive performance across diverse domains but remain inefficient during the autoregressive decoding phase. Unlike the prefill stage, which is dominated by compute-bound GEMM operations, decoding executes a sequence of small GEMV-like computations that are memory-bound and underutilize modern accelerators. Weight-only vector quantization (VQ) has emerged as an effective compression technique: it clusters model weights into a shared codebook and replaces the original weight matrix with low-precision indices, enabling 2-bit-level weight compression. While this approach substantially reduces model size and memory bandwidth demand, it still suffers from two critical inefficiencies: low utilization of compute units during GEMV and frequent memory bank conflicts during codebook lookups. This paper presents EVA, an efficient vector-quantization-based architecture that addresses both the computational and the memory bottlenecks in LLM decoding. EVA builds on a simple yet effective insight that combines input-codebook computation with conflict-free memory access. Instead of reconstructing quantized weights from indices, EVA directly performs dot products between input vectors and the weight codebook, transforming LLM decoding from GEMV to GEMM computation. It then performs structured lookups from an intermediate output buffer, eliminating memory bank conflicts. We further design a hardware-software co-optimized architecture specialized for LLM decoding while remaining compatible with conventional prefill execution. Evaluations show that EVA achieves up to 11.17$\times$ speedup and 7.17$\times$ higher energy efficiency compared with the state-of-the-art lookup-based architecture, while preserving arithmetic precision after vector quantization. Our code is available at https://github.com/dbw6/Eva.git.
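To make the central idea concrete, the following is a minimal software sketch of the two computation orders described above, not the EVA hardware design or the code in the linked repository. It assumes a single shared codebook of fixed-length code vectors and a weight matrix whose rows are stored as codebook indices; the function names (`dequantize`, `gemv`, `codebook_first_gemv`) and the data layout are hypothetical and chosen only for illustration. The baseline rebuilds the dense weights and runs a GEMV, while the codebook-first path multiplies the input sub-vectors by the codebook once (a small GEMM) and then sums entries looked up from that intermediate buffer.

```python
def dequantize(codebook, indices):
    """Baseline: rebuild each dense weight row by concatenating code vectors."""
    return [[w for code in row for w in codebook[code]] for row in indices]


def gemv(weights, x):
    """Plain matrix-vector product over the reconstructed weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def codebook_first_gemv(codebook, indices, x):
    """Compute input-codebook dot products first, then look up and accumulate."""
    d = len(codebook[0])  # length of one code vector (assumed uniform)
    # Split the input into sub-vectors of length d, one per index column.
    groups = [x[g:g + d] for g in range(0, len(x), d)]
    # Small GEMM: (num_groups x d) times (d x num_codes).
    # partial[g][c] is the dot product of input group g with code vector c.
    partial = [
        [sum(a * b for a, b in zip(sub, code)) for code in codebook]
        for sub in groups
    ]
    # Each output element is a sum of lookups into the partial-product buffer.
    return [sum(partial[g][code] for g, code in enumerate(row)) for row in indices]


# Toy data (assumed values): 4 code vectors of length 2, 3 output rows, 8 inputs.
codebook = [[1, -2], [0, 3], [4, 1], [-1, -1]]
indices = [[0, 1, 2, 3], [3, 3, 0, 1], [2, 0, 1, 2]]
x = [1, 2, 3, 4, 5, 6, 7, 8]

y_ref = gemv(dequantize(codebook, indices), x)
y_eva = codebook_first_gemv(codebook, indices, x)
```

Both paths produce the same output vector. In this sketch the number of multiply-accumulates in the second path depends on the codebook size and the input length rather than on the number of output rows, which is what allows the work to be expressed as a dense matrix product followed by indexed additions.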
