Large language models (LLMs) have shown impressive performance on language tasks, but their large parameter counts and reliance on dense multiplications make them difficult to deploy on resource-constrained devices, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method that minimizes both weight and output-activation reparameterization errors. Additionally, based on the varying sensitivity of layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency than the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs. Code and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.
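To make the reparameterization concrete, the sketch below illustrates the core idea in NumPy under simplifying assumptions: each weight row is decomposed greedily into binary vectors with per-group scaling factors (a standard binary-coded quantization heuristic, not the paper's multi-objective optimizer), and the scaling factors are then rounded to powers of two so that applying them becomes a bit shift rather than a multiply. The function names (`bcq_quantize`, `shift_round`) are illustrative, not from the released codebase.

```python
import numpy as np

def bcq_quantize(w, num_bits=3):
    """Greedy binary-coded quantization: approximate w as
    sum_i alpha_i * b_i with b_i in {-1, +1}.
    At each step, alpha is the mean absolute residual and b its sign.
    This is a simplified sketch of the decomposition, not the paper's
    multi-objective optimizer."""
    residual = w.astype(np.float64).copy()
    alphas, bits = [], []
    for _ in range(num_bits):
        b = np.sign(residual)
        b[b == 0] = 1.0  # break ties so every entry is +/-1
        alpha = np.abs(residual).mean()
        alphas.append(alpha)
        bits.append(b)
        residual -= alpha * b
    return np.array(alphas), np.array(bits)

def shift_round(alphas):
    """Round each scaling factor to the nearest power of two, so that
    scaling an activation by alpha becomes a bit shift in hardware."""
    return 2.0 ** np.round(np.log2(alphas))

# Demo: quantize one weight row ("group") and reconstruct it.
rng = np.random.default_rng(0)
w = rng.normal(size=256)
alphas, bits = bcq_quantize(w, num_bits=3)

# With +/-1 binary matrices, the matvec B @ x needs only adds/subtracts;
# the power-of-two alphas contribute only shifts. NumPy still multiplies
# here, but the arithmetic being emulated is multiplication-free.
w_hat = (shift_round(alphas)[:, None] * bits).sum(axis=0)
err = np.abs(w - w_hat).mean()
```

In the actual kernels, the adds over the binary matrices are further accelerated by precomputing lookup tables over activation sub-vectors ("queries"), which this sketch does not model.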