The deployment of Large Language Models (LLMs) on edge devices is increasingly important for enhancing on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on such devices. However, low-bit LLMs require mixed-precision matrix multiplication (mpGEMM) between low-precision weights and high-precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing the weights for high-precision computation. This indirect approach can incur significant inference overhead. In this paper, we introduce T-MAC, an innovative lookup-table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the number of additions required. Specifically, T-MAC transforms conventional data-type-centric multiplication into bit-wise table lookups, enabling a unified and scalable mpGEMM solution. Our LUT-based kernels scale linearly with the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token-generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on the M2 Ultra, and 11 tokens/s on lower-end devices such as the Raspberry Pi 5, which significantly exceeds the average adult reading speed. T-MAC, with its LUT-based computing paradigm, paves the way for the practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency. The system is open-sourced at https://github.com/microsoft/T-MAC .
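To illustrate the bit-wise table-lookup idea in concrete terms, the following is a minimal NumPy sketch of LUT-based mpGEMM (shown for the GEMV case). It is not T-MAC's actual kernel: the group size `G = 4`, the bit-plane packing, and the names `precompute_lut` and `lut_mpgemv` are all illustrative assumptions. The key property it demonstrates is that high-precision activations are pre-aggregated into a table once, after which each weight bit-plane only performs lookups and additions, so cost scales linearly with the weight bit-width and no weight-activation multiplications occur.

```python
import numpy as np

G = 4  # activations per LUT group (an illustrative choice, not T-MAC's value)

def precompute_lut(activations):
    """For each group of G activations, tabulate the partial sums of
    every possible G-bit selection pattern (2^G entries per group)."""
    groups = activations.reshape(-1, G)                       # (num_groups, G)
    patterns = np.array([[(p >> i) & 1 for i in range(G)]
                         for p in range(2 ** G)], dtype=activations.dtype)
    return patterns @ groups.T                                # (2^G, num_groups)

def lut_mpgemv(weight_bit_planes, activations, bit_width):
    """Mixed-precision GEMV via table lookup. weight_bit_planes[b][r][g]
    is the G-bit LUT index taken from bit-plane b of output row r,
    activation group g. Each bit-plane pass only gathers table entries
    and accumulates them -- no weight-by-activation multiplications."""
    lut = precompute_lut(activations)
    num_groups = lut.shape[1]
    out = np.zeros(len(weight_bit_planes[0]))
    for b in range(bit_width):                  # one pass per weight bit
        for r, indices in enumerate(weight_bit_planes[b]):
            partial = sum(lut[indices[g], g] for g in range(num_groups))
            out[r] += partial * (1 << b)        # scale by bit significance
    return out
```

Because the table depends only on the activations, its precomputation cost is amortized over all output rows of the weight matrix, which is what makes the lookup approach pay off for the tall GEMV/GEMM shapes of LLM inference.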