Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence. However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive, non-contiguous memory accesses for each token. To address this issue, we propose vector LUT (Vec-LUT), a new lookup paradigm that constructs a unified LUT across parallel tokens and performs a single $1 \rightarrow N$ lookup per index. To realize it efficiently, we further introduce two techniques: (1) Vector LUT-Centric Tensor Layout and (2) Cache-Aware Streamed Lookup. Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to $4.2\times$. Our implementation is integrated into llama.cpp. The code is available at https://github.com/OpenBitSys/vlut.cpp.
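To make the contrast between the two paradigms concrete, the following is a minimal C++ sketch under simplifying assumptions (hypothetical function and variable names, not the actual vlut.cpp kernels): each quantized weight group has been reduced to a table index, and the LUT entries hold precomputed partial dot products of the activations.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar LUT paradigm: one table per token. Every weight index is re-read
// for each of the N tokens, and the N lookups land in N separate tables,
// producing repetitive, non-contiguous memory accesses.
void scalar_lut_gemv(const uint8_t* w_idx, size_t n_groups,
                     const float* const* per_token_lut, // [N][2^g] tables
                     float* out, size_t N) {
    for (size_t t = 0; t < N; ++t)             // loop over parallel tokens
        for (size_t i = 0; i < n_groups; ++i)  // repeated index traversal
            out[t] += per_token_lut[t][w_idx[i]];
}

// Vector LUT paradigm: one unified table whose entries are N-wide vectors.
// Each weight index is read once, and a single lookup streams N contiguous
// floats -- the 1 -> N lookup.
void vector_lut_gemv(const uint8_t* w_idx, size_t n_groups,
                     const float* unified_lut, // [2^g][N], row-major
                     float* out, size_t N) {
    for (size_t i = 0; i < n_groups; ++i) {
        const float* row = unified_lut + (size_t)w_idx[i] * N;
        for (size_t t = 0; t < N; ++t)         // contiguous accumulate
            out[t] += row[t];
    }
}
```

In this sketch the scalar variant traverses the weight indices once per token and scatters its lookups across N tables, while the vector variant reads each index once and fetches N contiguous values per lookup, which is the $1 \rightarrow N$ access pattern the abstract describes.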