As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve impressive speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. With offloaded parameters, the inference engine can process batches of hundreds or thousands of tokens in the same time it takes to process just one token, making it a natural fit for speculative decoding. We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families. It exploits the high spikiness of token probability distributions in modern LLMs and the high degree of alignment between the draft and target models' output probabilities. SpecExec takes the most probable token continuations from the draft model to build a "cache" tree for the target model, which is then validated in a single forward pass. Using SpecExec, we demonstrate inference of 50B+ parameter LLMs on consumer GPUs with RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens per second with 16-bit weights.
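The tree-building and validation loop described above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not the paper's implementation: the "models" are deterministic functions mapping a token prefix to a probability distribution over a three-token vocabulary, the draft tree is grown best-first by cumulative probability, and validation walks the tree accepting the target model's greedy choice for as long as it stays inside the drafted tree (plus one extra token where it diverges or runs out of draft).

```python
import heapq

VOCAB = ["a", "b", "c"]

def toy_probs(prefix):
    # Hypothetical "spiky" distribution: one token gets most of the mass,
    # keyed deterministically on the prefix length (illustration only).
    i = len(prefix) % len(VOCAB)
    probs = [0.1] * len(VOCAB)
    probs[i] = 0.8
    return dict(zip(VOCAB, probs))

draft_model = toy_probs
target_model = toy_probs  # perfectly aligned draft/target in this toy

def build_draft_tree(prefix, budget):
    # Best-first expansion: grow the `budget` globally most probable
    # continuations into a token tree rooted at `prefix`.
    root = tuple(prefix)
    tree = {root: []}             # node -> list of drafted child tokens
    heap = [(-1.0, root)]         # (-cumulative_probability, node)
    while heap and budget > 0:
        neg_p, node = heapq.heappop(heap)
        for tok, p in sorted(draft_model(node).items(), key=lambda kv: -kv[1]):
            if budget == 0:
                break
            child = node + (tok,)
            tree[node].append(tok)
            tree[child] = []
            heapq.heappush(heap, (neg_p * p, child))
            budget -= 1
    return tree

def validate(prefix, tree):
    # Single conceptual target-model pass over the cached tree: descend
    # while the target's greedy token was drafted; the first token outside
    # the tree still comes "for free" from the same pass.
    node = tuple(prefix)
    accepted = []
    while True:
        best = max(target_model(node).items(), key=lambda kv: kv[1])[0]
        accepted.append(best)
        if best not in tree.get(node, []):
            return accepted
        node = node + (best,)
```

With a draft budget of 4 tokens, the toy target accepts a multi-token continuation in one validation pass; in the paper's setting this is what amortizes the cost of each offloaded target-model iteration over many generated tokens.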