Reducing the inference latency of large language models (LLMs) is crucial, and speculative decoding (SD) stands out as one of the most effective techniques. Rather than having the LLM generate all tokens directly, speculative decoding employs efficient proxies to predict potential outputs, which are then verified by the LLM without compromising generation quality. Yet deploying SD in real online LLM serving systems (with continuous batching) does not always yield improvement: under higher request rates or low speculation accuracy, it paradoxically increases latency. Furthermore, no single speculation length works best for all workloads under different system loads. Based on these observations, we develop a dynamic framework, SmartSpec. SmartSpec dynamically determines the best speculation length for each request (from 0, i.e., no speculation, to many tokens), and hence the associated speculative execution cost, based on a new metric called goodput, which characterizes the currently observed load of the entire system and the speculation accuracy. We show that SmartSpec consistently reduces average request latency by up to 3.2x compared to non-speculative decoding baselines across different target-model sizes, draft models, request rates, and datasets. Moreover, SmartSpec can be applied to different styles of speculative decoding, including traditional model-based approaches as well as model-free methods such as prompt lookup and tree-style decoding.
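The idea of picking a per-request speculation length by maximizing goodput (tokens produced per unit time) can be sketched as follows. This is a minimal illustration, not the paper's actual SmartSpec implementation: the acceptance-rate model is the standard geometric one for speculative decoding, and `step_time` is a hypothetical linear cost model with made-up constants.

```python
def expected_accepted(k: int, alpha: float) -> float:
    """Expected tokens produced per verification step when proposing k
    draft tokens with per-token acceptance rate alpha (includes the one
    token the target model always contributes)."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def step_time(k: int) -> float:
    """Toy cost model (assumed constants): drafting and verifying k
    extra tokens adds linear overhead to a fixed per-step latency."""
    return 1.0 + 0.2 * k

def best_speculation_length(alpha: float, max_k: int = 8) -> int:
    """Pick the k that maximizes goodput = expected tokens / step time."""
    return max(range(max_k + 1),
               key=lambda k: expected_accepted(k, alpha) / step_time(k))
```

Under this toy model, a high acceptance rate (e.g. `alpha=0.8`) favors a long speculation length, while a low one (e.g. `alpha=0.1`) drives the optimum to `k=0`, i.e. no speculation at all, mirroring the abstract's observation that speculation can hurt when accuracy is low.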