Large Language Models (LLMs) such as GPT are state-of-the-art text generation models that provide significant assistance in daily routines. However, LLM execution is inherently sequential, since a model produces only one token at a time, which leads to low hardware utilization on modern GPUs. Batching and speculative decoding are two techniques for improving GPU utilization in LLM inference. To study their synergy, we build a prototype implementation and perform an extensive characterization analysis on various LLMs and GPU architectures. We observe that the optimal speculation length depends on the batch size. We analyze this key observation and build a quantitative model to explain it. Based on our analysis, we propose a new adaptive speculative decoding strategy that chooses the optimal speculation length for each batch size. Our evaluations show that the proposed method achieves equal or better performance than state-of-the-art speculative decoding schemes with a fixed speculation length.
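To make the interaction between batch size and speculation length concrete, the following is a minimal toy sketch and not the quantitative model from the paper. It assumes that each drafted token is accepted independently with probability `alpha`, that the target model's verification latency is flat up to a memory-bound floor and then grows linearly with the number of tokens processed, and that each drafted token adds a fixed draft-model cost. All function names and constants (`t_floor`, `t_per_token`, `draft_cost`) are hypothetical values chosen for illustration.

```python
def expected_tokens(alpha: float, s: int) -> float:
    """Expected tokens emitted per verification step when s tokens are drafted.

    Assumes independent per-token acceptance with probability alpha; the
    target model always contributes one token, hence the s + 1 terms.
    """
    if alpha >= 1.0:
        return float(s + 1)
    return (1.0 - alpha ** (s + 1)) / (1.0 - alpha)


def step_latency(batch_size: int, s: int, *, t_floor: float = 1.0,
                 t_per_token: float = 0.02, draft_cost: float = 0.05) -> float:
    """Toy latency of one draft-then-verify step (arbitrary time units)."""
    # The target model verifies batch_size * (s + 1) tokens in one pass.
    # Below the floor the pass is assumed memory-bound (flat latency);
    # above it, latency is assumed to scale linearly with token count.
    verify = max(t_floor, t_per_token * batch_size * (s + 1))
    # The draft model is assumed to run s sequential steps of fixed cost.
    draft = draft_cost * s
    return verify + draft


def best_speculation_length(batch_size: int, alpha: float = 0.8,
                            max_s: int = 8, **latency_kwargs) -> int:
    """Pick the speculation length that maximizes tokens per unit time."""
    return max(
        range(max_s + 1),
        key=lambda s: expected_tokens(alpha, s)
        / step_latency(batch_size, s, **latency_kwargs),
    )


if __name__ == "__main__":
    for b in (1, 16, 32, 64):
        print(f"batch_size={b:3d} -> speculation length {best_speculation_length(b)}")
```

Under these assumed constants, the selected speculation length shrinks as the batch size grows: small batches leave spare compute that long speculation can use, while large batches already saturate the GPU, so extra drafted tokens mostly add verification cost. An adaptive scheme in this spirit would look up or compute the length per batch size instead of fixing it in advance.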