The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them quickly and cheaply. This paper introduces SpecInfer, an LLM serving system that accelerates generative LLM inference with speculative inference and token tree verification. A key insight behind SpecInfer is to combine various collectively boost-tuned small language models to jointly predict the LLM's outputs; the predictions are organized as a token tree, in which each node represents a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirements for serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.3-2.4× for distributed LLM inference and by 2.6-3.5× for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/tree/inference.
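The token tree idea can be illustrated with a minimal Python sketch. It is not SpecInfer's implementation: the names `TokenTreeNode`, `build_token_tree`, `verify_greedy`, and `toy_llm` are hypothetical, the "LLM" is a toy deterministic function, and only greedy decoding is assumed. The sketch also queries the toy LLM once per accepted token for readability, whereas the system described above scores every tree node in parallel in a single LLM pass.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class TokenTreeNode:
    """One tree node; the path from the root spells a candidate token sequence."""
    token: int
    children: Dict[int, "TokenTreeNode"] = field(default_factory=dict)


def build_token_tree(candidates: Sequence[Sequence[int]]) -> TokenTreeNode:
    """Merge candidate sequences (e.g. one per small model) into a single tree.

    Shared prefixes are stored once, so they only need to be verified once.
    """
    root = TokenTreeNode(token=-1)  # sentinel root: stands for the verified prefix
    for seq in candidates:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, TokenTreeNode(tok))
    return root


def verify_greedy(
    root: TokenTreeNode,
    prefix: Sequence[int],
    llm_next_token: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens the LLM itself would emit, following the tree while it agrees.

    Every returned token is the LLM's own greedy choice, so the output matches
    plain incremental decoding; the tree only determines how many tokens can be
    confirmed in one verification round.
    """
    accepted: List[int] = []
    node = root
    while True:
        target = llm_next_token(list(prefix) + accepted)
        accepted.append(target)          # the LLM's token is always kept
        child = node.children.get(target)
        if child is None:                # speculation diverged or the tree ended
            return accepted
        node = child


def toy_llm(context: List[int]) -> int:
    """Toy stand-in for the LLM (assumption): next token is last token + 1, mod 10."""
    return (context[-1] + 1) % 10


# Three hypothetical small-model guesses for what follows the prefix [1, 2].
tree = build_token_tree([[3, 4, 7], [3, 5], [3, 4, 5, 9]])
result = verify_greedy(tree, prefix=[1, 2], llm_next_token=toy_llm)
print(result)  # [3, 4, 5, 6]
```

Here the three guesses share the prefix `3`, and the merged tree contains the path `3, 4, 5`, which the toy LLM confirms. The final `6` is the LLM's own next token after the last matching node, so one verification round yields four tokens instead of one.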