This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is to use small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, each of whose nodes represents a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirements of serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/
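To make the token tree idea concrete, the following is a minimal sketch of greedy token tree verification, not SpecInfer's actual implementation. The names `TokenNode`, `verify_greedy`, and `llm_next` are hypothetical, the "LLM" is a toy lookup table, and the LLM is queried one node at a time, whereas the abstract describes checking all candidate sequences in a single parallel pass. Under these assumptions, the sketch shows why verification can preserve output quality: every emitted token is the one the LLM itself would have chosen.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class TokenNode:
    """One node of a speculated token tree (hypothetical structure).

    The path from a root to this node spells out one candidate
    token sequence proposed by the small speculative model(s).
    """
    token: str
    children: List["TokenNode"] = field(default_factory=list)


def verify_greedy(
    prompt: Sequence[str],
    roots: List[TokenNode],
    llm_next: Callable[[List[str]], str],
) -> List[str]:
    """Walk the token tree, keeping only tokens the LLM agrees with.

    At each level the LLM's own greedy token is emitted. If a speculated
    child matches it, verification continues into that subtree;
    otherwise it stops. The result therefore equals what plain
    incremental greedy decoding would have produced.
    """
    accepted: List[str] = []
    candidates = roots
    while True:
        target = llm_next(list(prompt) + accepted)  # the LLM's own choice
        accepted.append(target)
        match = next((c for c in candidates if c.token == target), None)
        if match is None:
            # No speculated branch matches: stop after the LLM's token.
            return accepted
        candidates = match.children


# Toy stand-in for the LLM (an assumption for illustration only):
# the next token depends solely on the previous token.
TABLE = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}


def toy_llm_next(seq: List[str]) -> str:
    return TABLE[seq[-1]]


# A speculated tree with two branches after the prompt "the".
tree = [
    TokenNode("dog", [TokenNode("ran")]),
    TokenNode("cat", [TokenNode("sat", [TokenNode("down")]), TokenNode("is")]),
]

result = verify_greedy(["the"], tree, toy_llm_next)
print(result)  # ['cat', 'sat', 'on']
```

In this toy run, two speculated tokens ("cat", "sat") are accepted and the LLM contributes one further token ("on") at the point where speculation diverges, so three tokens are produced from a single verification of the tree instead of three separate decoding steps.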