This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is to use small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, each of whose nodes represents a candidate token sequence. All candidate token sequences represented by a token tree are verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. Because SpecInfer uses the LLM as a token tree verifier rather than an incremental decoder, it significantly reduces the end-to-end latency and computational requirements of serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/
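To make the token tree idea concrete, the following is a minimal sketch of verification under greedy decoding only. It is not SpecInfer's implementation: `TokenNode`, `verify_token_tree`, and the `llm_next_token` callback are hypothetical names introduced for illustration, and the sketch queries the stand-in LLM once per step, whereas the system described above scores all tree nodes in a single parallel pass.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class TokenNode:
    """One node of a token tree; the path from a root spells a candidate sequence."""
    token: int
    children: List["TokenNode"] = field(default_factory=list)


def verify_token_tree(
    prompt: Sequence[int],
    roots: List[TokenNode],
    llm_next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy verification: follow the branch that matches the LLM's own choices.

    Returns the accepted speculated tokens followed by one token chosen by the
    LLM itself, so the result equals what greedy incremental decoding would emit.
    """
    accepted: List[int] = []
    candidates = roots
    while True:
        # Hypothetical stand-in for the LLM (assumption): returns the greedy
        # next token for the given prefix.
        target = llm_next_token(list(prompt) + accepted)
        accepted.append(target)
        match: Optional[TokenNode] = next(
            (node for node in candidates if node.token == target), None
        )
        if match is None:
            # Speculation diverged or the tree is exhausted; the last token
            # appended is the LLM's own choice.
            return accepted
        candidates = match.children


def toy_llm(prefix: Sequence[int]) -> int:
    """Toy deterministic 'LLM': the next token is the last token plus one, mod 10."""
    return (prefix[-1] + 1) % 10


# Speculated tree: 2 -> (3 -> 9, 4), plus a sibling root 7.
tree = [TokenNode(2, [TokenNode(3, [TokenNode(9)]), TokenNode(4)]), TokenNode(7)]
print(verify_token_tree([1], tree, toy_llm))  # [2, 3, 4]
```

In this toy run the speculated tokens 2 and 3 are accepted, the speculated 9 is rejected, and the LLM's own token 4 is appended, so one verification round yields three tokens instead of one. The output matches what the toy LLM would produce by decoding token by token, which illustrates in the greedy case why verification can preserve the model's output.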