Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems for programming and executing these applications are lacking. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations such as RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput than state-of-the-art inference systems on various large language and multi-modal models, on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. The code is publicly available at https://github.com/sgl-project/sglang.
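To make the idea of KV cache reuse across generation calls more concrete, here is a minimal sketch of prefix sharing. It is not SGLang's RadixAttention implementation: it uses a plain token-level trie rather than a radix tree, stores no actual key/value tensors, and has no eviction policy. The names `Node`, `PrefixCache`, `match_prefix`, and `insert` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps the next token id to a child node. A real radix tree would
    # compress single-child chains into one edge; this sketch does not.
    children: dict[int, "Node"] = field(default_factory=dict)


class PrefixCache:
    """Toy token-level trie standing in for a KV cache index (hypothetical)."""

    def __init__(self) -> None:
        self.root = Node()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list[int]) -> int:
        """Record tokens as cached and return how many were newly added."""
        node, added = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = Node()
                added += 1
            node = node.children[tok]
        return added


# Two requests that share a 4-token prompt prefix (token ids are made up).
cache = PrefixCache()
shared_prompt = [1, 2, 3, 4]
first_request = shared_prompt + [10, 11]
second_request = shared_prompt + [20, 21, 22]

new_first = cache.insert(first_request)       # nothing cached yet
reused = cache.match_prefix(second_request)   # tokens whose KV could be reused
new_second = cache.insert(second_request)     # only the unshared suffix is new

print(new_first, reused, new_second)
```

In this toy setting, the second request only needs fresh computation for its unshared suffix, which is the kind of saving that prefix-aware cache reuse targets in multi-call programs such as few-shot prompting or multi-turn chat.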