The applications of LLM agents are becoming increasingly complex and diverse, creating strong demand for structured outputs that can be parsed into code, structured function calls, and embodied-agent commands. These developments place significant demands on structured generation in LLM inference. Context-free grammars offer a flexible way to enable structured generation via constrained decoding. However, executing a context-free grammar requires checking multiple stack states against every token in the vocabulary at runtime, introducing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens, which can be prechecked in advance, and context-dependent tokens, which must be interpreted at runtime. We further build transformations that expand the grammar context to reduce the number of context-dependent tokens, and an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU execution. Evaluation results show that XGrammar achieves up to a 100x speedup over existing solutions. Combined with an LLM inference engine, it enables near-zero-overhead structured generation in end-to-end low-latency LLM serving.
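The vocabulary-partitioning idea in the abstract can be illustrated with a minimal sketch. This is not XGrammar's actual API or algorithm; it uses a hypothetical toy "grammar" of balanced parentheses, where the matcher's stack state is reduced to a nesting depth, and classifies tokens offline by probing a few representative states. Tokens whose validity is the same in every probed state are treated as context-independent (their verdict is cached), while the rest are context-dependent and checked at runtime.

```python
# Hedged sketch (assumed toy setup, not XGrammar's implementation):
# tokens over a balanced-parentheses "grammar", with the pushdown stack
# state abstracted to a single nesting depth.

def accepts(depth, token):
    """Return the new depth after consuming `token`, or None if it is invalid
    starting from the given nesting depth."""
    for ch in token:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return None  # closed more parens than were open
        else:
            return None      # character outside the grammar's alphabet
    return depth

VOCAB = ["(", ")", "()", "((", "))", "()(", "x"]

# A few representative stack states used for the offline precheck.
PROBE_STATES = [0, 1, 2, 3]

def classify(vocab):
    """Offline pass: split the vocabulary into context-independent tokens
    (same verdict in every probed state, so the verdict can be cached) and
    context-dependent tokens (verdict varies, must be checked at runtime)."""
    independent, dependent = {}, []
    for tok in vocab:
        verdicts = [accepts(s, tok) is not None for s in PROBE_STATES]
        if all(verdicts) or not any(verdicts):
            independent[tok] = verdicts[0]  # one cached answer for all contexts
        else:
            dependent.append(tok)           # needs a per-state runtime check
    return independent, dependent

INDEPENDENT, DEPENDENT = classify(VOCAB)

def token_mask(depth):
    """Runtime mask for one decoding step: reuse cached verdicts for
    context-independent tokens; only context-dependent tokens are re-checked."""
    mask = dict(INDEPENDENT)
    for tok in DEPENDENT:
        mask[tok] = accepts(depth, tok) is not None
    return mask
```

In this toy setting, `"("` and `"()"` are valid at every depth (context-independent, always allowed), `"x"` is invalid everywhere (context-independent, always rejected), while `")"` and `"))"` depend on the current depth and remain in the runtime-checked set. The real system performs this split exactly over the grammar rather than by probing sample states.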