SynCode: LLM Generation with Grammar Augmentation

LLMs are widely used in complex AI applications. These applications require LLM outputs to adhere to a specific format so that they can be integrated with other components of the system. Typically, the format rules, e.g., for data serialization formats such as JSON and YAML or for code in a programming language, are expressed as a context-free grammar (CFG). Because LLMs hallucinate and are unreliable, getting them to adhere to a specified syntax is an increasingly important challenge. We present SynCode, a novel framework for efficient and general syntactical decoding with LLMs, to address this challenge. SynCode leverages the CFG of a formal language, using an efficient lookup table constructed offline, called the DFA mask store, which is based on the deterministic finite automata (DFA) of the language grammar terminals. We demonstrate SynCode's soundness and completeness given the CFG of the formal language, showing that it retains syntactically valid tokens while rejecting invalid ones. SynCode integrates with any language defined by a CFG, as evidenced by experiments on generating JSON, Python, and Go outputs. Our experiments on JSON generation show that SynCode eliminates all syntax errors and significantly outperforms state-of-the-art baselines. Furthermore, SynCode reduces syntax errors in generated Python and Go code by 96.07%, a substantial improvement in the syntactic precision of LLM generation. Our code is available at https://github.com/uiuc-focal-lab/syncode
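
To make the idea of an offline mask lookup concrete, the following is a minimal sketch, not SynCode's actual implementation. It assumes a toy grammar with a single terminal, INT = [0-9]+, and a six-token vocabulary; the names TRANSITIONS, VOCAB, build_mask_store, and constrained_argmax are hypothetical and chosen only for illustration. The real DFA mask store described above is built from the DFAs of all grammar terminals and is combined with a parser, which this sketch omits.

```python
from typing import Dict, List, Optional, Tuple

# Toy DFA for the terminal INT = [0-9]+ (an assumption for illustration).
# State 0 is the start state; state 1 is the accepting state.
DIGITS = "0123456789"
TRANSITIONS: Dict[Tuple[int, str], int] = {}
for d in DIGITS:
    TRANSITIONS[(0, d)] = 1
    TRANSITIONS[(1, d)] = 1

# Hypothetical LLM vocabulary; real vocabularies have tens of thousands of tokens.
VOCAB: List[str] = ["1", "23", "4a", "abc", "007", " "]


def walk(state: int, text: str) -> Optional[int]:
    """Run the DFA over text; return the end state, or None if no transition exists."""
    for ch in text:
        nxt = TRANSITIONS.get((state, ch))
        if nxt is None:
            return None
        state = nxt
    return state


def build_mask_store(states: List[int]) -> Dict[int, List[bool]]:
    """Offline step: for each DFA state, mark the tokens that keep the DFA alive."""
    return {s: [walk(s, tok) is not None for tok in VOCAB] for s in states}


MASK_STORE = build_mask_store([0, 1])


def constrained_argmax(logits: List[float], state: int) -> int:
    """Online step: look up the precomputed mask and pick the best allowed token."""
    mask = MASK_STORE[state]
    allowed = [i for i, ok in enumerate(mask) if ok]
    return max(allowed, key=lambda i: logits[i])
```

In this sketch, the unconstrained model would pick the token "abc" for logits such as [0.1, 0.2, 5.0, 9.0, 0.3, 1.0], whereas constrained_argmax restricts the choice to the digit-only tokens. The expensive part (walking every token through the DFA) happens once in build_mask_store, so the per-step cost at decoding time is a dictionary lookup plus a masked argmax.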
