Language models (LMs) are often expected to generate strings in some formal language; for example, structured data, API calls, or code snippets. Although LMs can be tuned to improve their adherence to formal syntax, this does not guarantee conformance, especially with smaller LMs suitable for large-scale deployment. In addition, tuning requires significant resources, making it impractical for uncommon or task-specific formats. To prevent downstream parsing errors, we would ideally constrain the LM to only produce valid output, but this is severely complicated by tokenization, which is typically both ambiguous and misaligned with the formal grammar. We solve these issues through the application of automata theory, deriving an efficient closed-form solution for the regular languages, a broad class of formal languages with many practical applications, including API calls and schema-guided JSON and YAML. We also discuss pragmatic extensions for coping with the issue of high branching factor, and extend our techniques to deterministic context-free languages, which similarly admit an efficient closed-form solution. Previous work on this topic (Willard and Louf, 2023) layers bespoke solutions onto automata, leading to problems with speed, correctness, and extensibility. Instead, we reformulate the entire task in terms of automata so we can leverage well-studied and well-optimized algorithms. Our system compiles constraints ~7,000x faster, is provably correct, and can be extended in a modular fashion.
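To make the core idea concrete, the following minimal sketch (not the paper's actual implementation; all names here are illustrative) shows the naive version of automaton-based constrained decoding: a character-level DFA for a toy regular language is simulated over each vocabulary token, and any token that would drive the automaton into a dead state is masked out before the LM samples its next token.

```python
# Toy DFA accepting the regular language (ab)+.
# Transition table: state -> {char: next_state}; missing entries are dead.
DEAD = -1
DFA = {
    0: {"a": 1},
    1: {"b": 2},
    2: {"a": 1},
}
ACCEPT = {2}

def step(state, text):
    """Run the DFA over `text`, returning DEAD if any character has no transition."""
    for ch in text:
        state = DFA.get(state, {}).get(ch, DEAD)
        if state == DEAD:
            return DEAD
    return state

def allowed_tokens(state, vocab):
    """Tokens (possibly multi-character strings) that keep the DFA alive from `state`.

    This per-step scan over the vocabulary is the naive approach; the paper's
    contribution is to precompute an equivalent token-level automaton so that
    decoding requires no per-token simulation at all.
    """
    return [tok for tok in vocab if step(state, tok) != DEAD]

vocab = ["a", "b", "ab", "ba", "abab", "x"]
print(allowed_tokens(0, vocab))  # tokens usable at the start of (ab)+
```

Note that multi-character tokens like `"abab"` are exactly where tokenization misaligns with the grammar: a single token can traverse several DFA transitions, which is why the character-level automaton must be lifted to operate over tokens rather than characters.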