LMs are often expected to generate strings in some formal language, such as structured data, API calls, or code snippets. Although LMs can be tuned to improve their adherence to formal syntax, this does not guarantee conformance, especially with smaller LMs suitable for large-scale deployment. In addition, tuning requires significant resources, making it impractical for uncommon or task-specific formats. To prevent downstream parsing errors, we would ideally constrain the LM to produce only valid output, but this is severely complicated by tokenization, which is typically both ambiguous and misaligned with the formal grammar. We solve these issues by applying automata theory, deriving an efficient closed-form solution for the regular languages, a broad class of formal languages with many practical applications, including API calls and schema-guided JSON and YAML. We also discuss pragmatic extensions for coping with high branching factors. Finally, we extend our techniques to deterministic context-free languages, which similarly admit an efficient closed-form solution. Despite its flexibility and expressive power, our approach requires access only to per-token decoding logits and lowers into simple calculations that are independent of LM size, making it both efficient and easy to apply to almost any LM architecture.
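To make the idea of constraining decoding through per-token logits concrete, the following is a minimal sketch and not the closed-form construction derived in this work. It assumes a hypothetical character-level DFA for the regular language `-?[0-9]+`, a toy vocabulary whose tokens do not align with the grammar, and a stand-in `fake_logits` function in place of a real LM. All names are illustrative. Each step naively rescans every token in the vocabulary, which a practical system would avoid.

```python
import math

EOS = "<eos>"
DIGITS = "0123456789"

# Hypothetical DFA for -?[0-9]+ :
# state 0 = start, 1 = saw '-', 2 = saw at least one digit (accepting).
ACCEPTING = {2}


def step(state, ch):
    """Advance the DFA by one character; return None if it is rejected."""
    if ch in DIGITS:
        return 2
    if ch == "-" and state == 0:
        return 1
    return None


def advance(state, token):
    """Run a whole (possibly multi-character) token through the DFA."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state


def is_allowed(state, token):
    """EOS is legal only in an accepting state; other tokens must stay in the DFA."""
    if token == EOS:
        return state in ACCEPTING
    return advance(state, token) is not None


def mask_logits(state, vocab, logits):
    """Set the logit of every token that would break the grammar to -inf."""
    return [l if is_allowed(state, t) else -math.inf for t, l in zip(vocab, logits)]


def constrained_greedy_decode(vocab, logits_fn, max_steps=8):
    """Greedy decoding that only ever picks grammar-conforming tokens."""
    state, out = 0, []
    for _ in range(max_steps):
        masked = mask_logits(state, vocab, logits_fn(out))
        best = max(range(len(vocab)), key=masked.__getitem__)
        if vocab[best] == EOS:
            break
        out.append(vocab[best])
        state = advance(state, vocab[best])
    return "".join(out)


# Toy vocabulary: tokens span grammar boundaries ("-3") or are invalid ("a", "3-").
VOCAB = ["-", "1", "12", "-3", "a", "3-", EOS]


def fake_logits(prefix):
    """Stand-in for an LM: fixed scores that favor the invalid token "a"."""
    return [0.1, 0.2, 0.3, 1.0, 5.0, 2.0, 0.9]


print(constrained_greedy_decode(VOCAB, fake_logits))  # prints "-3"
```

In this toy DFA every non-rejecting state can still reach the accepting state, so checking that a token keeps the automaton alive is sufficient. A general grammar would also need to rule out tokens that lead to states from which no valid completion exists.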