LMs are often expected to generate strings in some formal language; for example, structured data, API calls, or code snippets. Although LMs can be tuned to improve their adherence to formal syntax, this does not guarantee conformance, especially with smaller LMs suitable for large-scale deployment. In addition, tuning requires significant resources, making it impractical for uncommon or task-specific formats. To prevent downstream parsing errors, we would ideally constrain the LM to produce only valid output, but this is severely complicated by tokenization, which is typically both ambiguous and misaligned with the formal grammar. We solve these issues through the application of automata theory, deriving an efficient closed-form solution for the regular languages, a broad class of formal languages with many practical applications, including API calls and schema-guided JSON and YAML. We also discuss pragmatic extensions for coping with the issue of high branching factor. Finally, we extend our techniques to deterministic context-free languages, which similarly admit an efficient closed-form solution. Despite its flexibility and representational power, our approach requires only access to per-token decoding logits and lowers into simple calculations that are independent of LM size, making it both efficient and easy to apply to almost any LM architecture.
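As a minimal sketch of the core idea, the snippet below compiles a hand-written character-level DFA for the toy regular language `(ab)+` and lifts it to a multi-character token vocabulary: a token is allowed in a given DFA state iff consuming all of its characters leaves the automaton in a live (non-dead) state, and disallowed tokens have their logits masked to negative infinity. The vocabulary, DFA, and `mask_logits` helper are illustrative assumptions for exposition, not the paper's actual implementation; note that ambiguous tokenizations are handled implicitly, since every surviving token path corresponds to a valid character string.

```python
import math

# Hand-built DFA for the regular language (ab)+ over {a, b}.
# States: 0 = start, 1 = just read 'a', 2 = just read 'ab' (accepting).
DELTA = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1}
ACCEPT = {2}
DEAD = -1  # sink state for any undefined transition

def step(state, ch):
    """Advance the DFA by one character, falling into DEAD on mismatch."""
    if state == DEAD:
        return DEAD
    return DELTA.get((state, ch), DEAD)

def run(state, token):
    """Consume a whole (possibly multi-character) token."""
    for ch in token:
        state = step(state, ch)
    return state

# Toy token vocabulary; real LM tokenizers are much larger, but the
# per-state computation below does not depend on the LM itself.
VOCAB = ["a", "b", "ab", "ba", "abab"]

def allowed_tokens(state):
    """Tokens whose characters keep the automaton out of the dead state."""
    return [t for t in VOCAB if run(state, t) != DEAD]

def mask_logits(logits, state):
    """Mask disallowed tokens to -inf so sampling cannot select them."""
    ok = set(allowed_tokens(state))
    return [x if VOCAB[i] in ok else -math.inf
            for i, x in enumerate(logits)]

# From the start state, only tokens compatible with (ab)+ survive.
print(allowed_tokens(0))           # tokens beginning a valid string
print(mask_logits([1.0] * 5, 0))   # masked logit vector
```

In practice, the per-state allowed-token sets can be precomputed once per (grammar, vocabulary) pair, so decoding adds only a table lookup and a mask per step, consistent with the size-independence claim above.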