Large Language Models are powerful tools for program synthesis and advanced auto-completion, but they come with no guarantee that their output code is syntactically correct. This paper contributes an incremental parser that allows early rejection of syntactically incorrect code, as well as efficient detection of complete programs for fill-in-the-middle (FItM) tasks. We develop Earley-style parsers that operate over left and right quotients of arbitrary context-free grammars, and we extend our incremental parsing and quotient operations to several context-sensitive features present in the grammars of many common programming languages. The result is an efficient, general, and well-grounded method for left and right quotient parsing. To validate our theoretical contributions -- and the practical effectiveness of certain design decisions -- we evaluate our method on the particularly difficult case of FItM completion for Python 3. Our results demonstrate that constrained generation can significantly reduce the incidence of syntax errors in recommended code.
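To make the idea of early rejection concrete, the following is a minimal sketch of an incremental Earley-style recognizer. It is an illustration only, not the parser developed in this paper: it handles plain prefixes of a context-free language and does not implement the left and right quotient constructions or any context-sensitive features. The class name `IncrementalRecognizer`, its methods, and the toy balanced-parentheses grammar are all hypothetical. The sketch assumes a grammar with no useless symbols, so that a non-empty Earley state set implies the tokens consumed so far can still be extended to a valid string.

```python
from typing import NamedTuple


class Item(NamedTuple):
    """An Earley item: a production, a dot position, and an origin index."""
    head: str
    body: tuple
    dot: int
    origin: int

    def next_symbol(self):
        # Symbol right after the dot, or None when the item is complete.
        return self.body[self.dot] if self.dot < len(self.body) else None


class IncrementalRecognizer:
    """Hypothetical token-by-token recognizer (illustration only)."""

    def __init__(self, grammar, start):
        # grammar maps each nonterminal to a list of bodies (tuples of symbols);
        # any symbol that is not a key of the grammar is treated as a terminal.
        self.grammar = grammar
        self.start = start
        initial = {Item(start, body, 0, 0) for body in grammar[start]}
        self.chart = [self._close(initial, 0)]

    def _close(self, items, position):
        # Apply prediction and completion until nothing new is added.
        # The fixpoint loop also copes with empty (epsilon) productions.
        items = set(items)
        changed = True
        while changed:
            changed = False
            for item in list(items):
                symbol = item.next_symbol()
                if symbol in self.grammar:  # predict
                    new = {Item(symbol, body, 0, position)
                           for body in self.grammar[symbol]}
                elif symbol is None:  # complete
                    parents = (items if item.origin == position
                               else self.chart[item.origin])
                    new = {Item(p.head, p.body, p.dot + 1, p.origin)
                           for p in parents if p.next_symbol() == item.head}
                else:  # terminal after the dot: handled by feed()
                    new = set()
                if not new <= items:
                    items |= new
                    changed = True
        return items

    def feed(self, token):
        """Consume one token; return False (state unchanged) if it is rejected."""
        scanned = {Item(i.head, i.body, i.dot + 1, i.origin)
                   for i in self.chart[-1] if i.next_symbol() == token}
        if not scanned:
            return False  # no item can scan this token: reject early
        self.chart.append(self._close(scanned, len(self.chart)))
        return True

    def is_complete(self):
        """True if the tokens consumed so far form a complete sentence."""
        return any(i.head == self.start and i.origin == 0
                   and i.next_symbol() is None for i in self.chart[-1])


# Toy grammar for balanced parentheses: S -> ( S ) S | epsilon
BALANCED = {"S": [("(", "S", ")", "S"), ()]}

recognizer = IncrementalRecognizer(BALANCED, "S")
accepted = [recognizer.feed(tok) for tok in "(())"]
```

In a constrained-generation loop, a candidate token for which `feed` would return `False` could be masked out before sampling, and `is_complete` could signal that generation may stop. Extending this from prefixes to the fill-in-the-middle setting, where a fixed suffix must also be respected, is what the quotient constructions address.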