Recent developments show that Large Language Models (LLMs) achieve state-of-the-art performance on natural language (NL) to code generation for resource-rich general-purpose languages such as C++, Java, and Python. However, their practical use for structured domain-specific languages (DSLs) such as YAML and JSON is limited by domain-specific schemas, grammars, and customizations that are generally unseen by LLMs during pre-training. Efforts have been made to mitigate this challenge through in-context learning with relevant examples or through fine-tuning, but these approaches suffer from problems such as limited DSL samples and prompt sensitivity. Enterprises, however, typically maintain good documentation of their DSLs. We therefore propose DocCGen, a framework that leverages this rich knowledge by breaking the NL-to-code generation task for structured languages into a two-step process. First, it detects the correct libraries by matching the NL query against library documentation. Then, it uses schema rules extracted from the documentation of these libraries to constrain the decoding. We evaluate our framework on two complex structured languages, Ansible YAML and Bash commands, in two settings: out-of-domain (OOD) and in-domain (ID). Our extensive experiments show that DocCGen consistently improves language models of different sizes across all six evaluation metrics, reducing syntactic and semantic errors in structured code. We plan to open-source the datasets and code to motivate research in constrained code generation.
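The two-step process described above can be sketched at a toy scale. The snippet below is a minimal, hypothetical illustration (not the paper's implementation): step 1 stands in for documentation-based library detection using simple keyword overlap in place of a learned retriever, and step 2 mimics schema-constrained decoding by filtering candidate module keys against the allowed set extracted from documentation. All names, documents, and schemas here are invented for illustration.

```python
# Toy sketch of DocCGen's two-step idea; data and names are illustrative only.

TOY_DOCS = {
    "ansible.builtin.copy": "copy files to remote locations src dest mode",
    "ansible.builtin.yum": "install remove packages name state yum",
}

TOY_SCHEMAS = {
    "ansible.builtin.copy": {"src", "dest", "mode"},
    "ansible.builtin.yum": {"name", "state"},
}

def detect_library(query: str) -> str:
    """Step 1: pick the library whose documentation best matches the NL query
    (keyword overlap stands in for a learned retriever)."""
    q = set(query.lower().split())
    return max(TOY_DOCS, key=lambda lib: len(q & set(TOY_DOCS[lib].split())))

def constrained_keys(lib: str, candidate_keys: list[str]) -> list[str]:
    """Step 2: keep only keys the library schema allows, mimicking
    decoding that masks out schema-violating continuations."""
    return [k for k in candidate_keys if k in TOY_SCHEMAS[lib]]

lib = detect_library("install the httpd packages with yum")
print(lib)                                               # ansible.builtin.yum
print(constrained_keys(lib, ["name", "mode", "state"]))  # ['name', 'state']
```

In a real system, step 2 would operate token by token inside the decoder, masking logits of continuations that violate the schema, rather than post-filtering a finished key list.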