Formalizing Natural Language Intent into Program Specifications via Large Language Models

Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a programs intent. However, there is typically no guarantee that a programs implementation and natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. In practice, however, this information is often underutilized due to the inherent ambiguity of natural language which makes natural language intent challenging to check programmatically. The "emergent abilities" of Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent to programmatically checkable assertions. However, it is unclear if LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent. Additionally, it is unclear if such translation could be useful in practice. In this paper, we describe LLM4nl2post, the problem leveraging LLMs for transforming informal natural language to formal method postconditions, expressed as program assertions. We introduce and validate metrics to measure and compare different LLM4nl2post approaches, using the correctness and discriminative power of generated postconditions. We then perform qualitative and quantitative methods to assess the quality of LLM4nl2post postconditions, finding that they are generally correct and able to discriminate incorrect code. Finally, we find that LLM4nl2post via LLMs has the potential to be helpful in practice; specifications generated from natural language were able to catch 70 real-world historical bugs from Defects4J.

翻译：描述代码功能的非正式自然语言（如代码注释或函数文档）可能包含大量关于程序意图的信息。然而，通常无法保证程序实现与自然语言文档的一致性。当二者存在冲突时，利用代码相关自然语言中的信息有望增强故障定位、调试和代码可信度。但实践中，由于自然语言固有的歧义性导致难以通过编程方式检验自然语言意图，这类信息往往未被充分利用。大语言模型（LLMs）的"涌现能力"有潜力促进将自然语言意图转化为可通过编程检查的断言。然而，尚不清楚LLMs能否正确地将非正式自然语言规范转化为与程序员意图匹配的形式化规范。此外，这种转化在实践中的实用性仍不明确。本文提出LLM4nl2post问题，即利用LLMs将非正式自然语言转化为形式化的方法后置条件（以程序断言形式表达）。我们引入并验证了用于衡量和比较不同LLM4nl2post方法的指标，这些指标基于生成后置条件的正确性和判别能力。随后，我们采用定性与定量方法评估LLM4nl2post生成后置条件的质量，发现其通常正确且具有区分错误代码的能力。最后，我们发现基于LLMs的LLM4nl2post在实践中具有潜在价值：从自然语言生成的规范能够捕获Defects4J中70个真实历史缺陷。