Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a program's intent. However, there is typically no guarantee that a program's implementation and its natural language documentation are aligned. In the case of a conflict, leveraging the information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. In practice, however, this information is often underutilized due to the inherent ambiguity of natural language, which makes natural language intent challenging to check programmatically. The emergent abilities of Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent into programmatically checkable assertions. However, it is unclear whether LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent, and whether such translation could be useful in practice. In this paper, we describe nl2postcond, the problem of leveraging LLMs to transform informal natural language into formal-method postconditions expressed as program assertions. We introduce and validate metrics to measure and compare different nl2postcond approaches, based on the correctness and discriminative power of generated postconditions. We then use qualitative and quantitative methods to assess the quality of nl2postcond postconditions, finding that they are generally correct and able to discriminate incorrect code. Finally, we find that nl2postcond via LLMs has the potential to be helpful in practice; nl2postcond-generated postconditions were able to catch 64 real-world historical bugs from Defects4J.
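To make the setting concrete, the following is a minimal, hypothetical sketch (the function and assertion are illustrative, not drawn from the paper's benchmarks) of the kind of translation nl2postcond targets: an informal docstring on the left-hand side, and a programmatically checkable postcondition assertion derived from it.

```python
def absolute_value(x: int) -> int:
    """Return the absolute value of x."""  # informal natural language intent
    return x if x >= 0 else -x

def absolute_value_checked(x: int) -> int:
    result = absolute_value(x)
    # Candidate postcondition an LLM might generate from the docstring:
    # the result is non-negative and equals either x or -x.
    assert result >= 0 and (result == x or result == -x)
    return result
```

A correct postcondition of this form holds on the reference implementation, yet it can discriminate buggy variants (e.g., an implementation that returns `x` unconditionally fails the assertion for negative inputs), which is the intuition behind the correctness and discrimination metrics described above.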