A Study on Robustness and Reliability of Large Language Model Code Generation

Recently, large language models (LLMs) have shown extraordinary ability in understanding natural language and generating programming code. It has become common practice for software engineers to consult LLMs when they encounter coding questions. Although efforts have been made to avoid syntax errors and align the code with the intended semantics, the reliability and robustness of code generated by LLMs have not yet been thoroughly studied. Executable code is not necessarily reliable and robust code, especially in the context of real-world software development. The misuse of APIs in generated code could lead to severe problems, such as resource leaks and program crashes. To make things worse, the users of LLM code generation services are the developers most vulnerable to code that merely seems right: they are often novice developers who are not familiar with the APIs used in the code that LLMs generate for them. Therefore, they can hardly spot the misuse in the generated code, which makes it easier for incorrect code to end up in real-world software. Existing code evaluation benchmarks and datasets focus on crafting small tasks such as programming questions in coding interviews, which deviate from the problems that developers actually bring to LLMs for real-world coding help. To fill this gap, in this work we propose RobustAPI, a dataset for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from StackOverflow on 24 representative Java APIs. We summarize the common misuse patterns of these APIs and evaluate them on current popular LLMs. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code were introduced into real-world software.

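To make the distinction between executable code and robust code concrete, the following minimal Java sketch contrasts two ways of reading the first line from a character stream. It is an illustration only and is not taken from the RobustAPI dataset; the class name `ApiMisuseSketch`, the helper `TrackingReader`, and both method names are hypothetical. Both methods return the same result, but only the second one releases the underlying resource, which is the kind of difference a novice developer is unlikely to notice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ApiMisuseSketch {

    /** Hypothetical helper: a Reader that records whether close() was called. */
    static class TrackingReader extends StringReader {
        boolean closed = false;

        TrackingReader(String text) {
            super(text);
        }

        @Override
        public void close() {
            closed = true;
            super.close();
        }
    }

    /** Executable and returns the right answer, but never closes the reader. */
    static String firstLineLeaky(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        return reader.readLine();
    }

    /** Same result; try-with-resources closes the reader on every exit path. */
    static String firstLineSafe(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        TrackingReader leaky = new TrackingReader("hello\nworld");
        TrackingReader safe = new TrackingReader("hello\nworld");

        // Both calls return "hello"; only the second leaves the reader closed.
        System.out.println(firstLineLeaky(leaky) + " closed=" + leaky.closed);
        System.out.println(firstLineSafe(safe) + " closed=" + safe.closed);
    }
}
```

With an in-memory `StringReader` the leak is harmless, but with a file or socket behind the reader the first variant would hold an operating-system handle open, which is one plausible way the resource leaks mentioned above can arise.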