A Study on Robustness and Reliability of Large Language Model Code Generation

Recently, large language models (LLMs) have shown an extraordinary ability to understand natural language and generate programming code. It has become common practice for software engineers to consult LLMs when they encounter coding questions. Although efforts have been made to avoid syntax errors and to align the code with the intended semantics, the reliability and robustness of code generated by LLMs have not yet been thoroughly studied. Executable code is not necessarily reliable and robust code, especially in the context of real-world software development. The misuse of APIs in generated code can lead to severe problems, such as resource leaks and program crashes. To make things worse, the users of LLM code generation services are the developers most vulnerable to code that merely seems right: they are typically novice developers who are not familiar with the APIs used in the code that LLMs generate for them. They can therefore hardly spot the misuse in LLM-generated code, which makes it easier for incorrect code to end up in real-world software. Existing code evaluation benchmarks and datasets focus on small tasks such as the programming questions used in coding interviews, which deviate from the problems developers actually bring to LLMs for real-world coding help. To fill this gap, we propose RobustAPI, a dataset for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from StackOverflow on 24 representative Java APIs. We summarize the common misuse patterns of these APIs and evaluate them on current popular LLMs. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code were introduced into real-world software.
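
To make the distinction between executable and robust code concrete, the following is a minimal Java sketch of one kind of API misuse: a reader that is not closed when an exception occurs. The class and method names (`ApiMisuseSketch`, `firstLineLeaky`, `firstLineSafe`) are hypothetical and chosen only for illustration; the example is not taken from the RobustAPI dataset and is not claimed to be one of its misuse patterns.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration: both methods compile and return the same result
// on the happy path, but only one of them releases the file handle reliably.
public class ApiMisuseSketch {

    // Misuse: if readLine() throws, close() is never reached and the
    // underlying file handle leaks.
    public static String firstLineLeaky(Path file) throws IOException {
        BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
        String line = reader.readLine();
        reader.close(); // skipped when readLine() throws
        return line;
    }

    // Safer use: try-with-resources closes the reader on every exit path,
    // including when an exception is thrown.
    public static String firstLineSafe(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("robustness-sketch", ".txt");
        try {
            Files.write(tmp, "hello\nworld\n".getBytes(StandardCharsets.UTF_8));
            System.out.println(firstLineLeaky(tmp));
            System.out.println(firstLineSafe(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Both methods pass a simple functional test, which is why a check for executability or output correctness alone would not distinguish them.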
