A Study on Robustness and Reliability of Large Language Model Code Generation

Large language models (LLMs) have recently shown extraordinary ability in understanding natural language and generating programming code. It has become common practice for software engineers to consult LLMs when they encounter coding questions. Although efforts have been made to avoid syntax errors and to align the code with the intended semantics, the reliability and robustness of code generated by LLMs have not yet been thoroughly studied. Executable code is not equivalent to reliable and robust code, especially in the context of real-world software development. Misuse of APIs in the generated code can lead to severe problems such as resource leaks and program crashes. To make things worse, the users of LLM code generation services are the developers most vulnerable to code that merely seems right: they are typically novice developers who are not familiar with the APIs used in the code that LLMs generate for them. They can therefore hardly spot the misuse in LLM-generated code, which makes it more likely that incorrect code ends up in real-world software. Existing code evaluation benchmarks and datasets focus on small tasks such as the programming questions asked in coding interviews, which deviate from the problems that developers actually bring to LLMs for real-world coding help. To fill this gap, we propose RobustAPI, a dataset for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from StackOverflow on 24 representative Java APIs. We summarize the common misuse patterns of these APIs and evaluate them on current popular LLMs. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code were introduced into real-world software.
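To make the distinction between executable code and reliable code concrete, the following is a minimal illustrative sketch of one kind of API misuse that leads to a resource leak. It is not taken from the RobustAPI dataset, and the abstract does not state which 24 Java APIs are covered; the class name ApiMisuseSketch, both method names, and the choice of BufferedReader are assumptions made only for illustration.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical example (not from the paper): two ways to read the first line of a file.
public class ApiMisuseSketch {

    // Misuse: compiles, runs, and returns the right answer, but the reader is
    // never closed, so the underlying file handle leaks on every call.
    public static String firstLineLeaky(Path file) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(file.toFile()));
        return reader.readLine();
    }

    // Safer usage: try-with-resources closes the reader on every exit path,
    // including when readLine() throws.
    public static String firstLineSafe(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("robustapi-sketch", ".txt");
        tmp.toFile().deleteOnExit(); // best-effort cleanup of the temp file
        Files.write(tmp, Arrays.asList("first", "second"));

        // Both calls return the same value; only the second releases the file handle.
        System.out.println(firstLineLeaky(tmp));
        System.out.println(firstLineSafe(tmp));
    }
}
```

Both methods pass a simple functional test, which is why output-only checks would not distinguish them; the difference only shows up as exhausted file handles in a long-running program.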
