Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation

Recently, large language models (LLMs) have shown an extraordinary ability to understand natural language and generate programming code. It has become common practice for software engineers to consult LLMs when they encounter coding questions. Although efforts have been made to avoid syntax errors and to align the code with the intended semantics, the reliability and robustness of code generation from LLMs have not yet been thoroughly studied. Executable code is not equivalent to reliable and robust code, especially in the context of real-world software development. The misuse of APIs in generated code could lead to severe problems, such as resource leaks and program crashes. To make things worse, the users of LLM code generation services are the developers most vulnerable to code that only seems right: they are often novice developers who are not familiar with the APIs used in the code that LLMs generate for them. Therefore, they can hardly spot the misuses in LLM-generated code, which further facilitates the adoption of incorrect code in real-world software. Existing code evaluation benchmarks and datasets focus on crafting small tasks, such as programming questions from coding interviews, which deviate from the problems that developers actually bring to LLMs for real-world coding help. To fill this gap, we propose RobustAPI, a dataset for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from StackOverflow on 24 representative Java APIs. We summarize the common misuse patterns of these APIs and evaluate them on current popular LLMs. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code were introduced into real-world software.
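To make the distinction between executable code and reliable code concrete, the following is a minimal Java sketch of one kind of API misuse the abstract mentions, a resource leak. It is a hypothetical illustration and is not drawn from the RobustAPI dataset; the class name `ApiMisuseSketch` and both method names are assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical example: two methods with identical output but different reliability.
public class ApiMisuseSketch {

    // Misuse: compiles and returns the right answer, but the reader is never
    // closed, so every call leaves a file handle open until garbage collection.
    public static String firstLineLeaky(Path path) throws IOException {
        BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
        return reader.readLine();
    }

    // Correct usage: try-with-resources closes the reader on every exit path,
    // including when readLine() throws.
    public static String firstLineSafe(Path path) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("api-misuse-sketch", ".txt");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, Arrays.asList("hello", "world"), StandardCharsets.UTF_8);

        // Both calls print "hello"; a test that only checks output cannot tell them apart.
        System.out.println(firstLineLeaky(tmp));
        System.out.println(firstLineSafe(tmp));
    }
}
```

Both methods pass a functional test, which is why a developer unfamiliar with the API might accept the leaky version without noticing the problem.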
