Recent progress in generative AI has significantly influenced software engineering, as AI-driven methods tackle common developer challenges such as synthesizing code from descriptions, repairing programs, and producing natural language summaries of existing programs. Large language models (LLMs), like OpenAI's Codex, are increasingly adopted in AI-driven software engineering. ChatGPT, another LLM, has gained considerable attention for its potential as a bot for discussing source code, suggesting changes, providing descriptions, and generating code. To evaluate the practicality of LLMs as programming assistant bots, it is essential to examine their performance on unseen problems and on a variety of tasks. In this paper, we conduct an empirical analysis of ChatGPT's potential as a fully automated programming assistant, focusing on code generation, program repair, and code summarization. Our study assesses ChatGPT's performance on common programming problems and compares it with state-of-the-art approaches on two benchmarks. Our results indicate that ChatGPT handles typical programming challenges effectively. However, we also uncover limitations in its attention span: comprehensive descriptions can restrict ChatGPT's focus and impede its ability to use its extensive knowledge for problem-solving. Surprisingly, we find that ChatGPT's summary explanations of incorrect code provide valuable insights into the developer's original intentions. This insight can serve as a foundation for future work addressing the oracle problem. Our study offers valuable perspectives on the development of LLMs for programming assistance, specifically by highlighting the significance of prompt engineering, and it improves our understanding of ChatGPT's practical applications in software engineering.