Recently, generative AI based on large language models (LLMs) has been gaining momentum for its impressive, high-quality performance in multiple domains, particularly after the release of ChatGPT. Many believe that LLMs have the potential to perform general-purpose problem-solving in software development and to replace human software developers. Nevertheless, there is a lack of serious investigation into the capability of these LLM techniques to fulfill software development tasks. In a controlled 2 $\times$ 2 between-subjects experiment with 109 participants, we examined whether and to what degree working with ChatGPT was helpful in a coding task and in a typical software development task, and how people work with ChatGPT. We found that while ChatGPT performed well in solving simple coding problems, its performance in supporting typical software development tasks was less satisfactory. We also observed the interactions between participants and ChatGPT and identified relations between those interactions and the outcomes. Our study thus provides first-hand insights into using ChatGPT to fulfill software engineering tasks with real-world developers, and motivates the need for novel interaction mechanisms that help developers work effectively with large language models to achieve desired outcomes.