Automation of particular Software Engineering (SE) tasks has moved from theory to reality. Numerous scholarly articles have documented the successful application of Artificial Intelligence (AI) to problems in areas such as project management, modeling, testing, and development. A recent innovation is ChatGPT, an ML-infused chatbot touted as a resource proficient in generating program code for developers and formulating software testing strategies for testers. Although there is speculation that AI-based computation can increase productivity and even substitute for software engineers in software development, empirical evidence to verify this is currently lacking. Moreover, while the primary focus has been on enhancing the accuracy of AI systems, non-functional requirements such as energy efficiency, vulnerability, fairness (i.e., human bias), and safety frequently receive insufficient attention. This paper posits that a comprehensive comparison of software engineers and AI-based solutions across various evaluation criteria is pivotal in fostering human-machine collaboration, enhancing the reliability of AI-based methods, and understanding which tasks suit humans and which suit AI. Such a comparison also facilitates the effective implementation of cooperative work structures and human-in-the-loop processes. The paper conducts an empirical investigation contrasting the performance of software engineers and AI systems such as ChatGPT across different evaluation metrics. The empirical study includes a case assessing ChatGPT-generated code against code produced by developers and uploaded to LeetCode.