Automation of specific Software Engineering (SE) tasks has moved from theory to reality. Numerous scholarly articles document the successful application of Artificial Intelligence (AI) to problems in areas such as project management, modeling, testing, and development. A recent innovation is ChatGPT, an ML-infused chatbot touted as proficient at generating program code for developers and formulating software testing strategies for testers. Although there is speculation that AI-based computation can increase productivity and even substitute for software engineers in software development, empirical evidence to verify this is currently lacking. Moreover, while the primary focus has been on enhancing the accuracy of AI systems, non-functional requirements such as energy efficiency, vulnerability, fairness (i.e., human bias), and safety frequently receive insufficient attention. This paper posits that a comprehensive comparison of software engineers and AI-based solutions across various evaluation criteria is pivotal for fostering human-machine collaboration, enhancing the reliability of AI-based methods, and understanding which tasks suit humans and which suit AI. Such a comparison also facilitates the effective implementation of cooperative work structures and human-in-the-loop processes. The paper conducts an empirical investigation that contrasts the performance of software engineers and AI systems, like ChatGPT, across different evaluation metrics. The empirical study includes a case that assesses ChatGPT-generated code against code produced by developers and uploaded to Leetcode.