Context: This paper provides an in-depth examination of the generation and evaluation of Metamorphic Relations (MRs) using GPT models developed by OpenAI, with a particular focus on the capabilities of GPT-4 in software testing environments. Objective: The aim is to examine the quality of MRs produced by GPT-3.5 and GPT-4 for a specific System Under Test (SUT) adopted from an earlier study, and to introduce and apply an improved set of evaluation criteria to a diverse range of SUTs. Method: The initial phase evaluates MRs generated by GPT-3.5 and GPT-4 using criteria from a prior study; an enhanced evaluation framework is then applied to MRs created by GPT-4 for nine SUTs, ranging from simple programs to complex systems incorporating AI/ML components. A custom-built GPT evaluator, alongside human evaluators, assessed the MRs, enabling a direct comparison between automated and human evaluation methods. Results: The study finds that GPT-4 outperforms GPT-3.5 in generating accurate and useful MRs. Under the enhanced evaluation criteria, GPT-4 demonstrates a marked ability to produce high-quality MRs across a wide range of SUTs, including complex systems incorporating AI/ML components. Conclusions: GPT-4 exhibits advanced capabilities in generating MRs suitable for various applications. The research underscores the growing potential of AI in software testing, particularly in the generation and evaluation of MRs, and points to the complementarity of human and AI skills in this domain.
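For readers unfamiliar with the concept, the following is a minimal sketch of what a metamorphic relation looks like in executable form. These two example MRs (a trigonometric identity and sort-permutation invariance) are standard textbook illustrations, not relations taken from this study or generated by GPT-4:

```python
import math

# A metamorphic relation (MR) links the outputs of a program on related
# inputs, sidestepping the test-oracle problem of knowing the exact
# expected output for each individual input.

def satisfies_sine_mr(x: float, tol: float = 1e-9) -> bool:
    """MR for sine: sin(x) should equal sin(pi - x)."""
    return math.isclose(math.sin(x), math.sin(math.pi - x), abs_tol=tol)

def satisfies_sort_mr(xs: list) -> bool:
    """MR for sorting: the result is invariant under input permutation."""
    return sorted(xs) == sorted(list(reversed(xs)))

# A metamorphic test executes the program on a source input and a
# transformed follow-up input, then checks the relation between outputs.
assert satisfies_sine_mr(0.7)
assert satisfies_sort_mr([3, 1, 2])
```

Evaluating LLM-generated MRs of this kind means judging, for each candidate relation, whether it actually holds for the SUT and whether it is useful for revealing faults.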