Context: This paper provides an in-depth examination of the generation and evaluation of Metamorphic Relations (MRs) using GPT models developed by OpenAI, with a particular focus on the capabilities of GPT-4 in software testing environments. Objective: The aim is to examine the quality of MRs produced by GPT-3.5 and GPT-4 for a specific System Under Test (SUT) adopted from an earlier study, and to introduce and apply an improved set of evaluation criteria to a diverse range of SUTs. Method: The initial phase evaluates MRs generated by GPT-3.5 and GPT-4 using criteria from a prior study; an enhanced evaluation framework is then applied to MRs created by GPT-4 for nine SUTs, ranging from simple programs to complex systems incorporating AI/ML components. A custom-built GPT evaluator, alongside human evaluators, assessed the MRs, enabling a direct comparison between automated and human evaluation methods. Results: The study finds that GPT-4 outperforms GPT-3.5 in generating accurate and useful MRs. Under the enhanced evaluation criteria, GPT-4 demonstrates a marked ability to produce high-quality MRs across a wide range of SUTs, including complex systems incorporating AI/ML components. Conclusions: GPT-4 exhibits advanced capabilities in generating MRs suitable for various applications. The research underscores the growing potential of AI in software testing, particularly in the generation and evaluation of MRs, and points to the complementarity of human and AI skills in this domain.
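For readers unfamiliar with the concept, the following is a minimal sketch of what a metamorphic relation looks like in executable form. These two example MRs (a trigonometric identity and sort-permutation invariance) are standard textbook illustrations, not relations taken from this study or generated by GPT-4:

```python
import math

# A metamorphic relation (MR) links the outputs of a program on related
# inputs, sidestepping the test-oracle problem of knowing the exact
# expected output for each individual input.

def satisfies_sine_mr(x: float, tol: float = 1e-9) -> bool:
    """MR for sine: sin(x) should equal sin(pi - x)."""
    return math.isclose(math.sin(x), math.sin(math.pi - x), abs_tol=tol)

def satisfies_sort_mr(xs: list) -> bool:
    """MR for sorting: the result is invariant under input permutation."""
    return sorted(xs) == sorted(list(reversed(xs)))

# A metamorphic test executes the program on a source input and a
# transformed follow-up input, then checks the relation between outputs.
assert satisfies_sine_mr(0.7)
assert satisfies_sort_mr([3, 1, 2])
```

Evaluating LLM-generated MRs of this kind means judging, for each candidate relation, whether it actually holds for the SUT and whether it is useful for revealing faults.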