GPTCloneBench: A comprehensive benchmark of semantic clones and cross-language clones using GPT-3 model and SemanticCloneBench

With the emergence of machine learning, there has been a surge in applying it to problems across many domains. In code clone research, detecting Type-4 (semantic) clones has emerged as a crucial yet challenging task. Researchers aim to use machine learning to tackle this challenge, often relying on the BigCloneBench dataset. However, BigCloneBench was not originally designed for semantic clone detection and has several limitations that hinder its suitability as a comprehensive training dataset for this purpose. Furthermore, the CLCDSA dataset lacks reusable examples that align with real-world software systems, which makes it inadequate for cross-language clone detection approaches. In this work, we present GPTCloneBench, a comprehensive semantic clone and cross-language clone benchmark built by exploiting SemanticCloneBench and OpenAI's GPT-3 model. In particular, using code fragments from SemanticCloneBench as sample inputs, together with appropriate prompt engineering for the GPT-3 model, we generate semantic and cross-language clones of these fragments and then build the benchmark through a combination of extensive manual analysis, tool-assisted filtering, functionality testing, and automated validation. From 79,928 clone pairs of GPT-3 output, we created a benchmark with 37,149 true semantic clone pairs, 19,288 false semantic pairs (Type-1/Type-2), and 20,770 cross-language clones across four languages (Java, C, C#, and Python). Our benchmark is 15-fold larger than SemanticCloneBench, offers more functional code examples for software systems and broader programming language support than CLCDSA, and overcomes BigCloneBench's limitations in quality, quantification, and language variety.
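The abstract does not say how the tool-assisted filtering separates Type-1/Type-2 pairs (the "false semantic pairs") from semantic clone candidates. As an illustration only, here is a minimal sketch that assumes the pairs are Python fragments and that token-level comparison is enough. The function names `token_stream` and `classify_pair` are hypothetical and are not taken from the benchmark's tooling. Identifiers and literals are abstracted blindly, so consistent renaming is not checked.

```python
import io
import keyword
import tokenize

# Token types that carry no program meaning for this comparison.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Layout tokens whose text varies with formatting; keep only their kind.
_LAYOUT = {tokenize.NEWLINE: "NEWLINE", tokenize.INDENT: "INDENT",
           tokenize.DEDENT: "DEDENT"}


def token_stream(source: str, abstract: bool) -> list[str]:
    """Tokenize Python source, ignoring comments and blank lines.

    With abstract=True, identifiers become "ID" and number/string
    literals become "LIT" (a blind, Type-2 style normalization).
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type in _LAYOUT:
            out.append(_LAYOUT[tok.type])
        elif abstract and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif abstract and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        else:
            out.append(tok.string)
    return out


def classify_pair(a: str, b: str) -> str:
    """Label a pair as Type-1, Type-2, or a candidate for further checks."""
    if token_stream(a, abstract=False) == token_stream(b, abstract=False):
        return "Type-1"   # identical apart from comments and layout
    if token_stream(a, abstract=True) == token_stream(b, abstract=True):
        return "Type-2"   # identical after abstracting names and literals
    return "candidate"    # syntactically different; needs further validation
```

A pair labeled `candidate` is not yet a confirmed semantic clone. In the pipeline the abstract describes, such pairs would still need functionality testing and manual analysis to confirm that the two fragments behave the same.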
