GPTCloneBench: A comprehensive benchmark of semantic clones and cross-language clones using GPT-3 model and SemanticCloneBench

With the rise of machine learning (ML), BigCloneBench has become a popular benchmark for semantic clone detection tools. However, BigCloneBench contains only a small number of Java semantic clones. In addition, because of the design principles behind the benchmark, imbalance issues have been identified, along with ambiguity in the definition of semantic clones. As a result, ML-based clone detection algorithms trained on BigCloneBench may overlook semantic clones or report incorrect results. SemanticCloneBench features clones drawn from Stack Overflow in several languages, but it does not contain enough samples for ML-based clone detection. There is also a marked lack of cross-language clone benchmarks. The widely used CLCDSA dataset lacks reusable examples that can be used in real-world software systems, which makes it inadequate for ML-based clone detection. The OpenAI GPT-3 model has shown outstanding text generation capabilities, including code generation and summarization. In this paper, we use the GPT-3 model to generate a comprehensive benchmark of both semantic and cross-language clones. Taking the genuine clones in SemanticCloneBench as input, we tested several prompts to determine which prompt formulation yielded the best GPT-3 output. We then used NiCad to filter Type-1 and Type-2 clones out of the GPT-3 output. Nine judges manually validated all clone pairs using a GUI-assisted Clone Validator tool. Functionality testing and CloneCognition verified that our benchmark contains no syntactic clones. Finally, we evaluated the SourcererCC, Oreo, and CLCDSA tools on our benchmark. The poor performance of these tools suggests that GPTCloneBench contains no syntactic clones. From 77,207 clone pairs of SemanticCloneBench/GPT-3 output, we created a benchmark with 37,149 genuine semantic clone pairs, 19,288 false semantic pairs, and 20,770 cross-language clones across four languages (Java, C, C#, and Python).
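To illustrate the syntactic filtering step, the following is a minimal sketch of how Type-1 and Type-2 clone pairs might be recognized and discarded. It is not NiCad and does not reproduce NiCad's parsing or thresholds; it is a crude token-level approximation, and the names `KEYWORDS`, `normalize`, `clone_type`, and `keep_candidates` are hypothetical helpers introduced only for this example. Type-1 is approximated as equality after removing comments and whitespace differences, and Type-2 as equality after additionally replacing identifiers and literals with placeholders.

```python
import re

# Assumption: a tiny keyword set for illustration; a real tool uses a full grammar.
KEYWORDS = {"int", "float", "void", "return", "if", "else", "for", "while",
            "def", "class", "public", "static", "string"}

# Identifiers, numbers, simple string literals, or any other single character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\"[^\"\n]*\"|'[^'\n]*'|\S")


def strip_comments(code):
    """Crudely remove /* */, // and # comments (ignores comment markers in strings)."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)
    return re.sub(r"//[^\n]*|#[^\n]*", " ", code)


def normalize(code, blind=False):
    """Tokenize code; with blind=True, mask identifiers and literals."""
    out = []
    for tok in TOKEN.findall(strip_comments(code)):
        is_name = tok[0].isalpha() or tok[0] == "_"
        is_literal = tok[0].isdigit() or tok[0] in "\"'"
        if blind and is_name and tok not in KEYWORDS:
            out.append("ID")
        elif blind and is_literal:
            out.append("LIT")
        else:
            out.append(tok)
    return out


def clone_type(a, b):
    """Return 'Type-1', 'Type-2', or None if the pair is not a syntactic clone."""
    if normalize(a) == normalize(b):
        return "Type-1"
    if normalize(a, blind=True) == normalize(b, blind=True):
        return "Type-2"
    return None


def keep_candidates(pairs):
    """Keep only pairs that are neither Type-1 nor Type-2 under this approximation."""
    return [(a, b) for a, b in pairs if clone_type(a, b) is None]
```

Under this approximation, a pair that differs only in formatting or in variable names would be dropped, while a pair that restructures the computation would be kept as a candidate for manual validation.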
