This study assesses the performance of two advanced Large Language Models (LLMs), GPT-3.5 and GPT-4, on the task of code clone detection. The evaluation tests the models on code pairs spanning different clone types and levels of similarity, sourced from two datasets: BigCloneBench (human-made) and GPTCloneBench (LLM-generated). The findings indicate that GPT-4 consistently surpasses GPT-3.5 across all clone types. A correlation was observed between the GPT models' accuracy at identifying code clones and code similarity, with both models exhibiting low effectiveness in detecting the most complex Type-4 code clones. Additionally, the GPT models perform better at identifying code clones in LLM-generated code than in human-generated code, yet their overall accuracy remains modest. These results underscore the need for ongoing enhancements in LLM capabilities, particularly in recognizing code clones and in mitigating their bias towards self-generated code clones, an issue likely to grow as more software engineers adopt LLM-enabled code generation and refactoring tools.