The widespread adoption of large language models (LLMs) in programming has made it important to distinguish human-written code from model-generated code. This paper focuses on distinguishing code generated by ChatGPT from code written by humans. Our investigation reveals differences in programming style, technical level, and readability between the two sources. Based on these differences, we develop a discriminative feature set and evaluate its effectiveness through ablation experiments. To address the scarcity of datasets, we also devise a dataset cleansing technique based on temporal and spatial segmentation that yields high-quality, uncontaminated data. To further enrich the available data, we apply "code transformation," "feature transformation," and "feature customization" techniques, producing an extensive dataset of 10,000 lines of ChatGPT-generated code. The main contributions of this work are: (1) a discriminative feature set that achieves high accuracy in binary classification of ChatGPT-generated versus human-authored code; (2) methods for generating ChatGPT-generated code at scale; and (3) a dataset cleansing strategy that extracts clean, high-quality code datasets from open-source repositories, which yields high accuracy in code authorship attribution tasks.
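To make the idea of a style-based discriminative feature set concrete, the following is a minimal sketch of how surface-level style and readability signals might be computed from a code snippet. The function name `extract_style_features` and the four features it returns are hypothetical illustrations chosen for this sketch; the abstract does not list the actual features used, and this is not the authors' implementation. The comment detection assumes Python-style `#` comments.

```python
import re


def extract_style_features(source: str) -> dict:
    """Compute a few illustrative style features (hypothetical, not the paper's set)."""
    lines = source.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    # Assumption: whole-line comments start with '#', as in Python.
    comment_lines = [ln for ln in non_blank if ln.strip().startswith("#")]
    # Word-like tokens: identifiers, keywords, and words inside comments.
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)

    total = max(len(lines), 1)  # avoid division by zero on empty input
    return {
        "blank_line_ratio": (len(lines) - len(non_blank)) / total,
        "comment_line_ratio": len(comment_lines) / total,
        "avg_line_length": sum(len(ln) for ln in non_blank) / max(len(non_blank), 1),
        "avg_token_length": sum(len(t) for t in tokens) / max(len(tokens), 1),
    }


sample = "# add two numbers\ndef add(a, b):\n\n    return a + b\n"
features = extract_style_features(sample)
print(features)
```

Feature vectors of this kind could then be passed to any standard binary classifier, and an ablation experiment would retrain the classifier with individual features removed to measure how much each one contributes.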