The widespread adoption of large language models (LLMs) in programming has made it important to distinguish human-written code from model-generated code. This paper focuses on distinguishing code generated by ChatGPT from code authored by humans. Our investigation reveals differences between the two sources in programming style, technical level, and readability. Based on these differences, we develop a discriminative feature set and evaluate its effectiveness through ablation experiments. We also devise a dataset cleansing technique, based on temporal and spatial segmentation, to mitigate the shortage of datasets and to obtain high-quality, uncontaminated data. To further enrich the available data, we apply "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive dataset comprising 10,000 lines of ChatGPT-generated code. The main contributions of our research are: a discriminative feature set that yields high accuracy in the binary classification of ChatGPT-generated versus human-authored code; methods for generating large volumes of ChatGPT-generated code; and a dataset cleansing strategy that extracts clean, high-quality code datasets from open-source repositories, achieving high accuracy in code authorship attribution tasks.
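To make the idea of a discriminative feature set concrete, the following is a minimal sketch of how style and readability signals might be computed from a code snippet. The abstract does not enumerate the features actually used, so the function name `style_features` and the four features below are illustrative assumptions, not the authors' feature set.

```python
import re


def style_features(source: str) -> dict:
    """Compute a few illustrative style features for a Python-like snippet.

    These features are hypothetical examples chosen for this sketch; they are
    not the feature set described in the paper.
    """
    lines = source.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    # Lines whose first non-space character starts a '#' comment.
    comment_lines = [ln for ln in non_blank if ln.lstrip().startswith("#")]
    # Word-like tokens: identifiers, keywords, and words inside comments.
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)

    total = len(lines) or 1  # avoid division by zero on empty input
    return {
        "blank_line_ratio": (len(lines) - len(non_blank)) / total,
        "comment_line_ratio": len(comment_lines) / total,
        "mean_line_length": sum(len(ln) for ln in non_blank) / (len(non_blank) or 1),
        "mean_word_length": sum(len(w) for w in words) / (len(words) or 1),
    }


if __name__ == "__main__":
    sample = "# add\n\ndef f(a, b):\n    return a + b\n"
    print(style_features(sample))
```

Feature dictionaries of this kind, computed for both human-authored and ChatGPT-generated samples, could then be passed to any standard binary classifier, and an ablation study would drop one feature at a time to measure its contribution.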