Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinction between machine- and human-authored source code, raising concerns about the integrity and authenticity of software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated text, but they do not identify and harness the unique patterns of machine-generated code, so their effectiveness falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine- and human-authored code. Through a rigorous analysis of code attributes such as length, lexical diversity, and naturalness, we expose unique patterns inherent to each source. In particular, we find that the structural segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose a novel machine-generated code detection method called DetectCodeGPT, which improves DetectGPT by capturing the distinct structural patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experimental results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
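The perturbation idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `perturb_code` and `perturbation_discrepancy`, the insertion probability `frac`, the number of inserted characters, and the `log_prob` scoring callable are all assumptions made for illustration. The score is a simplified DetectGPT-style discrepancy, and the abstract does not specify the exact scoring function that DetectCodeGPT uses.

```python
import random
from statistics import mean
from typing import Callable, Optional


def perturb_code(code: str, frac: float = 0.2, seed: Optional[int] = None) -> str:
    """Return a copy of `code` with extra spaces and blank lines inserted.

    Illustrative assumptions: each gap between space-separated tokens gets
    1 to 3 extra spaces with probability `frac`, and each line is followed
    by 1 or 2 blank lines with probability `frac`. Leading indentation is
    left untouched.
    """
    rng = random.Random(seed)
    out = []
    for line in code.split("\n"):
        # Separate leading indentation from the rest of the line.
        indent = line[: len(line) - len(line.lstrip())]
        tokens = line[len(indent):].split(" ")
        rebuilt = tokens[0]
        for tok in tokens[1:]:
            sep = " "
            if rng.random() < frac:
                sep += " " * rng.randint(1, 3)  # inserted spaces
            rebuilt += sep + tok
        out.append(indent + rebuilt)
        if rng.random() < frac:
            out.extend([""] * rng.randint(1, 2))  # inserted newlines
    return "\n".join(out)


def perturbation_discrepancy(
    code: str,
    log_prob: Callable[[str], float],  # hypothetical scorer, e.g. a code LLM's log-likelihood
    n: int = 10,
    frac: float = 0.2,
    seed: int = 0,
) -> float:
    """Score of the original minus the mean score of `n` perturbed copies."""
    perturbed = [perturb_code(code, frac, seed + i) for i in range(n)]
    return log_prob(code) - mean([log_prob(p) for p in perturbed])
```

In this sketch no external LLM is needed to produce the perturbations. Only the scoring callable would involve a model, and a larger discrepancy would be read as evidence that the code is machine-generated.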