Large language models have catalyzed an unprecedented wave of code generation. While achieving significant advances, they blur the distinction between machine- and human-authored source code, raising integrity and authenticity concerns for software artifacts. Previous methods such as DetectGPT have proven effective at discerning machine-generated text, but they do not identify and harness the unique patterns of machine-generated code; their applicability therefore falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine- and human-authored code. Through a rigorous analysis of code attributes such as lexical diversity, conciseness, and naturalness, we expose the unique patterns inherent to each source. In particular, we observe that the syntactic segmentation of code is a critical factor in identifying its provenance. Based on these findings, we propose DetectCodeGPT, a novel method for detecting machine-generated code that improves upon DetectGPT by capturing the distinct stylized patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbation, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experimental results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
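The whitespace-based perturbation mentioned above can be sketched as follows. This is a minimal, hypothetical illustration of inserting spaces and newlines into a code snippet; the insertion ratios, positions, and function name are assumptions for illustration, not the paper's exact procedure.

```python
import random

def perturb_code(code, space_ratio=0.05, newline_ratio=0.05, rng=None):
    """Perturb a code snippet by randomly inserting spaces and newlines.

    Hypothetical sketch of a whitespace-insertion perturbation: only
    whitespace is added, so the non-whitespace content is preserved.
    """
    rng = rng or random.Random(0)
    chars = list(code)
    n_spaces = max(1, int(len(chars) * space_ratio))
    n_newlines = max(1, int(len(chars) * newline_ratio))
    # Insert spaces at random positions within the snippet.
    for _ in range(n_spaces):
        chars.insert(rng.randrange(len(chars) + 1), " ")
    # Insert newlines at random positions as well.
    for _ in range(n_newlines):
        chars.insert(rng.randrange(len(chars) + 1), "\n")
    return "".join(chars)
```

In a DetectGPT-style detector, many such perturbed variants would be scored under a language model and their average log-probability compared with that of the original snippet; generating perturbations with pure string operations, rather than an external LLM, is what makes this variant cheap to run.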