Large language models (LLMs) have catalyzed an unprecedented wave of code generation. Despite these advances, they blur the distinction between machine- and human-authored source code, raising concerns about the integrity and authenticity of software artifacts. Previous methods such as DetectGPT have proven effective at detecting machine-generated text, but they do not identify or exploit the patterns unique to machine-generated code, so their effectiveness drops when applied to code. In this paper, we carefully study the specific patterns that characterize machine- and human-authored code. Through a rigorous analysis of code attributes such as lexical diversity, conciseness, and naturalness, we expose unique patterns inherent to each source. In particular, we find that the syntactic segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose DetectCodeGPT, a novel method for detecting machine-generated code, which improves on DetectGPT by capturing the distinct stylized patterns of code. Unlike conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experimental results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
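To make the whitespace-perturbation idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how insertion positions are sampled or which scoring function is used, so `perturb_whitespace`, its `space_rate` and `newline_rate` parameters, and the `log_prob` callable are all hypothetical. The score is assumed to follow the DetectGPT-style perturbation discrepancy, i.e. the log-probability of the original code minus the mean log-probability of its perturbed copies.

```python
import random
from statistics import mean
from typing import Callable, Optional


def perturb_whitespace(code: str, space_rate: float = 0.2,
                       newline_rate: float = 0.2,
                       seed: Optional[int] = None) -> str:
    """Return a copy of `code` with extra spaces and blank lines inserted.

    Assumption: positions are sampled independently at fixed rates;
    the paper's actual sampling scheme may differ.
    """
    rng = random.Random(seed)
    out = []
    for line in code.split("\n"):
        indent_len = len(line) - len(line.lstrip(" \t"))
        indent, body = line[:indent_len], line[indent_len:]
        # Widen some existing gaps between tokens; indentation is left alone.
        # Note: spaces inside string literals are not protected in this sketch.
        body = "".join(
            ch + " " if ch == " " and rng.random() < space_rate else ch
            for ch in body
        )
        out.append(indent + body)
        # Occasionally follow the line with an empty one.
        if rng.random() < newline_rate:
            out.append("")
    return "\n".join(out)


def perturbation_discrepancy(code: str,
                             log_prob: Callable[[str], float],
                             n_perturbations: int = 20,
                             seed: int = 0) -> float:
    """DetectGPT-style score: log p(x) minus the mean log p of perturbed copies.

    `log_prob` is a hypothetical callable that returns the log-likelihood
    of a code string under some scoring language model.
    """
    rng = random.Random(seed)
    perturbed = [
        perturb_whitespace(code, seed=rng.randrange(2**32))
        for _ in range(n_perturbations)
    ]
    return log_prob(code) - mean(log_prob(p) for p in perturbed)
```

In practice `log_prob` would wrap a code LLM, and a larger discrepancy would be read as evidence that the snippet is machine-generated. Because the perturbation needs no second model, each perturbed copy costs only string operations, which is the efficiency argument the abstract makes.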