Code pre-trained language models (CPLMs) have received great attention because they benefit a variety of tasks that support software development and maintenance. However, CPLMs are trained on massive amounts of open-source code, which raises concerns about potential data infringement. This paper presents the first study of detecting unauthorized code use in CPLMs, i.e., the Code Membership Inference (CMI) task. We design a framework, Buzzer, that covers different settings of CMI. Buzzer employs several inference techniques, including distilling the target CPLM, ensemble inference, and unimodal and bimodal calibration. Extensive experiments show that Buzzer achieves CMI with high accuracy. Hence, Buzzer can serve as a CMI tool and help protect intellectual property rights.
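To make the CMI task concrete, the following is a minimal sketch of a generic calibrated membership test. It is not Buzzer's actual procedure, which the abstract does not detail. It assumes we can obtain a per-sample loss from the target model and from some reference model that did not see the sample. The function names, the reference-model idea, and the toy loss values are all illustrative assumptions.

```python
from statistics import mean


def calibrated_scores(target_losses, reference_losses):
    """Score each sample as (reference loss - target loss).

    Assumption: a sample the target model was trained on tends to have a
    lower target loss than an unrelated reference model would give it.
    """
    return [ref - tgt for tgt, ref in zip(target_losses, reference_losses)]


def fit_threshold(member_scores, nonmember_scores):
    """Pick the midpoint between the mean scores of known members and non-members."""
    return (mean(member_scores) + mean(nonmember_scores)) / 2


def infer_membership(scores, threshold):
    """Predict 'member' for every sample whose score exceeds the threshold."""
    return [score > threshold for score in scores]


# Toy, made-up losses for code samples with known membership (hypothetical data).
known_member_scores = calibrated_scores([0.4, 0.5], [1.2, 1.1])
known_nonmember_scores = calibrated_scores([1.0, 1.3], [1.1, 1.2])
threshold = fit_threshold(known_member_scores, known_nonmember_scores)

# Two unseen samples: the first has a much lower target loss than reference loss.
predictions = infer_membership(calibrated_scores([0.45, 1.15], [1.2, 1.2]), threshold)
```

In this sketch, the first unseen sample is flagged as a likely member and the second is not. A practical CMI tool would need a far more careful choice of score and threshold than this midpoint rule.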