Large language models have gained significant popularity because of their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. The content of these datasets is memorised and can be extracted by attackers using data extraction attacks. In this work, we explore memorisation in large language models for code and compare the rate of memorisation with that of large language models trained on natural language. We adopt an existing benchmark for natural language and construct a benchmark for code by identifying samples that are vulnerable to attack. We run both benchmarks against a variety of models and perform a data extraction attack. We find that large language models for code are vulnerable to data extraction attacks, like their natural language counterparts. Of the training data identified as potentially extractable, we were able to extract 47% from a CodeGen-Mono-16B code completion model. We also observe that models memorise more as their parameter count grows, and that their pre-training data are also vulnerable to attack. We also find that data carriers are memorised at a higher rate than regular code or documentation, and that different model architectures memorise different samples. Data leakage has severe consequences, so we urge the research community to further investigate the extent of this phenomenon using a wider range of models and extraction techniques in order to build safeguards to mitigate this issue.
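To make the idea of an extraction rate concrete, the following is a minimal sketch, not the procedure used in this work. It assumes a prefix-prompted attack in which a sample counts as extracted when the model's continuation reproduces the held-out suffix verbatim; the abstract does not specify the matching criterion. The names `extraction_rate`, `toy_generate`, and `MEMORISED` are hypothetical, and the "model" is a lookup table standing in for a real code completion model.

```python
from typing import Callable, Sequence, Tuple


def extraction_rate(
    generate: Callable[[str], str],
    samples: Sequence[Tuple[str, str]],
) -> float:
    """Return the fraction of (prefix, suffix) samples whose suffix is reproduced verbatim.

    Assumption: exact-match on the start of the continuation is the success
    criterion. Other criteria (e.g. fuzzy matching) are equally plausible.
    """
    if not samples:
        return 0.0
    hits = 0
    for prefix, true_suffix in samples:
        continuation = generate(prefix)
        if continuation.startswith(true_suffix):
            hits += 1
    return hits / len(samples)


# Hypothetical stand-in for a model: it has "memorised" exactly one training sample.
MEMORISED = {"def add(a, b):": "\n    return a + b"}


def toy_generate(prefix: str) -> str:
    """Return the memorised suffix if the prefix was seen, else a generic body."""
    return MEMORISED.get(prefix, "\n    pass")


benchmark = [
    ("def add(a, b):", "\n    return a + b"),  # memorised by the toy model
    ("def sub(a, b):", "\n    return a - b"),  # not memorised
]
rate = extraction_rate(toy_generate, benchmark)
print(f"extraction rate: {rate:.0%}")
```

With a real model, `generate` would wrap a decoding call, and the benchmark would hold the samples identified as potentially extractable.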