Traces of Memorisation in Large Language Models for Code

Large language models have gained significant popularity because of their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. The content of these datasets is memorised and can be extracted by attackers using data extraction attacks. In this work, we explore memorisation in large language models for code and compare their rate of memorisation with that of large language models trained on natural language. We adopt an existing benchmark for natural language and construct a benchmark for code by identifying samples that are vulnerable to attack. We run both benchmarks against a variety of models and perform a data extraction attack. We find that large language models for code are vulnerable to data extraction attacks, like their natural language counterparts. Of the training data identified as potentially extractable, we were able to extract 47% from a CodeGen-Mono-16B code completion model. We also observe that models memorise more as their parameter count grows, and that their pre-training data are also vulnerable to attack. We further find that data carriers are memorised at a higher rate than regular code or documentation, and that different model architectures memorise different samples. Data leakage can have severe consequences, so we urge the research community to investigate the extent of this phenomenon further, using a wider range of models and extraction techniques, in order to build safeguards that mitigate this issue.
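To make the notion of an extraction rate concrete, the following is a minimal sketch of how such a rate could be computed over a benchmark of (prefix, suffix) pairs. It is an illustration under stated assumptions, not the procedure used in this work: the exact-match criterion, the function `extraction_rate`, and the stand-in `toy_model` are all hypothetical, and a real evaluation would query an actual language model instead of a lookup table.

```python
from typing import Callable, Iterable, Tuple


def extraction_rate(
    samples: Iterable[Tuple[str, str]],
    complete: Callable[[str], str],
) -> float:
    """Return the fraction of samples whose true suffix is reproduced.

    Each sample is a (prefix, suffix) pair taken from training data.
    `complete` is any callable that maps a prefix to generated text.
    Assumption: a sample counts as extracted only if the generated text
    starts with the exact true suffix (other criteria are possible).
    """
    samples = list(samples)
    if not samples:
        return 0.0
    hits = 0
    for prefix, suffix in samples:
        generated = complete(prefix)
        # Compare only the first len(suffix) characters of the output.
        if generated[: len(suffix)] == suffix:
            hits += 1
    return hits / len(samples)


def toy_model(prefix: str) -> str:
    """Hypothetical stand-in for a model that memorised one sample."""
    memorised = {"def add(a, b):": "\n    return a + b"}
    return memorised.get(prefix, "")


benchmark = [
    ("def add(a, b):", "\n    return a + b"),
    ("def sub(a, b):", "\n    return a - b"),
]
rate = extraction_rate(benchmark, toy_model)
print(f"extraction rate: {rate:.0%}")
```

Here the toy model reproduces one of the two suffixes, so the sketch reports half of the benchmark as extracted. Replacing `toy_model` with a wrapper around a real code completion model would give a measurement of the kind summarised above, subject to the matching criterion chosen.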
