Traces of Memorisation in Large Language Models for Code

Large language models have gained significant popularity because of their ability to generate human-like text and their potential applications in various fields, such as software engineering. Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. The content of these datasets is memorised and can be extracted by attackers through data extraction attacks. In this work, we explore memorisation in large language models for code and compare the rate of memorisation with that of large language models trained on natural language. We adopt an existing benchmark for natural language and construct a benchmark for code by identifying samples that are vulnerable to attack. We run both benchmarks against a variety of models and perform a data extraction attack. We find that large language models for code are vulnerable to data extraction attacks, like their natural language counterparts. From the training data identified as potentially extractable, we were able to extract 47% from a CodeGen-Mono-16B code completion model. We also observe that models memorise more as their parameter count grows, and that their pre-training data are also vulnerable to attack. We also find that data carriers are memorised at a higher rate than regular code or documentation, and that different model architectures memorise different samples. Data leakage has severe consequences, so we urge the research community to further investigate the extent of this phenomenon using a wider range of models and extraction techniques in order to build safeguards to mitigate this issue.
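
The extraction rate reported above can be illustrated with a minimal sketch. It assumes a prefix-prompted attack in which a sample counts as extracted only when the model's continuation matches the true suffix token for token; the abstract does not state the prefix and suffix lengths or the matching criterion, so those, along with every name below (`extraction_rate`, `make_toy_model`, `SAMPLES`), are hypothetical. The toy "model" is a lookup table standing in for a real code completion model.

```python
from typing import Callable, List, Sequence

# A "model" here is any callable that maps (prefix tokens, n) to n predicted tokens.
Generator = Callable[[List[int], int], List[int]]


def is_extracted(sample: Sequence[int], generate: Generator,
                 prefix_len: int, suffix_len: int) -> bool:
    """Prompt with the prefix and check for an exact match on the true suffix."""
    prefix = list(sample[:prefix_len])
    true_suffix = list(sample[prefix_len:prefix_len + suffix_len])
    return list(generate(prefix, suffix_len)) == true_suffix


def extraction_rate(samples: Sequence[Sequence[int]], generate: Generator,
                    prefix_len: int, suffix_len: int) -> float:
    """Fraction of benchmark samples whose suffix the model reproduces verbatim."""
    if not samples:
        return 0.0
    hits = sum(is_extracted(s, generate, prefix_len, suffix_len) for s in samples)
    return hits / len(samples)


def make_toy_model(memorised: Sequence[Sequence[int]], prefix_len: int) -> Generator:
    """Hypothetical stand-in for a model: replays memorised continuations only."""
    table = {tuple(s[:prefix_len]): list(s[prefix_len:]) for s in memorised}

    def generate(prefix: List[int], n: int) -> List[int]:
        # Unknown prefixes yield filler tokens, i.e. no memorised content.
        return table.get(tuple(prefix), [0] * n)[:n]

    return generate


# Four benchmark samples (token IDs); the toy model has memorised the first two.
SAMPLES = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20]]
toy_model = make_toy_model(SAMPLES[:2], prefix_len=2)
rate = extraction_rate(SAMPLES, toy_model, prefix_len=2, suffix_len=3)
```

With a real model, `generate` would wrap the model's decoding routine. An exact-match criterion is the strictest choice; a looser similarity threshold would yield higher rates for the same model.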
