The widespread use of Large Language Models (LLMs) in society creates new information security challenges for developers, organizations, and end-users alike. LLMs are trained on large volumes of data, and their susceptibility to revealing the exact contents of their source training datasets poses security and safety risks. Although current alignment procedures restrict common risky behaviors, they do not completely prevent LLMs from leaking data. Prior work has demonstrated that LLMs can be tricked into divulging training data through out-of-distribution queries or adversarial techniques. In this paper, we demonstrate a simple, query-based decompositional method to extract news articles from two frontier LLMs, using instruction decomposition to incrementally extract fragments of training data. Out of 3,723 New York Times articles, we extract at least one verbatim sentence from 73 articles, and over 20% of the verbatim sentences from 6 articles. Our analysis shows that this method reliably induces the LLM to generate texts that are faithful reproductions of news articles, indicating that they likely originate from the source training dataset. The method is simple, generalizable, and requires no fine-tuning of or changes to the production model. If replicable at scale, this training data extraction methodology could expose new LLM security and safety vulnerabilities, including privacy risks and unauthorized data leaks. These implications require careful consideration from model development through end use.
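The decompositional idea described above can be sketched as a loop that repeatedly asks the model to continue a growing article prefix and checks each response against the ground-truth next sentence. This is a minimal, hypothetical illustration only: `query_model` is a stand-in for a real LLM API call, and the prompt wording and matching criterion are assumptions, not the paper's exact procedure.

```python
def decompose_and_extract(article_sentences, query_model):
    """Incrementally prompt for the next sentence of an article and
    collect responses that reproduce the source verbatim.

    `query_model` is a hypothetical callable (prompt -> str) standing in
    for an actual LLM API; `article_sentences` is the ground-truth text
    used only to grow the context and verify verbatim matches.
    """
    extracted = []
    context = ""
    for target in article_sentences:
        # Decomposed instruction: ask only for the single next sentence.
        prompt = ("Continue the following news article with its next "
                  "sentence:\n" + context)
        response = query_model(prompt)
        if response.strip() == target.strip():
            extracted.append(target)  # verbatim reproduction
        # Grow the context with the ground-truth sentence regardless,
        # so one miss does not derail later queries.
        context += " " + target
    return extracted
```

A per-article extraction rate (e.g. the fraction of verbatim sentences recovered, as reported in the abstract) then follows from `len(extracted) / len(article_sentences)`.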