This paper explores the risk that a large language model (LLM) trained for code generation on data mined from software repositories will generate content that discloses sensitive information included in its training data. We decompose this risk, known in the literature as ``unintended memorization,'' into two components: unintentional disclosure (where an LLM presents secrets to users without the user seeking them out) and malicious disclosure (where an LLM presents secrets to an attacker equipped with partial knowledge of the training data). We observe that while existing work mostly anticipates malicious disclosure, unintentional disclosure is also a concern. We describe methods to assess unintentional and malicious disclosure risks side-by-side across different releases of training datasets and models. We demonstrate these methods through an independent assessment of the Open Language Model (OLMo) family of models and its Dolma training datasets. Our results show, first, that changes in data source and processing are associated with substantial changes in unintended memorization risk; second, that the same set of operational changes may increase one risk while mitigating another; and, third, that the risk of disclosing sensitive information varies not only by prompt strategies or test datasets but also by the types of sensitive information. These contributions rely on data mining to enable the greater privacy and security testing required for the LLM training data supply chain.